Last updated on May 7, 2026. This conference program is tentative and subject to change.
Technical Program for Tuesday June 2, 2026
| |
| TuI1I Interactive Session, Hall C |
| Interactive Session 1 |
|
| |
| |
| 09:00-10:30, Paper TuI1I.1 |
| Real-Time Sit-To-Stand Phase Classification with a Mobile Assistive Robot from Close Proximity Utilizing 3D Visual Skeleton Recognition |
|
| Mahdi, Anas | University of Waterloo |
| Dong, Zonghao | Tohoku University |
| Lin, Jonathan Feng-Shun | University of Waterloo |
| Hu, Yue | University of Waterloo |
| Hirata, Yasuhisa | Tohoku University |
| Mombaur, Katja | Karlsruhe Institute of Technology |
Keywords: Physically Assistive Devices, Physical Human-Robot Interaction
Abstract: Sit-to-stand (STS) transfer is a fundamental but challenging movement that plays a vital role in older adults’ daily activities. The decline in muscular strength and coordination ability can result in difficulties performing STS and, therefore, the need for mobility assistance by humans or assistive devices. Robotic rollators are being developed to provide active mobility assistance to older adults, including STS assistance. In this paper, we consider the robotic walker SkyWalker, which can provide active STS assistance by moving the handles upwards and forward to bring the user to a standing configuration. In this context, it is crucial to monitor whether the user is performing the STS and adapt the rollator’s control accordingly. To achieve this, we utilized a standard vision-based method for estimating the human pose during the STS movement using Mediapipe pose tracking. Since estimating a user’s state from extreme proximity to the camera is challenging, we compared the pose identification results from Mediapipe to ground-truth data obtained from Vicon marker-based motion capture to assess the accuracy and reliability of the STS motion estimates. Fourteen kinematic features critical for accurate pose estimation were selected based on a literature review and the specific requirements of our robot’s STS method. By employing these features, we implemented a phase classification system that enables the SkyWalker to classify the user’s STS phase in real time. The selected kinematics from the vision-based human state estimation method and the trained classifier can furthermore be generalized to other types of motion support, including adaptive STS path planning and emergency stops for safety assurance during STS.
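The kinematic features described above are joint angles computed from 3D skeleton keypoints. As a rough illustration of that kind of feature (not the authors' implementation — the function names, thresholds, and phase labels below are hypothetical), a knee angle can be derived from three tracked landmarks and fed to a simple phase rule:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by 3D points a-b-c,
    e.g. hip-knee-ankle for the knee flexion angle."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def sts_phase(knee_deg, trunk_deg):
    """Toy rule-based phase labels (illustrative thresholds only)."""
    if knee_deg < 100 and trunk_deg < 60:
        return "sitting"
    if knee_deg < 160:
        return "rising"
    return "standing"

# A fully extended leg gives a knee angle of about 180 degrees.
hip, knee, ankle = [0, 2, 0], [0, 1, 0], [0, 0, 0]
print(joint_angle(hip, knee, ankle))
```

A real classifier would use many such features over time; this only shows the geometric core.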
|
| |
| 09:00-10:30, Paper TuI1I.2 |
| Manydepth2: Motion-Aware Self-Supervised Monocular Depth Estimation in Dynamic Scenes |
|
| Zhou, Kaichen | University of Oxford |
| Bian, Jiawang | University of Adelaide |
| Zheng, Jian-Qing | University of Oxford |
| Zhong, Jia-Xing | University of Oxford |
| Xie, Qian | University of Leeds |
| Markham, Andrew | Oxford University |
| Trigoni, Niki | University of Oxford |
Keywords: SLAM, Visual-Inertial SLAM, Visual Learning
Abstract: Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2, which achieves precise depth estimation for both dynamic objects and static backgrounds while maintaining computational efficiency. To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a pseudo-static reference frame. This frame is then utilized to build a motion-aware cost volume in collaboration with the vanilla target frame. Furthermore, to improve the accuracy and robustness of the network architecture, we propose an attention-based depth network that effectively integrates information from feature maps at different resolutions by incorporating both channel and non-local attention mechanisms. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset. The code can be found at https://github.com/kaichen-z/Manydepth2.
|
| |
| 09:00-10:30, Paper TuI1I.3 |
| FAST-LIEO: Fast and Real-Time LiDAR-Inertial-Event-Visual Odometry |
|
| Wang, Zirui | SUSTech |
| Ge, Yangtao | Southern University of Science and Technology |
| Dong, Kewei | Southern University of Science and Technology (SUSTech) |
| Chen, I-Ming | Nanyang Technological University |
| Wu, Jing | Nanyang Technological University |
Keywords: SLAM, Localization, Sensor Fusion
Abstract: Unlike a standard camera, which relies on exposure to produce output frame by frame, an event camera outputs an event only when the change in brightness intensity at a pixel exceeds a threshold, and the outputs of different pixels are independent of each other. Benefiting from this bio-inspired design, event cameras have the advantages of low latency and high dynamic range. Research on multi-sensor fusion with event cameras remains scarce so far. In this paper, we propose FAST-LIEO, a framework for fast and real-time LiDAR-inertial-event odometry. The framework tightly fuses LiDAR and event camera measurements without any feature extraction or matching. Besides, our system supports both LIEO and LIEVO (extended with RGB camera fusion). We design a novel EIO subsystem for LiDAR-event fusion. The EIO subsystem maintains a semi-dense event map and estimates the state by aligning the event representation to the map. The semi-dense event map is built from LiDAR points by utilizing the edge information and temporal information provided by event representations. In addition to testing our method on a public benchmark dataset, we also collected real-world data with our sensor suite and conducted experiments on this self-captured dataset. The experimental results show the high robustness and accuracy of our method in challenging conditions while maintaining real-time performance. To the best of our knowledge, FAST-LIEO is the first system that can tightly fuse LiDAR, IMU, event camera and standard camera measurements for simultaneous localization and mapping. The source code of FAST-LIEO and our dataset are available at: https://github.com/wsjpla/FAST-LIEO.
|
| |
| 09:00-10:30, Paper TuI1I.4 |
| Diversity-Aware Crowd Model for Robust Robot Navigation in Human Populated Environment |
|
| Wu, Jiaxu | The University of Tokyo |
| Wang, Yusheng | The University of Tokyo |
| Chen, Tong | The University of Tokyo |
| Jiang, Jun | The University of Tokyo |
| Wang, Yongdong | The University of Tokyo |
| An, Qi | The University of Tokyo |
| Yamashita, Atsushi | The University of Tokyo |
Keywords: Autonomous Vehicle Navigation, Human-Aware Motion Planning, Reinforcement Learning
Abstract: Robot navigation in human-populated environments poses challenges due to the diversity of human behaviors and the unpredictability of human paths. However, existing Reinforcement Learning (RL)-based methods often rely on simulators that lack sufficient diversity in human behavior, resulting in navigation policies that overfit specific human behavior and perform poorly in unseen environments. To address this, we propose a diversity-aware crowd model based on Reinforcement Learning, employing Constrained Variational Exploration (VE) with a Mutual Information (MI)-based auxiliary reward to capture fine-grained behavioral diversity. The proposed model leverages a Centralized Training Decentralized Execution (CTDE) paradigm, which ensures stable exploration under multi-agent settings. Using the proposed diversity-aware model for training, we obtain robust robot navigation policies capable of handling diverse unseen scenarios. Extensive simulation and real-world experiments demonstrate the superior performance of our approach in achieving diverse crowd behaviors and enhancing robot navigation robustness. These findings highlight the potential of our method to advance safe and efficient robot operations in complex dynamic environments.
|
| |
| 09:00-10:30, Paper TuI1I.5 |
| On Motion Blur and Deblurring in Visual Place Recognition |
|
| Ismagilov, Timur | University of Southampton |
| Ferrarini, Bruno | MyWay Srl |
| Milford, Michael J | Queensland University of Technology |
| Tuyen, Nguyen Tan Viet | University of Southampton |
| Ramchurn, Sarvapali | University of Southampton |
| Ehsan, Shoaib | University of Essex |
Keywords: Localization, Vision-Based Navigation, Data Sets for Robotic Vision
Abstract: Visual Place Recognition (VPR) in mobile robotics enables robots to localize themselves by recognizing previously visited locations using visual data. While the reliability of VPR methods has been extensively studied under conditions such as changes in illumination, season, weather and viewpoint, the impact of motion blur is relatively unexplored despite its relevance not only in rapid motion scenarios but also in low-light conditions where longer exposure times are necessary. Similarly, the role of image deblurring in enhancing VPR performance under motion blur has received limited attention so far. This paper bridges these gaps by introducing a new benchmark designed to evaluate VPR performance under the influence of motion blur and image deblurring. The benchmark includes three datasets that encompass a wide range of motion blur intensities, providing a comprehensive platform for analysis. Experimental results with several well-established VPR and image deblurring methods provide new insights into the effects of motion blur and the potential improvements achieved through deblurring. Building on these findings, the paper proposes adaptive deblurring strategies for VPR, designed to effectively manage motion blur in dynamic, real-world scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.6 |
| Far-Field Image-Based Traversability Mapping for a Priori Unknown Natural Environments |
|
| Fahnestock, Ethan | MIT |
| Fuentes, Erick | Massachusetts Institute of Technology |
| Prentice, Samuel | Massachusetts Institute of Technology |
| Vasilopoulos, Vasileios | Samsung Research America |
| Osteen, Philip | U.S. Army Research Laboratory |
| Howard, Thomas | University of Rochester |
| Roy, Nicholas | Massachusetts Institute of Technology |
Keywords: Vision-Based Navigation, Field Robots
Abstract: While navigating unknown environments, robots rely primarily on proximate features for guidance in decision making, such as depth information from lidar or stereo to build a costmap, or local semantic information from images. The limited range over which these features can be used may result in poor robot behavior when assumptions about the cost of the map beyond the range of proximate features misguide the robot. Integrating far-field image features that originate beyond these proximate features into the mapping pipeline has the promise of enabling more intelligent and aware navigation through unknown terrain. To navigate with far-field features, key challenges must be overcome. As far-field features are typically too distant to localize precisely, they are difficult to place in a map. Additionally, the large distance between the robot and these features makes connecting these features to their navigation implications more challenging. We propose FITAM, an approach that learns to use far-field features to predict costs to guide navigation through unknown environments from previous experience in a self-supervised manner. Unlike previous work, our approach does not rely on flat ground plane assumptions or range sensors to localize observations. We demonstrate the benefits of our approach through simulated trials and real-world deployment on a Clearpath Robotics Warthog navigating through a forest environment.
|
| |
| 09:00-10:30, Paper TuI1I.7 |
| Iterative Shaping of Multi-Particle Aggregates Based on Action Trees and VLM |
|
| Lee, Hoi-Yin | The Hong Kong Polytechnic University |
| Zhou, Peng | Great Bay University |
| Duan, Anqing | Mohamed Bin Zayed University of Artificial Intelligence |
| Yang, Chenguang | The Hong Kong Polytechnic University |
| Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Bimanual Manipulation, Manipulation Planning
Abstract: In this paper, we address the problem of manipulating multi-particle aggregates using a bimanual robotic system. Our approach enables the autonomous transport of dispersed particles through a series of shaping and pushing actions using robotically-controlled tools. Achieving this advanced manipulation capability presents two key challenges: high-level task planning and trajectory execution. For task planning, we leverage Vision Language Models (VLMs) to enable primitive actions such as tool affordance grasping and non-prehensile particle pushing. For trajectory execution, we represent the evolving particle aggregate's contour using truncated Fourier series, providing efficient parametrization of its closed shape. We adaptively compute trajectory waypoints based on group cohesion and the geometric centroid of the aggregate, accounting for its spatial distribution and collective motion. Through real-world experiments, we demonstrate the effectiveness of our methodology in actively shaping and manipulating multi-particle aggregates while maintaining high system cohesion.
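The truncated Fourier series mentioned above gives a compact parametrization of a closed 2D contour. A minimal sketch of the idea — the FFT-based fitting and the `fourier_contour` helper below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def fourier_contour(points, n_harmonics=5):
    """Fit a truncated Fourier series to a closed 2D contour.

    points: (N, 2) array of contour samples, assumed roughly uniformly
    spaced along the boundary. Returns a function c(t) mapping
    t in [0, 1) to an (x, y) point on the smoothed contour.
    """
    z = points[:, 0] + 1j * points[:, 1]           # complex representation
    coeffs = np.fft.fft(z) / len(z)                # full Fourier coefficients
    freqs = np.fft.fftfreq(len(z), d=1.0 / len(z)) # integer harmonic indices
    mask = np.abs(freqs) <= n_harmonics            # truncate high harmonics

    def c(t):
        phases = np.exp(2j * np.pi * freqs[mask] * t)
        val = np.sum(coeffs[mask] * phases)
        return np.array([val.real, val.imag])
    return c

# Samples of a unit circle are reproduced almost exactly,
# since a circle needs only the first harmonic.
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
c = fourier_contour(circle, n_harmonics=3)
print(np.round(c(0.0), 3))   # close to [1, 0]
```

The few retained coefficients make the closed shape cheap to store and differentiate, which is the appeal of this parametrization for tracking an evolving aggregate.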
|
| |
| 09:00-10:30, Paper TuI1I.8 |
| What Matters in Learning a Zero-Shot Sim-To-Real RL Policy for Quadrotor Control? A Comprehensive Study |
|
| Chen, Jiayu | Tsinghua University |
| Yu, Chao | Tsinghua University |
| Xie, Yuqing | Tsinghua University |
| Gao, Feng | Tsinghua University |
| Chen, Yinuo | Tsinghua University |
| Yu, Shu'ang | Tsinghua University |
| Tang, Wenhao | Tsinghua University |
| Ji, Shilong | Tsinghua University |
| Mu, Mo | Tsinghua University |
| Wu, Yi | Tsinghua University |
| Yang, Huazhong | Tsinghua University |
| Wang, Yu | Tsinghua University |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Aerial Systems: Applications
Abstract: Precise and agile flight maneuvers are essential for quadrotor applications, yet traditional control methods are limited by their reliance on flat trajectories or computationally intensive optimization. Reinforcement learning (RL)-based policies offer a promising alternative by directly mapping observations to actions, reducing dependency on system knowledge and actuation constraints. However, the sim-to-real gap remains a significant challenge, often causing instability in real-world deployments. In this work, we identify five key factors for learning robust RL-based control policies capable of zero-shot real-world deployment: (1) integrating velocity and rotation matrix into actor inputs, (2) incorporating time vector into critic inputs, (3) regularizing action differences for smoothness, (4) applying system identification with selective randomization, and (5) using large batch sizes during training. Based on these insights, we develop SimpleFlight, a PPO-based framework that integrates these techniques. Extensive experiments on the Crazyflie quadrotor demonstrate that SimpleFlight reduces trajectory tracking error by over 50% compared to state-of-the-art RL baselines. It excels in both smooth polynomial and challenging infeasible zigzag trajectories, particularly on small thrust-to-weight quadrotors, where baseline methods often fail. To enhance reproducibility and further research, we integrate SimpleFlight into the GPU-based Omnidrones simulator and provide open-source code and model checkpoints. For more details, visit our project website at https://sites.google.com/view/simpleflight/.
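Factor (3), regularizing action differences, is commonly implemented as a squared penalty on consecutive actions added to the reward. A generic sketch — the weight and exact functional form are illustrative, not the paper's term:

```python
import numpy as np

def smoothness_penalty(actions, weight=0.1):
    """Penalty on consecutive action differences, as often added to an RL
    reward to encourage smooth motor commands (illustrative form only)."""
    actions = np.asarray(actions, dtype=float)
    diffs = np.diff(actions, axis=0)                 # a_t - a_{t-1}
    return -weight * np.sum(diffs ** 2, axis=-1)     # one penalty per step

# A jittery command sequence is penalized far more than a smooth one.
smooth = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]
jittery = [[0.0, 0.0], [1.0, -1.0], [0.0, 1.0]]
print(smoothness_penalty(smooth).sum(), smoothness_penalty(jittery).sum())
```

On real motors such a term trades a little tracking accuracy for commands the actuators can actually follow, which is one reason it helps sim-to-real transfer.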
|
| |
| 09:00-10:30, Paper TuI1I.9 |
| A Hyperspectral Imaging Guided Robotic Grasping System |
|
| Sun, Zheng | The Chinese University of Hong Kong |
| Dong, Zhipeng | The Chinese University of Hong Kong |
| Wang, Shixiong | The Chinese University of Hong Kong |
| Chu, Zhongyi | Beihang University |
| Chen, Fei | T-Stone Robotics Institute, the Chinese University of Hong Kong |
Keywords: Perception for Grasping and Manipulation, Software-Hardware Integration for Robot Systems, Grasping
Abstract: Hyperspectral imaging is an advanced technique for precisely identifying and analyzing materials or objects. However, its integration with robotic grasping systems has so far remained largely unexplored due to deployment complexities and prohibitive costs. In this paper, we introduce a novel hyperspectral imaging-guided robotic grasping system. The system consists of PRISM (Polyhedral Reflective Imaging Scanning Mechanism) and the SpectralGrasp framework. PRISM is designed to enable high-precision, distortion-free hyperspectral imaging while simplifying system integration and reducing costs. SpectralGrasp generates robotic grasping strategies by effectively leveraging both the spatial and spectral information from hyperspectral images. The proposed system demonstrates substantial improvements in both textile recognition, compared to human performance, and sorting success rate, compared to RGB-based methods. Additionally, a series of comparative experiments further validates the effectiveness of our system. The study highlights the potential benefits of integrating hyperspectral imaging with robotic grasping systems, showcasing enhanced recognition and grasping capabilities in complex and dynamic environments. The project is available at: https://zainzh.github.io/PRISM.
|
| |
| 09:00-10:30, Paper TuI1I.10 |
| Zero-Shot Denoiser for Enhanced Acoustic Inspection: Blind Signal Separation and Text-Guided Audio Reconstruction |
|
| Shoda, Koki | The University of Tokyo |
| Louhi Kasahara, Jun Younes | The University of Tokyo |
| An, Qi | The University of Tokyo |
| Yamashita, Atsushi | The University of Tokyo |
Keywords: Robotics and Automation in Construction, Industrial Robots, Surveillance Robotic Systems
Abstract: Acoustic inspection is crucial for infrastructure maintenance, but its effectiveness is often hampered by environmental noise. Conventional denoising methods rely on prior knowledge or training data, limiting their practicability. This paper presents Zero-Shot Denoiser, a novel approach achieving noise reduction without pre-collected target sound samples or noise knowledge. Our method synergistically combines Blind Signal Separation (BSS) for unsupervised audio decomposition and Artifact-Resilient Attention (AR-Attention) for text-guided audio reconstruction. AR-Attention leverages pre-trained audio-language models and dual normalization to mitigate BSS artifacts and identify target sounds semantically. We introduce pseudo Signal-to-Noise Ratio, derived from the audio-language model, for automatic BSS hyperparameter optimization. In experiments using public datasets, our method, operating in a true zero-shot setting, achieved performance comparable to that of state-of-the-art supervised denoising methods, and experiments targeting hammering tests confirmed the effectiveness of our approach for real-world acoustic inspections. Our approach overcomes the limitations of data-dependent techniques and offers a versatile noise reduction solution for acoustic inspection and broader acoustic tasks.
|
| |
| 09:00-10:30, Paper TuI1I.11 |
| ExFMan: Rendering 3D Dynamic Humans with Hybrid Monocular Blurry Frames and Events |
|
| Chen, Kanghao | Hong Kong University of Science and Technology (Guangzhou) |
| Wang, Zeyu | The Hong Kong University of Science and Technology (Guangzhou) |
| Wang, Lin | Nanyang Technological University (NTU) |
Keywords: Deep Learning for Visual Perception, Visual Learning, Sensor Fusion
Abstract: Recent advances in neural rendering have enabled the 3D reconstruction of dynamic humans from monocular videos, with applications in robotics. However, it is still challenging to reconstruct clear humans from in-the-wild videos that contain motion blur, which causes shape and appearance inconsistencies, especially in blurry regions like hands and legs. In this paper, we propose ExFMan, the first neural rendering framework that unveils the possibility of rendering high-quality humans in rapid motion with a hybrid frame-based RGB and bio-inspired event camera. The "out-of-the-box" insight is to leverage the high temporal resolution of event data in a complementary manner and adaptively reweight the effect of losses for both RGB frames and events in local regions, according to the velocity of the rendered human. This significantly mitigates the inconsistency associated with motion blur in the RGB frames. Specifically, we first formulate a velocity field of the 3D body in the canonical space and render it to image space to identify the body parts with motion blur. We then propose two novel losses, i.e., a velocity-aware photometric loss and a velocity-relative event loss, to optimize the neural human for both modalities under the guidance of the estimated velocity. In addition, we incorporate novel pose regularization and alpha losses to facilitate pose continuity and clear boundaries. Extensive experiments on synthetic and real-world datasets demonstrate that ExFMan can reconstruct sharper and higher-quality humans than the compared baselines and state-of-the-art methods for diverse blurry subjects.
|
| |
| 09:00-10:30, Paper TuI1I.12 |
| Distributional Treatment of Real2Sim2Real for Object-Centric Agent Adaptation in Vision-Driven DLO Manipulation |
|
| Kamaras, Georgios | The University of Edinburgh |
| Ramamoorthy, Subramanian | The University of Edinburgh |
Keywords: Probabilistic Inference, Reinforcement Learning, Perception for Grasping and Manipulation
Abstract: We present an integrated (or end-to-end) framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) based on visual perception. Working with a parameterised set of DLOs, we use likelihood-free inference (LFI) to compute the posterior distributions for the physical parameters, using which we can approximately simulate the behaviour of each specific DLO. We use these posteriors for domain randomisation while training, in simulation, object-specific visuomotor policies (i.e., assuming only visual and proprioceptive sensing) for a DLO reaching task, using model-free reinforcement learning. We demonstrate the utility of this approach by deploying sim-trained DLO manipulation policies in the real world in a zero-shot manner, i.e., without any further fine-tuning. In this context, we evaluate the capacity of a prominent LFI method to perform fine classification over the parametric set of DLOs, using only visual and proprioceptive data obtained in a dynamic manipulation trajectory. We then study the implications of the resulting domain distributions in sim-based policy learning and real-world performance.
|
| |
| 09:00-10:30, Paper TuI1I.13 |
| MAD-BA: 3D LiDAR Bundle Adjustment -- from Uncertainty Modelling to Structure Optimization |
|
| Cwian, Krzysztof | Poznan University of Technology |
| Di Giammarino, Luca | Sapienza University of Rome |
| Ferrari, Simone | Sapienza University of Rome |
| Ciarfuglia, Thomas Alessandro | Sapienza University of Rome |
| Grisetti, Giorgio | Sapienza University of Rome |
| Skrzypczynski, Piotr | Poznan University of Technology |
Keywords: Mapping, Range Sensing, SLAM
Abstract: The joint optimization of sensor poses and 3D structure is fundamental for state estimation in robotics and related fields. Current LiDAR systems often prioritize pose optimization, with structure refinement either omitted or treated separately using implicit representations. This paper introduces a framework for simultaneous optimization of sensor poses and 3D map, represented as surfels. A generalized LiDAR uncertainty model is proposed to address less reliable measurements in varying scenarios. Experimental results on public datasets demonstrate improved performance over most comparable state-of-the-art methods. The system is provided as open-source software to support further research.
|
| |
| 09:00-10:30, Paper TuI1I.14 |
| Limiting Kinetic Energy through Control Barrier Functions: Analysis and Experimental Validation |
|
| Califano, Federico | University of Twente |
| Logmans, Daniël Dylan | Saxion University of Applied Sciences |
| Roozing, Wesley | University of Twente |
Keywords: Robot Safety, Safety in HRI, Compliance and Impedance Control
Abstract: In the context of safety-critical control, we propose and analyse the use of Control Barrier Functions (CBFs) to limit the kinetic energy of torque-controlled robots. The proposed scheme is able to modify a nominal control action in a minimally invasive manner to achieve the desired kinetic energy limit. We show how this safety condition is achieved by appropriately injecting damping in the underlying robot dynamics independently of the nominal controller structure. We present an extensive experimental validation of the approach on a 7-Degree of Freedom (DoF) Franka Emika Panda robot. The results demonstrate that this approach provides an effective, minimally invasive safety layer that is straightforward to implement and is robust in real experiments.
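The kinetic-energy limit above can be written as a barrier function h = E_max − E with E = ½q̇ᵀMq̇. Under the simplifying assumption that Ė = q̇ᵀτ (gravity and Coriolis power terms dropped, which the paper handles properly), the CBF condition q̇ᵀτ ≤ α(E_max − E) is a single linear constraint on the torque, and the minimally invasive filter is a closed-form half-space projection. A sketch with illustrative α and numbers — not the authors' controller:

```python
import numpy as np

def energy_cbf_filter(tau_nom, qdot, M, e_max, alpha=5.0):
    """Minimally invasive torque filter enforcing qdot @ tau <= alpha * (e_max - E).

    Simplified sketch: assuming the kinetic-energy rate is qdot @ tau,
    the CBF condition on h = e_max - E reduces to one linear constraint,
    solvable in closed form as a projection onto a half-space. When the
    constraint is active, the correction acts along qdot, i.e. it injects damping.
    """
    energy = 0.5 * qdot @ M @ qdot
    a, b = qdot, alpha * (e_max - energy)
    if a @ tau_nom <= b or not np.any(a):
        return tau_nom                                       # nominal action already safe
    return tau_nom - ((a @ tau_nom - b) / (a @ a)) * a       # damping injection

M = np.diag([2.0, 1.0])              # toy 2-DoF inertia matrix
qdot = np.array([1.0, 1.0])          # current joint velocities
tau_nom = np.array([10.0, 10.0])     # aggressive nominal torque
tau_safe = energy_cbf_filter(tau_nom, qdot, M, e_max=2.0)
print(tau_safe, qdot @ tau_safe)     # → [1.25 1.25] 2.5
```

When the nominal torque already satisfies the constraint, the filter is inactive, which is what "minimally invasive" means here.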
|
| |
| 09:00-10:30, Paper TuI1I.15 |
| Infinite-Horizon Value Function Approximation for Model Predictive Control |
|
| Jordana, Armand | New York University |
| Kleff, Sebastien | Inria Center at the University of Bordeaux |
| Haffemayer, Arthur | CTU, CIIRC |
| Ortiz-Haro, Joaquim | TU Berlin |
| Carpentier, Justin | INRIA |
| Mansard, Nicolas | CNRS |
| Righetti, Ludovic | New York University |
Keywords: Optimization and Optimal Control, Machine Learning for Robot Control, Motion and Path Planning
Abstract: Model Predictive Control has emerged as a popular tool for robots to generate complex motions. However, the real-time requirement has limited the use of hard constraints and large preview horizons, which are necessary to ensure safety and stability. In practice, practitioners have to carefully design cost functions that can imitate an infinite horizon formulation, which is tedious and often results in local minima. In this work, we study how to approximate the infinite horizon value function of constrained optimal control problems with neural networks using value iteration and trajectory optimization. Furthermore, we experimentally demonstrate how using this value function approximation as a terminal cost provides global stability to the model predictive controller. The approach is validated on two toy problems and a real-world scenario with online obstacle avoidance on an industrial manipulator where the value function is conditioned to the goal and obstacle.
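On a scalar toy problem the idea of an infinite-horizon terminal value can be made concrete: for the system x⁺ = x + u with stage cost x² + u², value iteration on the scalar Riccati recursion converges to V(x) = p·x² with p the golden ratio, and such a V could serve as the terminal cost of a short-horizon MPC. This is a textbook stand-in for the paper's learned neural value function, not its method:

```python
def riccati_value(iters=100):
    """Infinite-horizon value function V(x) = p * x**2 for the scalar system
    x_next = x + u with stage cost x**2 + u**2, obtained by value iteration:
    p <- 1 + p - p**2 / (1 + p) is the Bellman backup after minimizing over u."""
    p = 0.0
    for _ in range(iters):
        p = 1.0 + p - p * p / (1.0 + p)
    return p

p = riccati_value()
print(round(p, 4))   # → 1.618, the golden ratio (the known fixed point p² − p − 1 = 0)
```

A one-step MPC with terminal cost p·x² then reproduces the infinite-horizon LQR feedback, which is exactly the effect the paper seeks for constrained nonlinear problems.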
|
| |
| 09:00-10:30, Paper TuI1I.16 |
| Design and Optimization of a Samara-Inspired Lightweight Monocopter for Extended Endurance |
|
| Cai, Xinyu | Singapore University of Technology and Design |
| Zhong, Shangkun | Xi'an Automatic Flight Control Research Institute |
| Tan, Tee Meng | Singapore University of Technology & Design |
| Ang, Wei Jun | Singapore University of Technology & Design |
| Foong, Shaohui | Singapore University of Technology and Design |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: Small multirotors demonstrate significant potential due to their simple airframe and human-friendly operation. However, the reduced size results in substantially higher energy consumption, which severely limits their flight endurance and restricts their range of applications. Ornithopters, while offering better aerodynamic efficiency, experience energy losses due to the mechanical complexity required to generate reciprocating motion. In this work, inspired by the samara, we present a lightweight aircraft with an exceptionally simple design featuring a single actuator and a mono airfoil. To optimize the flight configuration for minimal power consumption, we employed a Surrogate optimization method that integrates spinning airfoil dynamics, motor-propeller efficiency, and hovering equilibrium. As a result, the proposed vehicle achieves position-controlled hovering flight for up to 26 minutes with a takeoff weight of only 32 grams. Its superior power efficiency is demonstrated by a high power loading of 9.1 grams per watt. Compared to state-of-the-art systems, the proposed design shows significant improvements in both flight endurance and power efficiency. The reliable and stable position-holding flight over an extended period further validates the effectiveness of the proposed methods and the practical applicability of the fabricated prototype.
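The quoted figures are self-consistent: a power loading of 9.1 g/W at 32 g takeoff weight implies about 3.5 W of hover power, and 26 minutes at that power corresponds to roughly 1.5 Wh of delivered energy. (The battery capacity is not stated in the abstract; the energy figure is only an implication of the other two numbers.)

```python
# Back-of-the-envelope check of the figures quoted in the abstract above.
mass_g = 32.0                 # takeoff weight
power_loading_g_per_w = 9.1   # grams supported per watt
endurance_min = 26.0          # position-controlled hover time

hover_power_w = mass_g / power_loading_g_per_w
energy_wh = hover_power_w * endurance_min / 60.0
print(round(hover_power_w, 2), round(energy_wh, 2))   # → 3.52 1.52
```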
|
| |
| 09:00-10:30, Paper TuI1I.17 |
| Continuous-Time Line-Of-Sight Constrained Trajectory Planning for 6-Degree of Freedom Systems |
|
| Hayner, Christopher | University of Washington |
| Carson, John | NASA Johnson Space Center |
| Acikmese, Behcet | University of Washington |
| Leung, Karen | University of Washington |
Keywords: Constrained Motion Planning, Optimization and Optimal Control, Aerial Systems: Perception and Autonomy
Abstract: Perception algorithms are ubiquitous in modern autonomy stacks, providing necessary environmental information to operate in the real world. Many of these algorithms depend on the visibility of keypoints, which must remain within the robot’s line-of-sight (LoS), for reliable operation. This paper tackles the challenge of maintaining LoS on such keypoints during robot movement. We propose a novel method that addresses these issues by ensuring applicability to various sensor footprints, adaptability to arbitrary nonlinear system dynamics, and constant enforcement of LoS throughout the robot's path. Our experiments show that the proposed approach achieves significantly reduced LoS violation and runtime compared to existing state-of-the-art methods in several representative and challenging scenarios.
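At any single instant, the LoS requirement reduces to a keypoint lying inside the sensor's footprint; the paper's contribution is enforcing this continuously along the trajectory for arbitrary nonlinear dynamics. The instantaneous geometric test for a conic footprint looks like this (an illustrative sketch only, not the paper's formulation):

```python
import numpy as np

def in_line_of_sight(p_keypoint, p_robot, boresight, half_fov_deg):
    """Check that a keypoint lies inside a conic sensor footprint:
    the angle between the sensor boresight and the direction to the
    keypoint must not exceed the half field-of-view."""
    d = np.asarray(p_keypoint, float) - np.asarray(p_robot, float)
    d = d / np.linalg.norm(d)
    b = np.asarray(boresight, float)
    b = b / np.linalg.norm(b)
    angle = np.degrees(np.arccos(np.clip(d @ b, -1.0, 1.0)))
    return angle <= half_fov_deg

# A keypoint straight ahead is visible; one behind the sensor is not.
print(in_line_of_sight([5, 0, 0], [0, 0, 0], [1, 0, 0], 30.0))   # → True
print(in_line_of_sight([-5, 1, 0], [0, 0, 0], [1, 0, 0], 30.0))  # → False
```

Checking this only at discrete waypoints can miss violations between them, which motivates the paper's continuous-time enforcement.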
|
| |
| 09:00-10:30, Paper TuI1I.18 |
| Tightly-Coupled LiDAR-IMU-Leg Odometry with Online Learned Leg Kinematics Incorporating Foot Tactile Information |
|
| Okawara, Taku | Information Technology and Human Factors |
| Koide, Kenji | National Institute of Advanced Industrial Science and Technology |
| Takanose, Aoki | National Institute of Advanced Industrial Science and Technology |
| Oishi, Shuji | National Institute of Advanced Industrial Science and Technology (AIST) |
| Yokozuka, Masashi | Nat. Inst. of Advanced Industrial Science and Technology |
| Uno, Kentaro | Tohoku University |
| Yoshida, Kazuya | Tohoku University |
|
|
| |
| 09:00-10:30, Paper TuI1I.19 |
| SAGA-SLAM: Scale-Adaptive 3D Gaussian Splatting for Visual SLAM |
|
| Park, Kun | Seoul National University |
| Seo, Seung-Woo | Seoul National University |
Keywords: SLAM, RGB-D Perception
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful technique for representing 3D scenes. Its superior high-fidelity rendering quality and speed have driven its rapid adoption in many applications. Among them, Visual Simultaneous Localization and Mapping (VSLAM) is the most prominent application, as it requires real-time simultaneous mapping and position tracking of navigating objects. However, from our comprehensive study, we observed a fundamental hurdle in directly applying the current 3DGS technique to VSLAM, which we define as the scale adaptation problem. The scale adaptation problem refers to the inability of existing 3DGS-based SLAM methods to address varying scales, specifically the extent of camera pose difference from the perspective of tracking, and environmental size in terms of mapping and the addition of new 3D Gaussians. To overcome this limitation, we propose SAGA-SLAM, the first scale-adaptive RGB-D dense SLAM framework based on 3DGS. We optimize the tracking and mapping stages robustly over various scales by utilizing the Polyak step size and momentum. Additionally, we present a Gaussian fission method to address the scale problem during the addition of 3D Gaussians. Experiments show that our method achieves state-of-the-art results robustly on both large- and small-scale datasets such as KITTI, Replica, and TUM-RGBD. By adapting without the need for hyperparameter tuning, our method demonstrates both superior performance and practical applicability.
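The Polyak step size invoked above sets the learning rate from the current objective gap, η_t = (f(x_t) − f*)/‖∇f(x_t)‖², so the step adapts to the local scale of the objective without hand-tuned rates. A generic sketch on a toy quadratic — not the SAGA-SLAM update itself, and the momentum part is omitted:

```python
import numpy as np

def polyak_gd(grad_f, f, x0, f_star=0.0, iters=50):
    """Gradient descent with the Polyak step size
    eta_t = (f(x_t) - f_star) / ||grad f(x_t)||**2,
    which adapts to the objective's scale without a tuned learning rate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        gnorm2 = g @ g
        if gnorm2 < 1e-18:
            break                      # already at a stationary point
        x = x - (f(x) - f_star) / gnorm2 * g
    return x

# Anisotropic quadratic: the same code converges without any tuning.
A = np.diag([2.0, 1.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_opt = polyak_gd(grad, f, x0=[1.0, 1.0])
print(np.round(x_opt, 6))   # close to the minimizer [0, 0]
```

The catch is that f* must be known (zero for a residual-style objective), which is what makes the step size practical in least-squares-like SLAM losses.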
|
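The Polyak step size the abstract credits for scale robustness can be sketched in a few lines. This is a minimal illustration, not the paper's optimizer: it assumes the optimal loss value f* is known (zero for these toy losses) and omits the momentum term for brevity. The same tuning-free update handles two quadratics whose curvatures differ by eight orders of magnitude.

```python
import numpy as np

# Minimal sketch of the Polyak step size: the step length is derived
# from the current suboptimality f(x) - f_star, so no learning rate has
# to be tuned per scale. f_star is assumed known here.
def polyak_minimize(f, grad, x0, f_star=0.0, steps=60):
    x = float(x0)
    for _ in range(steps):
        g = grad(x)
        if g * g < 1e-300:                     # stationary point reached
            break
        x -= (f(x) - f_star) / (g * g) * g     # Polyak step size
    return x

# Two quadratics with curvatures 1e4 and 1e-4; the identical update
# converges on both without any per-problem tuning.
for lam in (1e4, 1e-4):
    f = lambda x, lam=lam: 0.5 * lam * x * x
    grad = lambda x, lam=lam: lam * x
    x_end = polyak_minimize(f, grad, x0=1.0)
    print(lam, f(x_end))
```

On a quadratic this step reduces to exactly half the distance to the optimum per iteration, regardless of the curvature, which is the scale-adaptation property in miniature.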
| |
| 09:00-10:30, Paper TuI1I.20 | Add to My Program |
| An Enhanced Soft Growing Robot with Mixed-Layer Jamming for Superior Load Capacity and Improved Mobility |
|
| Li, Zheyu | Harbin Institute of Technology |
| Sun, Kui | Harbin Institute of Technology |
| Li, XueAi | Harbin Institute of Technology |
| Zou, Yanjiang | Harbin Institute of Technology |
| Liu, Hong | Harbin Institute of Technology |
Keywords: Soft Robot Materials and Design, Biologically-Inspired Robots, Mechanism Design
Abstract: Soft robots have gained widespread attention due to their lightweight nature and inherent safety. Among them, soft growing robots (SGRs) are inspired by the growth mechanism of vines, achieving movement through tip eversion. However, their load-bearing capacity remains a significant challenge due to material limitations. The stiffness modulation approach based on layer jamming is constrained in high-curvature tip regions, preventing it from fully exhibiting its potential in unstructured environments. In this letter, motivated by enhancing the load-bearing capacity of SGRs and optimizing their tip motion performance, we propose a novel mixed-layer soft growing robot (MLSGR) and introduce an innovative modification to the conventional layer jamming fabrication method. Furthermore, we establish a more accurate kinematics model and, for the first time, propose a statics model to characterize tip behavior. Experimental results demonstrate that, compared to previous work, the MLSGR offers more than double the load capacity, a 9% reduction in energy consumption and mechanical resistance for tip growth, a 17% improvement in tip retraction capability, and a 41.2% enhancement in kinematic model prediction accuracy (MAPE).
|
| |
| 09:00-10:30, Paper TuI1I.21 | Add to My Program |
| CLEVER: Stream-Based Active Learning for Robust Semantic Perception from Human Instructions |
|
| Lee, Jongseok | German Aerospace Center (DLR) |
| Birr, Timo | Karlsruhe Institute of Technology (KIT) |
| Triebel, Rudolph | German Aerospace Center (DLR) |
| Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Probability and Statistical Methods
Abstract: We propose CLEVER, an active learning system for robust semantic perception with Deep Neural Networks (DNNs). For data arriving in streams, our system seeks human support when encountering failures and adapts DNNs online based on human instructions. In this way, CLEVER can eventually accomplish the given semantic perception tasks. Our main contribution is the design of a system that meets several desiderata of realizing the aforementioned capabilities. The key enabler herein is our Bayesian formulation that encodes domain knowledge through priors. Empirically, we not only motivate CLEVER's design but further demonstrate its capabilities with a user validation study as well as experiments on humanoid and deformable objects. To our knowledge, we are the first to realize stream-based active learning on a real robot, providing evidence that the robustness of the DNN-based semantic perception can be improved in practice. The project website can be accessed at https://sites.google.com/view/thecleversystem.
|
| |
| 09:00-10:30, Paper TuI1I.22 | Add to My Program |
| STLCG++: A Masking Approach for Differentiable Signal Temporal Logic Specification |
|
| Kapoor, Parv | Carnegie Mellon University |
| Mizuta, Kazuki | University of Washington |
| Kang, Eunsuk | Carnegie Mellon University |
| Leung, Karen | University of Washington |
Keywords: Formal Methods in Robotics and Automation, Deep Learning Methods, Software Tools for Robot Programming
Abstract: Signal Temporal Logic (STL) offers a concise yet expressive framework for specifying and reasoning about spatio-temporal behaviors of robotic systems. Attractively, STL admits the notion of robustness, the degree to which an input signal satisfies or violates an STL specification, thus providing a nuanced evaluation of system performance. Notably, the differentiability of STL robustness enables direct integration to robotics workflows that rely on gradient-based optimization, such as trajectory optimization and deep learning. However, existing approaches to evaluating and differentiating STL robustness rely on recurrent computations, which become inefficient with longer sequences, limiting their use in time-sensitive applications. In this paper, we present STLCG++, a masking-based approach that parallelizes STL robustness evaluation and backpropagation across timesteps, achieving significant speed-ups compared to a recurrent approach. We also introduce a smoothing technique to enable differentiation of time interval bounds, expanding STL’s applicability in gradient-based optimization tasks over spatial and temporal variables. Finally, we demonstrate STLCG++’s benefits through three robotics use cases and provide JAX and PyTorch libraries for seamless integration into modern robotics workflows.
|
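The masking idea can be illustrated on a single temporal operator. This toy sketch (not the STLCG++ API; the function names and window convention are assumptions) evaluates the robustness of "always within the next w steps, x > c" at every timestep with one masked minimum over a (T, T) window mask, rather than a recurrent backward scan, and checks it against a recurrent reference.

```python
import numpy as np

# Masked, all-timesteps-at-once evaluation of "always_[0, w] (x > c)".
# Window entries outside the signal are padded with +inf so they never
# win the min.
def always_robustness_masked(x, w, c):
    margins = np.asarray(x, float) - c       # pointwise robustness of x > c
    T = len(margins)
    idx = np.arange(T)
    in_window = (idx[None, :] >= idx[:, None]) & (idx[None, :] <= idx[:, None] + w)
    padded = np.where(in_window, margins[None, :], np.inf)
    return padded.min(axis=1)                # min over each time window

def always_robustness_recurrent(x, w, c):    # reference implementation
    T = len(x)
    return np.array([min(x[t:min(t + w + 1, T)]) - c for t in range(T)])

x = [1.0, 2.0, 0.5, 3.0, 4.0]
assert np.allclose(always_robustness_masked(x, 2, 1.0),
                   always_robustness_recurrent(x, 2, 1.0))
```

Because the masked version is a single batched reduction, it parallelizes across timesteps on accelerators, which is the source of the speed-ups the abstract reports; replacing the hard min with a soft min would restore smooth differentiability.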
| |
| 09:00-10:30, Paper TuI1I.23 | Add to My Program |
| Deep Reinforcement Learning-Based Motion Planning and PDE Control for Flexible Manipulators |
|
| Barjini, Amir Hossein | Tampere University |
| Alizadeh Kolagar, Seyed Adel | Tampere University |
| Yaqubi, Sadeq | Tampere University |
| Mattila, Jouni | Tampere University |
Keywords: Flexible Robotics, Motion Control, Motion and Path Planning
Abstract: This article presents a motion planning and control framework for flexible robotic manipulators, integrating deep reinforcement learning (DRL) with a nonlinear partial differential equation (PDE) controller. Unlike conventional approaches that focus solely on control, we demonstrate that the desired trajectory significantly influences endpoint vibrations. To address this, a DRL motion planner, trained using the soft actor-critic (SAC) algorithm, generates optimized trajectories that inherently minimize vibrations. The PDE nonlinear controller then computes the required torques to track the planned trajectory while ensuring closed-loop stability using Lyapunov analysis. The proposed methodology is validated through both simulations and real-world experiments, demonstrating superior vibration suppression and tracking accuracy compared to traditional methods. The results underscore the potential of combining learning-based motion planning with model-based control for enhancing the precision and stability of flexible robotic manipulators.
|
| |
| 09:00-10:30, Paper TuI1I.24 | Add to My Program |
| Learning Fast, Tool-Aware Collision Avoidance for Collaborative Robots |
|
| Lee, Joonho | Neuromeka |
| Kim, Yunho | Neuromeka |
| Kim, Seok Joon | Georgia Institute of Technology |
| Nguyen, Van Quan | Neuromeka |
| Heo, Young Jin | POSTECH |
Keywords: Collision Avoidance, Reinforcement Learning, Engineering for Robotic Systems
Abstract: Ensuring safe and efficient operation of collaborative robots in human environments is challenging, especially in dynamic settings where both obstacle motion and tasks change over time. Current robot controllers typically assume full visibility and fixed tools, which can lead to collisions or overly conservative behavior. In our work, we introduce a tool-aware collision avoidance system that adjusts in real time to different tool sizes and modes of tool-environment interaction. Using a learned perception model, our system filters out robot and tool components from the point cloud, reasons about occluded areas, and predicts collisions under partial observability. We then use a control policy trained via constrained reinforcement learning to produce smooth avoidance maneuvers in under 10 milliseconds. In simulated and real-world tests, our approach outperforms traditional approaches (APF, MPPI) in dynamic environments, while maintaining sub-millimeter accuracy. Moreover, our system operates with approximately 60% lower computational overhead compared to a state-of-the-art GPU-based planner. Our approach provides modular, efficient, and effective collision avoidance for robots operating in dynamic environments. We integrate our method into a collaborative robot application and demonstrate its practical use for safe and responsive operation.
|
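The self-filtering step (removing the robot's own tool from the point cloud so it is not treated as an obstacle) can be sketched geometrically as a capsule test around the tool axis. The geometry, names, and radius below are illustrative assumptions, not the paper's learned perception model.

```python
import numpy as np

# Drop point-cloud points that fall inside a capsule of given radius
# around the tool segment from a to b; everything else is kept as a
# potential obstacle.
def filter_tool_points(points, a, b, radius):
    P = np.asarray(points, float)
    ab = b - a
    t = np.clip((P - a) @ ab / (ab @ ab), 0.0, 1.0)  # project onto segment
    closest = a + t[:, None] * ab
    dist = np.linalg.norm(P - closest, axis=1)
    return P[dist > radius]

a, b = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.3])  # tool axis
pts = np.array([[0.0, 0.01, 0.15],    # on the tool: filtered out
                [0.5, 0.00, 0.10],    # genuine obstacle: kept
                [0.0, 0.00, 0.40]])   # beyond the tip: kept
kept = filter_tool_points(pts, a, b, radius=0.05)
```

Making the radius a runtime parameter is one simple way the "tool-aware" adjustment to different tool sizes could be realized.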
| |
| 09:00-10:30, Paper TuI1I.25 | Add to My Program |
| IMH-MOT: Interactive Multi-Hierarchical Image and Point Cloud Fusion for Multi-Object Tracking |
|
| Qin, Wenyuan | Beihang University |
| Zhou, Zhiyan | Beihang University |
| Luo, Jiong | Beihang University |
| Pan, Chengwei | Beihang University |
| Xu, Hao | Beihang University |
| Dong, Xiwang | Beihang University |
| Wang, Danwei | Nanyang Technological University |
Keywords: Visual Tracking, Sensor Fusion, Computer Vision for Automation
Abstract: Multi-object tracking (MOT) plays a critical role in applications such as autonomous driving and surveillance. Camera-based approaches offer rich texture features for object association, while LiDAR-based methods provide accurate geometric information for spatial reasoning. Although each modality addresses different challenges, their intrinsic discrepancies hinder effective cross-modal fusion and unified representation learning. To overcome these limitations, we propose IMH-MOT, an interactive multi-hierarchical MOT framework comprising three key modules. The Multi-modality Alignment Module (MMAM) enhances spatial representations by sampling and clustering instance-level point clouds. Motion cues from different modalities are integrated by the Multi-modality Motion Estimation Module (MMEM) to build a unified motion model. To mitigate the impact of occlusion on single-frame appearance features, the Long-term Appearance Module (LAM) captures temporal appearance consistency by constructing a long-term appearance embedding. Guided by modality-aware cues from MMAM, MMEM generates reliable spatial representations, while LAM encodes robust long-term appearance features. These components are jointly integrated through a Multi-hierarchical Data Association (MHDA) strategy, enabling stable and accurate tracking. Extensive experiments on the KITTI MOT benchmark demonstrate the effectiveness of our framework, achieving 80.90% HOTA, 89.73% MOTA, and 470 IDSW, outperforming state-of-the-art methods in both standard and challenging scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.26 | Add to My Program |
| A Robust LiDAR-Inertial Multi Constraint-Based Localization for Agricultural Environments |
|
| Longani, Narayan | Chungbuk National University |
| Kim, Gon-Woo | Chungbuk National University |
Keywords: SLAM, Field Robots, Robotics and Automation in Agriculture and Forestry
Abstract: Accurate state estimation is essential for the reliable operation of an autonomous agricultural robot. The effectiveness of state estimation is influenced by a number of factors, such as sensor-fusion algorithms, the environment, and sensor quality. When the robot traverses large-scale scenarios, the distance travelled and high-speed mobility produce a drift in the estimation process that must be carefully considered. Moreover, time-varying sensor noise further degrades odometry accuracy; this is especially noticeable over long travel. This work addresses multi-constraint state estimation in large unstructured environments with uneven terrain, with a focus on agricultural applications. Using LiDAR-IMU fusion, our goal is to provide a reliable and accurate localization solution in complex environments such as agricultural fields, which are made even more challenging by uneven terrain and a lack of features. We propose a hybrid framework that combines factor graph-based optimization and adaptive Kalman filtering to address these challenges. Performance is evaluated on self-collected datasets from agricultural environments as well as on open-access datasets such as GRACO and KITTI.
|
| |
| 09:00-10:30, Paper TuI1I.27 | Add to My Program |
| Baseline Policy Adapting and Abstraction of Shared Autonomy for High-Level Robot Operations |
|
| Yousefi, Ehsan | McGill University |
| Chen, Mo | Simon Fraser University |
| Sharf, Inna | McGill University |
Keywords: Shared Autonomy, Cognitive Human-Robot Interaction, AI-Based Methods, Motion and Path Planning
Abstract: This paper presents a novel shared autonomy and baseline policy adapting framework for human-robot interactions in high-level context-aware robotic tasks. With a unique methodology that leverages hierarchies in decision-making as well as variational analysis of human policy, we propose a mathematical model of shared autonomy policy. The framework aims at interpretable high-level decision-making for efficient robot operation with human in the loop. We modeled the decision-making process using hierarchical Markov decision processes (MDPs) in an algorithm we called policy adapting, where the autonomous system policy is adapted, and hence, shaped by incorporating design variables contextual to the robot, human, task, and pre-training. By integrating deep reinforcement learning within a multi-agent hierarchical context, we present an end-to-end algorithm to train a baseline policy designed for shared autonomy. We showcase the effectiveness of our framework, and particularly the interplay between different design elements and human's skill level, in a pilot study with a human user in a simulated sequence of high-level pick-and-place tasks. The proposed framework advances the state-of-the-art in shared autonomy for robotic tasks, but can also be applied to other domains of autonomous operation.
|
| |
| 09:00-10:30, Paper TuI1I.28 | Add to My Program |
| Learning Maximal Safe Sets Using Hypernetworks for MPC-Based Local Trajectory Planning in Unknown Environments |
|
| Derajic, Bojan | Technical University of Berlin |
| Bouzidi, Mohamed-Khalil | Continental, FU Berlin |
| Bernhard, Sebastian | Aumovio |
| Hoenig, Wolfgang | TU Berlin |
Keywords: Collision Avoidance, Machine Learning for Robot Control, Robot Safety
Abstract: This paper presents a novel learning-based approach for online estimation of maximal safe sets for local trajectory planning in unknown static environments. The neural representation of a set is used as the terminal set constraint for a model predictive control (MPC) local planner, resulting in improved recursive feasibility and safety. To achieve real-time performance and desired generalization properties, we employ the idea of hypernetworks. We use the Hamilton-Jacobi (HJ) reachability analysis as the source of supervision during the training process, allowing us to consider general nonlinear dynamics and arbitrary constraints. The proposed method is extensively evaluated against relevant baselines in simulations for different environments and robot dynamics. The results show a success rate increase of up to 52% compared to the best baseline while maintaining comparable execution speed. Additionally, we deploy our proposed method, NTC-MPC, on a physical robot and demonstrate its ability to safely avoid obstacles in scenarios where the baselines fail.
|
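The hypernetwork idea the abstract relies on can be caricatured in a few lines: a small "hyper" map consumes an environment embedding and emits the weights of the safe-set network, so adapting to a new environment only requires a new embedding rather than retraining. All shapes, names, and the linear hyper map below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Hypothetical hypernetwork sketch: H maps an environment embedding z to
# the flattened weights of a small target net whose sign classifies a
# state as inside/outside the (terminal) safe set.
rng = np.random.default_rng(0)
D_Z, D_S, D_H = 8, 4, 16                 # embedding, state, hidden dims
n_params = D_S * D_H + D_H               # target net: hidden layer + head

H = 0.1 * rng.normal(size=(n_params, D_Z))   # the hypernetwork itself

def safe_score(z, s):
    theta = H @ z                        # emit target-network weights
    W1 = theta[:D_S * D_H].reshape(D_S, D_H)
    w2 = theta[D_S * D_H:]
    return float(np.tanh(s @ W1) @ w2)   # signed safety value

z = rng.normal(size=D_Z)                 # per-environment embedding
s = rng.normal(size=D_S)                 # query state
inside = safe_score(z, s) >= 0.0         # terminal-set membership test
```

The point of the construction is speed at deployment: evaluating H @ z is one matrix product, so the safe-set representation can be regenerated online as the environment estimate changes.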
| |
| 09:00-10:30, Paper TuI1I.29 | Add to My Program |
| EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction Via Polynomial Representations |
|
| Yao, Yue | Freie Universitaet Berlin, Aumovio SE |
| Bouzidi, Mohamed-Khalil | Aumovio SE, Freie Universitaet Berlin |
| Goehring, Daniel | Freie Universitaet Berlin |
| Reichardt, Joerg | Aumovio SE |
Keywords: Autonomous Agents, Deep Learning Methods, Performance Evaluation and Benchmarking
Abstract: As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution for plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of plausibility, diversity, and accuracy of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly plausible and diverse traffic scene predictions with comparable accuracy. We further evaluate model generalization in an out-of-distribution (OoD) test setting using the Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints will be made publicly available to ensure reproducibility.
|
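The polynomial representation named in the title can be illustrated simply: a dense trajectory is compressed to a handful of coefficients per axis, which is where the parameter efficiency comes from. The horizon, degree, and toy track below are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

# A T-step 2D track is compressed to (deg + 1) polynomial coefficients
# per axis and reconstructed exactly for smooth motion.
T, deg = 60, 5
t = np.linspace(0.0, 1.0, T)
xy = np.stack([3.0 * t + 0.5 * t ** 2,     # x(t): smooth agent motion
               1.0 - 2.0 * t ** 3], axis=1)

# 120 sample values become 12 coefficients.
coeffs = np.stack([np.polyfit(t, xy[:, k], deg) for k in range(2)], axis=1)
recon = np.stack([np.polyval(coeffs[:, k], t) for k in range(2)], axis=1)
err = float(np.abs(recon - xy).max())      # reconstruction error
```

A generative model that denoises in this coefficient space operates on far fewer variables than one that denoises every timestep, which is one plausible reading of the efficiency claim.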
| |
| 09:00-10:30, Paper TuI1I.30 | Add to My Program |
| Learning Behaviours for Decentralised Multi-Robot Collision Avoidance in Constrained Pathways Using Curriculum Reinforcement Learning |
|
| Komol, Md Mostafizur Rahman | Queensland University of Technology |
| Tidd, Brendan | CSIRO |
| Browne, Will | Queensland University of Technology |
| Maire, Frederic | Queensland University of Technology |
| Williams, Jason | CSIRO |
| Howard, David | CSIRO |
Keywords: Field Robots, Search and Rescue Robots, Reinforcement Learning
Abstract: Mobile robot teams often require decentralised autonomous navigation through narrow gaps in limited communication environments (e.g., underground search-and-rescue operations). Existing navigation approaches exhibit suboptimal performance for avoiding multi-robot collisions in such bottlenecks due to an inability to address the dynamic nature of the robots. Initial work utilising reinforcement learning has demonstrated success in navigating a single robot through narrow gaps. However, when training agents to produce give-way behaviour for navigating through constrained gaps, end-to-end reinforcement learning using simple rewards suffers from slow convergence due to the increased search space of viable policies. This paper introduces a novel curriculum reinforcement learning framework, incorporating a multi-robot bootstrap curriculum with preprogrammed behaviour to guide initial policy formation, subsequently refined by a gap curriculum that progressively reduces training complexity towards an optimal policy. This framework learns multi-robot interaction behaviours, which are impractical to program manually. Our model achieves a 99% success rate in give-way behaviour generation without inter-agent communications in high-fidelity simulations. The success rate reduces to 73% in simulations incorporating noisy sensors, and 60% in field-robot tests, substantiating our model's practical viability despite sensor noise and real-world uncertainties. By contrast, simple benchmark methods lack efficiency even in basic interaction behaviours.
|
| |
| 09:00-10:30, Paper TuI1I.31 | Add to My Program |
| State Estimation and Environment Recognition for Articulated Structures Via Proximity Sensors Distributed Over the Whole Body |
|
| Iwao, Kengo | Kyushu University |
| Arita, Hikaru | Kyushu University |
| Tahara, Kenji | Kyushu University |
Keywords: SLAM, Sensor Fusion, Modeling, Control, and Learning for Soft Robots
Abstract: For robots with low rigidity, determining the robot's state based solely on kinematics is challenging. This is particularly crucial for a robot whose entire body is in contact with the environment, as accurate state estimation is essential for environmental interaction. We propose a method for simultaneous articulated robot posture estimation and environmental mapping by integrating data from proximity sensors distributed over the whole body. Our method extends the discrete-time model, typically used for state estimation, to the spatial direction of the articulated structure. The simulations demonstrate that this approach significantly reduces estimation errors.
|
| |
| 09:00-10:30, Paper TuI1I.32 | Add to My Program |
| PAPL-SLAM: Principal Axis-Anchored Monocular Point-Line SLAM |
|
| Li, Guanghao | Fudan University |
| Cao, Yu | Fudan University |
| Chen, Qi | Fudan University |
| Gao, Xin | Fudan University |
| Yang, Yifan | Fudan University |
| Pu, Jian | Fudan University |
Keywords: Localization, SLAM, Vision-Based Navigation
Abstract: In point-line Simultaneous Localization and Mapping (SLAM) systems, the utilization of line structural information and the optimization of lines are two significant problems. The former is usually addressed through structural regularities, while the latter typically involves using minimal parameter representations of lines in optimization. However, separating these two steps leads to the loss of constraint information to each other. To solve both problems, we anchor lines with similar directions to one principal axis. Precisely, our method models the line-axis probabilistic data association using the Expectation Maximization (EM) algorithm and provides the pipelines for axis creation, updating, and optimization, enhancing the system's robustness and avoiding mismatch. Our system can optimize n co-directional lines with only n+2 parameters, significantly reducing the number of line parameters to be optimized and enabling rapid mapping and tracking. Additionally, considering that most real-world scenes conform to the Atlanta World (AW) hypothesis, we provide an AW constraint by detecting structural lines based on vertical priors and vanishing points. Experimental results and ablation studies on various indoor and outdoor datasets demonstrate the effectiveness of our system.
|
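One way to picture the EM-based line-axis association is the following simplified sketch (an assumed von-Mises-style weighting, not the paper's exact model): the E-step soft-assigns line directions to principal axes by angular agreement, and the M-step re-estimates each axis as the dominant eigenvector of its responsibility-weighted scatter.

```python
import numpy as np

# EM-style line-to-axis association. kappa is an assumed concentration
# parameter; squaring the cosine makes assignment invariant to a line
# direction's sign.
def em_axes(dirs, axes, kappa=20.0, iters=10):
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    axes = axes / np.linalg.norm(axes, axis=1, keepdims=True)
    for _ in range(iters):
        logits = kappa * (dirs @ axes.T) ** 2          # E-step affinities
        r = np.exp(logits - logits.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)              # responsibilities
        for k in range(axes.shape[0]):                 # M-step per axis
            S = (dirs * r[:, k:k + 1]).T @ dirs        # weighted scatter
            _, V = np.linalg.eigh(S)
            axes[k] = V[:, -1]                         # top eigenvector
    return axes, r

lines = np.array([[1.0, 0.05, 0.0], [1.0, -0.04, 0.02],   # near x-axis
                  [0.03, 1.0, 0.0], [0.0, 1.0, 0.05]])    # near y-axis
init = np.array([[0.9, 0.1, 0.0], [0.1, 0.95, 0.0]])      # rough guesses
axes, resp = em_axes(lines, init)
```

The soft responsibilities are what make the association probabilistic: a line whose direction is ambiguous contributes partially to several axes instead of being hard-assigned and potentially mismatched.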
| |
| 09:00-10:30, Paper TuI1I.33 | Add to My Program |
| Markerless Hand-Eye Calibration by Flange Ellipse Detection |
|
| Jia, Ruoyu | The University of Tokyo |
| Fan, Ruomeng | Imperial College London |
| Guo, Qitong | The University of Tokyo |
| Shi, Xiaohang | The University of Tokyo |
| Hirano, Masahiro | The University of Tokyo |
| Yamakawa, Yuji | The University of Tokyo |
Keywords: Calibration and Identification, Deep Learning Methods, Recognition
Abstract: This paper proposes a simple yet effective markerless hand-eye calibration method that achieves low cost, high accuracy, and strong generalization across different types of robots. The method utilizes a circular flange, a standardized structure in industrial robots, for calibration via the perspective-n-point (PnP) algorithm, achieving superior performance with a simpler pipeline. The entire system is built using mature, off-the-shelf components, avoiding complex architectures. By combining a lightweight object detection network (e.g., Faster R-CNN) with classical geometric techniques, we construct a flange detector that is both accurate and robust. The training process requires no manual annotations, and the resulting model generalizes well across various robot platforms. Experiments demonstrate that our method achieves higher calibration accuracy than more complex existing approaches. Notably, the method maintains consistent precision even when applied to previously unseen robots. Code and pre-trained models will be made available.
|
| |
| 09:00-10:30, Paper TuI1I.34 | Add to My Program |
| Rapid Adaptation of Particle Dynamics for Generalized Deformable Object Mobile Manipulation |
|
| Wu, Bohan | Stanford University |
| Martín-Martín, Roberto | University of Texas at Austin |
| Fei-Fei, Li | Stanford University |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Mobile Manipulation
Abstract: We address the challenge of learning to manipulate deformable objects with unknown dynamics. In non-rigid objects, the dynamics parameters define how they react to interactions (how they stretch, bend, compress, and move) and they are critical to determining the optimal actions to perform a manipulation task successfully. In other robotic domains, such as legged locomotion and in-hand rigid object manipulation, state-of-the-art approaches can handle unknown dynamics using Rapid Motor Adaptation (RMA). Through a supervised procedure in simulation that encodes each rigid object's dynamics, such as mass and position, these approaches learn a policy that conditions actions on a vector of latent dynamic parameters inferred from state-action sequences. However, in deformable object manipulation, the object's dynamics include not only its mass and position, but also how the shape of the object changes. Our key insight is that the recent ground-truth particle positions of a deformable object in simulation capture changes in the object's shape, making it possible to extend RMA to deformable object manipulation. This key insight allows us to develop RAPiD, a two-phase method that learns to perform real-robot deformable object mobile manipulation by: 1) learning a visuomotor policy conditioned on the object's dynamics embedding, which is encoded from the object's privileged information in simulation, such as its mass and ground-truth particle positions, and 2) learning to infer this embedding using non-privileged information instead, such as robot visual observations and actions, so that the learned policy can transfer to the real world. On a mobile manipulator with 22 degrees of freedom, RAPiD achieves success rates above 80% across two vision-based deformable object mobile manipulation tasks in the real world, under unseen object dynamics, categories, and instances.
|
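The second phase described above (inferring the privileged dynamics embedding from non-privileged history) can be caricatured with a linear estimator: rollouts are labeled with the embedding z available only in simulation, and a map is fit to recover z from state-action history alone. The paper trains a network; the least-squares map and all dimensions here are illustrative assumptions.

```python
import numpy as np

# Toy version of RMA-style adaptation-module training: histories are
# generated from latent embeddings plus noise, and a linear estimator
# recovers the embedding from history alone.
rng = np.random.default_rng(0)
n, hist_dim, z_dim = 500, 12, 3
Z = rng.normal(size=(n, z_dim))                       # privileged embeddings
M = rng.normal(size=(z_dim, hist_dim))                # unknown history generator
Hist = Z @ M + 0.01 * rng.normal(size=(n, hist_dim))  # observed histories

W, *_ = np.linalg.lstsq(Hist, Z, rcond=None)          # history -> embedding
err = float(np.abs(Hist @ W - Z).mean())              # small: z is recoverable
```

The transfer argument is exactly this recoverability: if the embedding can be estimated from observations the real robot actually has, the simulation-trained policy can be driven by the estimate instead of the privileged label.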
| |
| 09:00-10:30, Paper TuI1I.35 | Add to My Program |
| A New Approach to Real-Time Odometry Calibration Using an Adaptive Particle Filter Design |
|
| Fariña, Bibiana | Universidad De La Laguna |
| Toledo, Jonay | Universidad De La Laguna |
| Acosta Sánchez, Leopoldo | Universidad De La Laguna |
Keywords: Sensor Fusion, Wheeled Robots, Probability and Statistical Methods
Abstract: This paper presents a novel calibration system for odometric sensors using an Adaptive Particle Filter (Adaptive-PF) to achieve high precision pose estimation and improve localization in wheeled mobile robots. The system compensates for intrinsic systematic errors in the odometric sensor by adjusting its parameters in real time. Likewise, a comparative analysis of resampling methods (multinomial, stratified, systematic, and residual resampling) is conducted to evaluate their impact on calibration performance. The system validation is demonstrated by its implementation in an autonomous wheelchair, where the localization module integrates wheel encoders, an Inertial Measurement Unit (IMU), and a LIDAR sensor, providing robust navigation in dynamic environments. Experimental results demonstrate that the systematic approach and resampling based on the effective number of particles, N_eff, yield the best performance. Additionally, the system dynamically adjusts prediction error based on the differences between LIDAR and odometry data. It also adapts the number of particles according to the dispersion and uncertainty, optimizing computational time without sacrificing accuracy. The proposed system outperforms another well-known method, namely the DKF (Dual Kalman Filter). Consequently, this research introduces a new Adaptive-PF for odometric parameter calibration under changing conditions.
|
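The two ingredients the abstract singles out, the effective number of particles N_eff as a degeneracy trigger and systematic resampling, are standard particle-filter machinery and can be sketched directly. The 0.5·N threshold below is a common heuristic, assumed here for illustration.

```python
import numpy as np

# N_eff measures weight degeneracy (N for uniform weights, 1 when one
# particle carries all the mass); systematic resampling uses a single
# random draw and evenly spaced positions for low-variance selection.
def n_eff(weights):
    w = np.asarray(weights, float)
    w = w / w.sum()
    return 1.0 / float(np.sum(w ** 2))

def systematic_resample(weights, rng):
    w = np.asarray(weights, float)
    w = w / w.sum()
    n = len(w)
    positions = (rng.random() + np.arange(n)) / n   # one draw, evenly spaced
    c = np.cumsum(w)
    c[-1] = 1.0                                     # guard against round-off
    return np.searchsorted(c, positions)            # ancestor indices

rng = np.random.default_rng(1)
w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])           # degenerate weights
idx = None
if n_eff(w) < 0.5 * len(w):                         # resample only when needed
    idx = systematic_resample(w, rng)
```

Resampling only when N_eff drops keeps particle diversity high between updates, which is what lets the filter track slowly drifting odometric parameters.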
| |
| 09:00-10:30, Paper TuI1I.36 | Add to My Program |
| The More the Better? Confidence-Driven Residual Weighting and Depth Fusion for Multi-RGB-D Inertial Odometry |
|
| Yun, Seungsang | Seoul National University, SNU |
| Shin, Jaeho | University of Michigan |
| Cha, Jaekwang | Hyundai Motor Company |
| Kim, Ayoung | Seoul National University |
Keywords: Visual-Inertial SLAM, SLAM, Vision-Based Navigation
Abstract: Multi-camera systems hold considerable promise for enhancing visual odometry by expanding the field of view, yet simply adding more cameras does not guarantee higher accuracy. Because increasing the number of cameras also raises the likelihood of degraded or misaligned views, appropriate handling is essential to prevent severe outliers and corrupted global pose estimates. Previous methods discard points in back-end optimization based on residuals, which has been a bottleneck for real-time performance since erroneous measurements are inevitably incorporated into the main pipeline before removal. In response, we propose a direct Multi-RGB-D Inertial Odometry framework driven by confidence-based weighting, which adaptively down-weights unreliable cameras based on photometric quality and viewpoint alignment. To manage the heavy data load typical of multi-camera setups, we also incorporate a motion-guided selection strategy, filtering out non-informative points before costly alignment. This early pruning reduces computation yet retains critical constraints for odometry. By combining these techniques, our system achieves robust, scale-consistent pose estimation in real time, even with four cameras, as validated through challenging indoor-outdoor experiments involving saturation, occlusions, low-light conditions, and severe glare. We publicly release our multi-RGB-D-inertial dataset at https://github.com/seungsang07/multi-rgbd-inertial-dataset.
|
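The confidence-based weighting can be pictured with a minimal sketch (an assumed form, not the paper's exact scheme): each camera's contribution is scaled by a confidence derived from photometric quality and viewpoint alignment, so a degraded view is down-weighted instead of discarded.

```python
import numpy as np

# Confidence-weighted fusion of per-camera pose increments: unreliable
# cameras contribute proportionally less to the final update.
def fuse_updates(updates, confidences, eps=1e-9):
    u = np.asarray(updates, float)         # per-camera pose increments
    c = np.asarray(confidences, float)
    w = c / (c.sum() + eps)                # normalized confidences
    return (w[:, None] * u).sum(axis=0)    # confidence-weighted average

updates = np.array([[0.10, 0.00, 0.01],   # camera 1: clean view
                    [0.11, 0.01, 0.00],   # camera 2: clean view
                    [0.90, -0.50, 0.40]]) # camera 3: glare-corrupted
conf = np.array([0.9, 0.8, 0.05])
fused = fuse_updates(updates, conf)       # stays near the clean cameras
```

Down-weighting rather than rejecting is what avoids the bottleneck the abstract describes: no residual-based outlier pass is needed before the measurements enter the optimization.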
| |
| 09:00-10:30, Paper TuI1I.37 | Add to My Program |
| Distributed NMPC for Cooperative Aerial Manipulation of Cable-Suspended Loads |
|
| De Carli, Nicola | KTH |
| Belletti, Riccardo | Univ Rennes, CNRS, Inria, IRISA |
| Buzzurro, Emanuele | Univ Rennes, CNRS, Inria, IRISA |
| Testa, Andrea | University of Bologna |
| Notarstefano, Giuseppe | University of Bologna |
| Tognon, Marco | Inria Rennes |
Keywords: Aerial Systems: Mechanics and Control, Multi-Robot Systems, Optimization and Optimal Control
Abstract: In this paper, we address the problem of cooperative manipulation of a cable-suspended load by a team of aerial robots. Unlike classical approaches that rely on centralized controllers, we propose a Distributed Nonlinear Model Predictive Control (DNMPC) framework in which the UAVs communicate a reduced set of variables over a peer-to-peer network. In the proposed method, each robot handles only a small subset of the global optimization problem. The optimal motion computed by the distributed DNMPC loop is then used as a reference for local nonlinear controllers that track the trajectory and compute the robot's actuation inputs. We validate the proposed scheme both through numerical simulations and real-world experiments on the Fly-Crane system: a rigid platform connected to three robots by pairs of cables.
|
| |
| 09:00-10:30, Paper TuI1I.38 | Add to My Program |
| A Miniaturized Tendon-Driven Continuum Robot for Direct Laser Deposition |
|
| Raimondi, Luca | University of Nottingham |
| Russo, Matteo | University of Rome Tor Vergata |
| Dong, Xin | University of Nottingham |
| Norton, Andy | Rolls-Royce Plc |
| Axinte, Dragos | University of Nottingham |
Keywords: Mechanism Design, Tendon/Wire Mechanism
Abstract: Direct laser deposition, a specialized form of additive manufacturing, shows good potential in numerous high-value applications such as the repair of aeroengine blades. However, the traditional setup for this technique is bulky and not suited for in-situ repair, requiring the costly disassembly of the aeroengine. This letter presents a miniaturized high-repeatability tendon-driven robot well suited to delivering additive manufacturing equipment for in-situ techniques such as direct laser deposition. The integrated actuation and ruggedized control unit make the robot portable and suitable for a variety of aeroengines. The design of the robot actuation prevents excessive bending and damage to the fiber optic. Continuum robots have the advantage of flexible and redundant structures but present limited accuracy and repeatability. The optimized kinematics and actuation of the presented robot achieve excellent repeatability, with a standard deviation of 0.02 mm on a linear path and below 0.1 mm on a path that simulates the reconstruction of a blade. The robot showed excellent linearity on each segment of the path with a coefficient of determination to the 3D best-fit line of 0.999, while maintaining the commanded velocity magnitude of the end effector with a standard deviation along the whole path of 0.05 mm/s.
|
| |
| 09:00-10:30, Paper TuI1I.39 | Add to My Program |
| One-Shot Demonstration for Slicing and Cutting Everyday Food Items |
|
| Liu, Yi | University of Southern Denmark |
| Verleysen, Andreas | Ghent University |
| Wyffels, Francis | Ghent University |
| |
| 09:00-10:30, Paper TuI1I.40 | Add to My Program |
| Active Contact Engagement for Aerial Navigation in Unknown Environments with Glass |
|
| Chen, Xinyi | Hong Kong University of Science and Technology |
| Zhang, Yichen | Hong Kong University of Science and Technology |
| Zou, Hetai | Hong Kong University of Science and Technology |
| Wang, Junzhe | Hong Kong University of Science and Technology |
| Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: Aerial Systems: Perception and Autonomy, Vision-Based Navigation, Software-Hardware Integration for Robot Systems
Abstract: Autonomous aerial robots are increasingly being deployed in real-world scenarios, where transparent glass obstacles present significant challenges to reliable navigation. Researchers have investigated the use of non-contact sensors and passive contact-resilient aerial vehicle designs to detect glass surfaces, approaches that are often limited in terms of robustness and efficiency. In this work, we propose a novel approach for robust autonomous aerial navigation in unknown environments with transparent glass obstacles, combining the strengths of both sensor-based and contact-based glass detection. The proposed system begins with the incremental detection and maintenance of information about potential glass surfaces using visual sensor measurements. The vehicle then actively engages in touch actions with the visually detected potential glass surfaces using a pair of lightweight contact-sensing modules to confirm or invalidate their presence. Following this, the volumetric map is efficiently updated with the glass surface information and safe trajectories are replanned on the fly to circumvent the glass obstacles. We validate the proposed system through real-world experiments in various scenarios, demonstrating its effectiveness in enabling efficient and robust autonomous aerial navigation in complex real-world environments with glass obstacles.
|
| |
| 09:00-10:30, Paper TuI1I.41 | Add to My Program |
| DKPMV: Dense Keypoints Fusion from Multi-View RGB Frames for 6D Pose Estimation of Textureless Objects |
|
| Chen, Jiahong | National University of Defense Technology |
| Wang, JingHao | National University of Defense Technology |
| Wang, Zi | National University of Defense Technology |
| Wang, Ziwen | National University of Defense Technology |
| Guan, Banglei | National University of Defense Technology |
| Yu, Qifeng | National University of Defense Technology |
Keywords: Deep Learning for Visual Perception, Recognition, Computer Vision for Automation
Abstract: 6D pose estimation of textureless objects is valuable for industrial robotic applications, yet remains challenging due to the frequent loss of depth information. Current multi-view methods either rely on depth data or insufficiently exploit multi-view geometric cues, limiting their performance. In this paper, we propose DKPMV, a pipeline that achieves dense keypoint-level fusion using only multi-view RGB images as input. We design a three-stage progressive pose optimization strategy that leverages dense multi-view keypoint geometry information. To enable effective dense keypoint fusion, we enhance the keypoint network with attentional aggregation and symmetry-aware training, improving prediction accuracy and resolving ambiguities on symmetric objects. Extensive experiments on the ROBI dataset demonstrate that DKPMV outperforms state-of-the-art multi-view RGB and RGB-D approaches. The code will be available at https://github.com/chenjiahongbq/DKPMV.
|
| |
| 09:00-10:30, Paper TuI1I.42 | Add to My Program |
| Smooth Human-Robot Shared Control for Autonomous Orchard Monitoring with UGVs (I) |
|
| El Bou, Cheikh Melainine | Free University of Bolzano |
| Focchi, Michele | Università Di Trento |
| Chang, Michael | Libera Università Di Bolzano |
| Camurri, Marco | University of Trento |
| von Ellenrieder, Karl Dietrich | Libera Universita Di Bolzano |
Keywords: Robotics and Automation in Agriculture and Forestry, Collision Avoidance, Field Robots
Abstract: Precision agriculture offers the opportunity to automate routine or difficult tasks in orchards and vineyards, such as spraying or inspection, with Unmanned Ground Vehicles (UGV). In this context, human operators should be kept in the closed-loop control of the robot for safety and reliability. This work is motivated by the challenges of effectively deploying human-robot shared control in the field. First, an asymptotically stable controller must keep the robot on the desired trajectory between rows of trees, whose distance is on the order of the robot’s width. Second, the robot must efficiently avoid static and moving obstacles (e.g. a rock or a human) in its path. Third, the control inputs must not exceed the actuator limits, which can degrade the trajectory tracking performance, cause instability, or damage critical hardware. Finally, in real-life scenarios, user intervention is sometimes required to manage unpredictable situations. To overcome these challenges, we propose and deploy a shared controller that smoothly blends automatic trajectory tracking, collision avoidance, and human commands. At the same time, it guarantees the system is stable and control actions are within the actuator limits at all times. We extensively tested our approach in simulation and field experiments in an apple orchard.
|
| |
| 09:00-10:30, Paper TuI1I.43 | Add to My Program |
| MSDNet: Efficient 4D Radar Super-Resolution Via Multi-Stage Distillation |
|
| Huang, Minqing | Tongji University |
| Lu, Shouyi | Tongji University |
| Zheng, Boyuan | Tongji University |
| Li, Ziyao | Tongji University |
| Tang, Xiao | Tongji University |
| Zhuo, Guirong | Tongji University |
Keywords: Computer Vision for Transportation, Sensor Fusion, Localization
Abstract: 4D radar super-resolution, which aims to reconstruct sparse and noisy point clouds into dense and geometrically consistent representations, is a foundational problem in autonomous perception. However, existing methods often suffer from high training cost or rely on complex diffusion-based sampling, resulting in high inference latency and poor generalization, making it difficult to balance accuracy and efficiency. To address these limitations, we propose MSDNet, a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features to achieve both high reconstruction quality and computational efficiency. The first stage performs reconstruction-guided feature distillation (RGFD), aligning and densifying the student’s features through feature reconstruction. In the second stage, we propose diffusion-guided feature distillation (DGFD), which treats the stage-one distilled features as a noisy version of the teacher's representations and refines them via a lightweight diffusion network. Furthermore, we introduce a noise adapter that adaptively aligns the noise level of the feature with a predefined diffusion timestep, enabling more precise denoising. Extensive experiments on the VoD and in-house datasets demonstrate that MSDNet achieves both high-fidelity reconstruction and low-latency inference in the task of 4D radar point cloud super-resolution, and consistently improves performance on downstream tasks.
|
| |
| 09:00-10:30, Paper TuI1I.44 | Add to My Program |
| 3D Foundation Model-Based Loop Closing for Decentralized Collaborative SLAM |
|
| Lajoie, Pierre-Yves | École Polytechnique De Montréal |
| Ramtoula, Benjamin | University of Oxford |
| De Martini, Daniele | University of Oxford |
| Beltrame, Giovanni | Ecole Polytechnique De Montreal |
Keywords: Multi-Robot SLAM, SLAM, Localization
Abstract: Decentralized Collaborative Simultaneous Localization And Mapping (C-SLAM) techniques often struggle to identify map overlaps due to significant viewpoint variations among robots. Motivated by recent advancements in 3D foundation models, which can register images despite large viewpoint differences, we propose a robust loop closing approach that leverages these models to establish inter-robot measurements. In contrast to resource-intensive methods requiring full 3D reconstruction within a centralized map, our approach integrates foundation models into existing SLAM pipelines, yielding scalable and robust multi-robot mapping. Our contributions include: (1) integrating 3D foundation models to reliably estimate relative poses from monocular image pairs within decentralized C-SLAM; (2) introducing robust outlier mitigation techniques critical to the use of these relative poses; and (3) developing specialized pose graph optimization formulations that efficiently resolve scale ambiguities. We evaluate our method against state-of-the-art approaches, demonstrating improvements in localization and mapping accuracy, alongside significant gains in computational and memory efficiency. These results highlight the potential of our approach for deployment in large-scale multi-robot scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.45 | Add to My Program |
| PRED-MPPI: Disturbance-Preview and Efficient MPPI for Robust Quadrotor Tracking with Hardware Validation |
|
| Zhang, Haodi | Southeast University |
| Ge, Junwei | Southeast University |
| Su, Jinya | Southeast University |
| Pan, Yongping | Southeast University |
| Yang, Jun | Loughborough University |
| Chen, Wen-Hua | The Hong Kong Polytechnic University |
| Li, Shihua | Southeast University |
Keywords: Robust/Adaptive Control, Optimization and Optimal Control, Motion Control
Abstract: We propose PRED-MPPI, the first MPPI variant that seamlessly integrates real-time disturbance preview and adaptive discretization for quadrotor tracking control under significant model inaccuracies and time-varying disturbances. Unlike prior MPPI variants (e.g., L1-MPPI, DA-MPPI), which assume constant or matched disturbances, PRED-MPPI leverages a high-order Generalized Extended State Observer for disturbance preview and a Variable Discretization Grid (VDG) to reduce computation and control variance. The synergy enables real-time (50 Hz) quadrotor control under time-varying and mismatched disturbances. Extensive comparative simulation and real-world Crazyflie experiments demonstrate substantial performance gains. In AirSim simulation, PRED-MPPI reduces computation time by over 30%, and mean RMSE by 10.3%, 13.5%, and 14.6% compared to baseline MPPI, and by 2.59%, 3.62%, and 5.80% compared to DA-MPPI across three representative scenarios. In real-world Crazyflie experiments, for ground-effect-disturbed hovering, PRED-MPPI reduces mean and standard deviation (Std) of X–Y plane error by 14.2%/17.9% and 6.03%/21.6% compared to MPPI and DA-MPPI; for fan-induced wind experiments, PRED-MPPI yields improvements of 23.4%/36.8% and 13.8%/25.0% in RMSE and tracking error Std. These results establish PRED-MPPI as the first disturbance-preview MPPI achieving real-world UAV robustness and efficiency, paving the way for deployment on resource-limited robotic platforms. GitHub page with videos is at https://pred-mppi.github.io/
|
| |
| 09:00-10:30, Paper TuI1I.46 | Add to My Program |
| Learning to Design Soft Hands Using Reward Models |
|
| Bai, Xueqian | University of California, San Diego |
| Hansen, Nicklas | University of California San Diego |
| Singh, Adabhav | University of California San Diego (UCSD) |
| Tolley, Michael T. | University of California, San Diego |
| Duan, Yan | Amazon |
| Abbeel, Pieter | Amazon |
| Wang, Xiaolong | UC San Diego |
| Yi, Sha | UC San Diego |
Keywords: Modeling, Control, and Learning for Soft Robots, Deep Learning in Grasping and Manipulation, Telerobotics and Teleoperation
Abstract: Soft robotic hands promise to provide compliant and safe interaction with objects and environments. However, designing soft hands to be both compliant and functional across diverse use cases remains challenging. Although co-design of hardware and control better couples morphology to behavior, the resulting search space is high-dimensional, and even simulation-based evaluation is computationally expensive. In this paper, we propose a Cross-Entropy Method with Reward Model (CEM-RM) framework that efficiently optimizes tendon-driven soft robotic hands based on a teleoperation control policy, reducing design evaluations by more than half compared to pure optimization while learning a distribution of optimized hand designs from pre-collected teleoperation data. We derive a design space for a soft robotic hand composed of flexural soft fingers and implement parallelized training in simulation. The optimized hands are then 3D-printed and deployed in the real world using both teleoperation data and real-time teleoperation. Experiments in both simulation and hardware demonstrate that our optimized design significantly outperforms baseline hands in grasping success rates across a diverse set of challenging objects.
|
| |
| 09:00-10:30, Paper TuI1I.47 | Add to My Program |
| KAN Policy: Learning Efficient and Smooth Robotic Trajectories Via Kolmogorov-Arnold Networks |
|
| Chen, Zikang | Institute of Software Chinese Academy of Sciences |
| Gao, Fei | Advanced Institute of Information Technology (AIIT), Peking University |
| Yu, Ziya | Hohai University |
| Li, Peng | Institute of Software, Chinese Academy of Sciences |
|
|
| |
| 09:00-10:30, Paper TuI1I.48 | Add to My Program |
| GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation |
|
| Jiang, Guangqi | University of California, San Diego |
| Chang, Haoran | University of California, San Diego |
| Qiu, Ri-Zhao | University of California, San Diego |
| Liang, Yutong | University of California San Diego |
| Ji, Mazeyu | University of California, San Diego |
| Zhu, Jiyue | University of California, San Diego |
| Zou, Xueyan | University of California, San Diego |
| Dong, Zhao | Meta Reality Labs Research |
| Wang, Xiaolong | University of California, San Diego |
Keywords: Deep Learning in Grasping and Manipulation, Grippers and Other End-Effectors, Imitation Learning
Abstract: This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates ‘closing the loop’ of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) sim2real visual reinforcement learning. We plan to open-source both the simulation assets and code.
|
| |
| 09:00-10:30, Paper TuI1I.49 | Add to My Program |
| GenJAPNet: A Generalizable Joint Angle Prediction Network with Non-Redundant Muscle Synergy Features for Lower-Limb Exoskeletons |
|
| Zhang, Hairong | Changchun University of Science and Technology |
| Bai, Yu | Changchun University of Science and Technology |
| Ziming, Kou | Taiyuan University of Technology |
| Juan, Wu | Taiyuan University of Technology |
| Qin, Pengjie | Chongqing University |
| Gao, Fei | Chinese Academy of Sciences |
| Shang, Wenze | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
| Teng, Yue | Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences |
| Tian, Dingkui | Shenzhen Advanced Technology Research Institute, Chinese Academy of Sciences |
| Wu, Xinyu | CAS |
Keywords: Prosthetics and Exoskeletons, Rehabilitation Robotics, Deep Learning Methods
Abstract: Lower-limb exoskeleton robots play a significant role in both rehabilitation and assisted walking, where accurate prediction of lower-limb joint angles is crucial for achieving natural gait. However, due to inter-subject variability and differences across locomotion modes, achieving cross-task generalization in joint angle prediction remains a major challenge. This work proposes a novel framework for multi-joint angle prediction in the lower limb, which includes a non-redundant muscle synergy feature extraction algorithm and a Generalizable Joint Angle Prediction Network (GenJAPNet) across speeds and subjects. The feature extraction algorithm employs Non-negative Matrix Factorization (NMF) to extract the activation coefficient matrix from Surface Electromyography (sEMG) signals, followed by further dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP) to obtain more discriminative and non-redundant features. GenJAPNet leverages pre-trained shared features and few-shot fine-tuning to rapidly adapt to new tasks. Through feature extraction algorithm comparison experiments, cross-speed and cross-subject experiments, and exoskeleton-assisted walking physical experiments, the effectiveness and generalizability of this method are validated, demonstrating its potential for enhancing the performance of lower-limb exoskeleton rehabilitation and assistive applications.
|
| |
| 09:00-10:30, Paper TuI1I.50 | Add to My Program |
| DRL-SFM: Learning Social Navigation from Costmaps and Social Forces for Mobile Robots and Intelligent Wheelchairs |
|
| Kalenberg, Matthias | Friedrich-Alexander Universität Erlangen-Nürnberg |
| Probst, Kilian Gerhard | Friedrich-Alexander-Universität Erlangen-Nürnberg |
| Gründer, Andreas | Friedrich-Alexander-Universität Erlangen-Nürnberg |
| May, Christopher | Friedrich-Alexander-Universität Erlangen-Nürnberg |
| Walter, Jonas | Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) |
| Franke, Jörg | Friedrich-Alexander-Universität Erlangen-Nürnberg |
Keywords: Human-Aware Motion Planning, Reinforcement Learning, Human-Centered Robotics
Abstract: The demand for assistive robots for passenger transport, such as intelligent wheelchairs, is increasing rapidly due to demographic changes. To allow passengers to navigate in crowded environments, such as shopping malls and hospitals, these systems must navigate in a socially accepted manner that ensures the comfort of both passengers and surrounding pedestrians. Although deep reinforcement learning (DRL) has shown promising results for social navigation, existing planners often learn overly passive behaviors, not engaging in the mutual adaptation characteristic of human interaction. In this paper, we introduce a novel DRL-based local planner that learns navigation behaviors by integrating the Social Force Model (SFM) directly into its reward function, allowing more cooperative interactions for mobile robots and intelligent wheelchairs. This approach encourages the agent to learn more forward-looking and mutual navigation policies by rewarding actions that align with the dynamics of pedestrians. To ensure generalization and straightforward deployment, our method utilizes the standard Navigation 2 local costmap augmented with pedestrian detections as an observation. The experiments demonstrate that our agent achieves a higher success rate in crowded scenarios with fewer space intrusions, outperforming the state-of-the-art DRL planner based on velocity obstacles by up to 11%.
|
| |
| 09:00-10:30, Paper TuI1I.51 | Add to My Program |
| Multi-Modal Manipulation Via Multi-Modal Policy Consensus |
|
| Chen, Haonan | University of Illinois at Urbana-Champaign |
| Xu, Jiaming | University of Illinois Urbana-Champaign |
| Chen, Hongyu | University of Illinois at Urbana-Champaign |
| Hong, Kaiwen | University of Illinois at Urbana Champaign |
| Huang, Binghao | Columbia University |
| Liu, Chaoqi | University of Illinois at Urbana-Champaign |
| Mao, Jiayuan | MIT |
| Li, Yunzhu | Columbia University |
| Du, Yilun | MIT |
| Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Keywords: Agent-Based Systems, Force and Tactile Sensing, AI-Enabled Robotics
Abstract: Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines in scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We also conduct perturbation-based importance analysis, which reveals adaptive shifts between modalities.
|
| |
| 09:00-10:30, Paper TuI1I.52 | Add to My Program |
| A Gripper with Extreme Stiffness Anisotropy for High-Speed Handling of Fragile Foods |
|
| Wang, Zhongkui | Ritsumeikan University |
| Sato, Mutsuhito | Ritsumeikan University |
| Arita, Hikaru | Kyushu University |
| Mori, Yoshiki | The University of Osaka |
| Kawamura, Sadao | Ritsumeikan University |
| Liu, Mengyao | Ritsumeikan University |
Keywords: Grippers and Other End-Effectors, Dexterous Manipulation, Grasping
Abstract: Handling fragile objects at high speeds presents a significant challenge. To address this challenge, researchers have explored hybrid grippers and grippers with adjustable stiffness. However, grippers exhibiting anisotropic stiffness have received little attention. This paper presents an extreme case of an anisotropic stiffness gripper designed for high-speed handling of delicate foods. The proposed gripper incorporates linear motors and low-friction linear guides. Its extremely low friction in the grasping direction ensures minimal grasping force (less than 0.1 N) and high compliance, enabling the secure handling of fragile objects. Simultaneously, its rigid structure provides sufficient stiffness in the translational direction, ensuring stability during high-speed motion. Leveraging anisotropic stiffness, the gripper achieves both gentle grasping and stable high-speed translation--two requirements that typically necessitate a trade-off. Theoretical analysis was conducted to determine the maximum permissible acceleration under two conditions: with and without stiffness anisotropy. Results indicate that stiffness anisotropy enables significantly higher acceleration during translational motion, thereby reducing task time. Pick-and-place experiments on a 3D-printed object and on delicate foods (a potato chip, a block of tofu, and a piece of dried seaweed) validated the theoretical findings and demonstrated the gripper's capability to handle fragile objects at high speeds effectively.
|
| |
| 09:00-10:30, Paper TuI1I.53 | Add to My Program |
| Background Fades, Foreground Leads: Curriculum-Guided Background Pruning for Efficient Foreground-Centric Collaborative Perception |
|
| Wu, Yuheng | KAIST |
| Gao, Xiangbo | Texas A&M University |
| Tau, Quang | KAIST |
| Tu, Zhengzhong | Texas A&M University |
| Lee, Dongman | KAIST |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation, Cooperating Robots
Abstract: Collaborative perception enhances the reliability and spatial coverage of autonomous vehicles by sharing complementary information across vehicles, offering a promising solution to long-tail scenarios that challenge single-vehicle perception. However, the bandwidth constraints of vehicular networks make transmitting the entire feature map infeasible. Recent methods therefore adopt a foreground-centric paradigm, transmitting only predicted foreground-region features and discarding the background, which encodes essential context. We propose FadeLead, a foreground-centric framework that overcomes this limitation by learning to encapsulate background context into compact foreground features during training. At the core of our design is a curricular learning strategy that leverages background cues early on but progressively prunes them away, forcing the model to internalize context into foreground representations without transmitting the background itself. Extensive experiments on both simulated and real-world scenarios show that FadeLead outperforms prior methods under different bandwidth settings, underscoring the effectiveness of context-enriched foreground sharing. Code, demos, and checkpoints are available at https://wyhallenwu.github.io/FadeLead/.
|
| |
| 09:00-10:30, Paper TuI1I.54 | Add to My Program |
| MoiréTac: A Dual-Mode Visuotactile Sensor for Multidimensional Perception Using Moiré Pattern Amplification |
|
| Sou, Kit-Wa | Tsinghua University |
| Gong, Junhao | Tsinghua University |
| Li, Shoujie | Tsinghua Shenzhen International Graduate School |
| Lyu, Chuqiao | Tsinghua Shenzhen International Graduate School |
| Song, Ziwu | Tsinghua University |
| Mu, Shilong | Tsinghua University |
| Ding, Wenbo | Tsinghua University |
Keywords: Force and Tactile Sensing, Sensor Fusion, Perception for Grasping and Manipulation
Abstract: Visuotactile sensors typically employ sparse marker arrays that limit spatial resolution and lack clear analytical force-to-image relationships. To solve this problem, we present MoiréTac, a dual-mode sensor that generates dense interference patterns via overlapping micro-gratings within a transparent architecture. When two gratings overlap with misalignment, they create moiré patterns that amplify microscopic deformations. The design preserves optical clarity for vision tasks while producing continuous moiré fields for tactile sensing, enabling simultaneous 6-axis force/torque measurement, contact localization, and visual perception. We combine physics-based features (brightness, phase gradient, orientation, and period) from moiré patterns with deep spatial features. These are mapped to 6-axis force/torque measurements, enabling interpretable regression through end-to-end learning. Experimental results demonstrate three capabilities: force/torque measurement with R²>0.98 across tested axes; sensitivity tuning through geometric parameters (threefold gain adjustment); and vision functionality for object classification despite moiré overlay. Finally, we integrate the sensor into a robotic arm for cap removal with coordinated force and torque control, validating its potential for dexterous manipulation.
|
| |
| 09:00-10:30, Paper TuI1I.55 | Add to My Program |
| Dr-PoGO: Direct Radar Pose-Graph Optimization |
|
| Le Gentil, Cedric | University of Toronto |
| Li, Weican | University of Toronto |
| Brizi, Leonardo | Sapienza University of Rome |
| Barfoot, Timothy | University of Toronto |
Keywords: SLAM, Mapping, Intelligent Transportation Systems
Abstract: This paper introduces Dr-PoGO, a method for Simultaneous Localization And Mapping (SLAM) using a 2D spinning radar. Unlike cameras or lidars that require line-of-sight, millimetre-wave radars can `see' through dust, falling snow, rain, etc. Accordingly, it is a great modality for robust perception regardless of the weather conditions. While most existing radar-based SLAM methods rely on the extraction of point clouds or features to perform ego-motion estimation, Dr-PoGO leverages direct registration techniques for odometry (DRO) and loop-closure registration. An off-the-shelf radar-focused place recognition algorithm, RaPlace, provides loop-closure candidates. As RaPlace does not provide relative transformations, Dr-PoGO introduces a coarse-to-fine registration that uses visual features and descriptors to obtain an initial guess for the direct transformation refinement. The global trajectory is optimized in a pose-graph optimization. Dr-PoGO demonstrates state-of-the-art performance over 300km of data in various real-world automotive environments. Our implementation is publicly available: https://github.com/utiasASRL/dr_pogo.
|
| |
| 09:00-10:30, Paper TuI1I.56 | Add to My Program |
| RE-Formation: Resilient and Efficient Formation Planning in Large-Scale Distributed Aerial Swarms (I) |
|
| Zhou, Yuan | Zhejiang University |
| Quan, Lun | Zhejiang University |
| Xu, Guangtong | Zhejiang University |
| Xu, Chao | Zhejiang University |
| Gao, Fei | Zhejiang University |
Keywords: Swarm Robotics, Aerial Systems: Applications
Abstract: Due to the limited online computational resources and the inherent probability of hardware and software failures of real-world robots, large-scale formation planning faces two common challenges: computational intractability and agent failures. Based on the theory of sparse graphs and the maximum clique, we achieve resilient and efficient formation planning (RE-Formation) to address these issues. To improve the computational efficiency of trajectory planning while ensuring flexible formation maneuvers, we introduce sparse graphs to describe connection relationships and present a sparse graph construction method with closed-form solutions. The sparse graphs ensure Global Rigidity for uniquely corresponding to a geometric shape and Preserve the main Features of complete graphs, denoted as the GRPF sparse graph. To prevent the impact of abnormal agents, the problem of eliminating abnormal agents is transformed into an outlier rejection problem that can be solved by computing the maximum clique. We approximate the maximum clique by periodically triggering the calculation of the maximum k-core to meet the real-time computational demands of large-scale swarms. We validate the performance through real-world experiments and implement formation planning with 100 drones in simulation. Benchmark comparisons and ablation experiments demonstrate the effectiveness of the proposed method.
|
| |
| 09:00-10:30, Paper TuI1I.57 | Add to My Program |
| Miniature Multihole Airflow Sensor for Lightweight Aircraft Over Wide Speed and Angular Range |
|
| Stuber, Lukas | EPFL |
| Jeger, Simon Luis | EPFL |
| Zufferey, Raphael | MIT |
| Floreano, Dario | Ecole Polytechnique Fédérale De Lausanne (EPFL) |
Keywords: Aerial Systems: Perception and Autonomy, Product Design, Development and Prototyping
Abstract: An aircraft's airspeed, angle of attack, and angle of sideslip are crucial to its safety, especially when flying close to the stall regime. Various solutions exist, including pitot tubes, angular vanes, and multihole pressure probes. However, current sensors are either too heavy (>30 g) or require large airspeeds (>20 m/s), making them unsuitable for small uncrewed aerial vehicles. We propose a novel multihole pressure probe, integrating sensing electronics in a single-component structure, resulting in a mechanically robust and lightweight sensor (9 g), which we released to the public domain. Since there is no consensus on two critical design parameters, tip shape (conical vs spherical) and hole spacing (distance between holes), we provide a study on measurement accuracy and noise generation using wind tunnel experiments. The sensor is calibrated using a multivariate polynomial regression model over an airspeed range of 3-27 m/s and an angle of attack/sideslip range of +/-35 deg, achieving a mean absolute error of 0.44 m/s and 0.16 deg. Finally, we validated the sensor in outdoor flights near the stall regime. Our probe enabled accurate estimation of airspeed, angle of attack, and sideslip during different acrobatic manoeuvres. Due to its size and weight, this sensor will enable safe flight for lightweight, uncrewed aerial vehicles flying at low speeds close to the stall regime.
|
| |
| 09:00-10:30, Paper TuI1I.58 | Add to My Program |
| TopAY: Efficient Trajectory Planning for Differential Drive Mobile Manipulators Via Topological Paths Search and Arc Length-Yaw Parameterization |
|
| Xu, Long | Zhejiang University |
| Wong, Choi Lam | Huzhou Institute of Zhejiang University |
| Zhang, Mengke | Zhejiang University |
| Lin, Junxiao | Zhejiang University |
| Hou, Jialiang | Zhejiang University |
| Gao, Fei | Zhejiang University |
Keywords: Motion and Path Planning, Collision Avoidance, Mobile Manipulation
Abstract: Differential drive mobile manipulators combine the mobility of wheeled bases with the manipulation capability of multi-joint arms, enabling versatile applications but posing considerable challenges for trajectory planning due to their high-dimensional state space and nonholonomic constraints. This paper introduces TopAY, an optimization-based planning framework designed for efficient and safe trajectory generation for differential drive mobile manipulators. The framework employs a hierarchical initial value acquisition strategy, including topological paths search for the base and parallel sampling for the manipulator. A polynomial trajectory representation with arc length–yaw parameterization is also proposed to reduce optimization complexity while preserving dynamic feasibility. Extensive simulation and real-world experiments validate that TopAY achieves higher planning efficiency and success rates than state-of-the-art methods in dense and complex scenarios. The source code is released at https://github.com/TopAY-Planner/TopAY.
|
| |
| 09:00-10:30, Paper TuI1I.59 | Add to My Program |
| Semantic-LiDAR-Inertial-Wheel Odometry Fusion for Robust Localization in Large-Scale Dynamic Environments |
|
| Jiang, Haoxuan | The Hong Kong University of Science and Technology (Guangzhou) |
| Qian, Peicong | Unity Drive Innovation Technology |
| Xie, Yusen | The Hong Kong University of Science and Technology (Guangzhou) |
| Zheng, Linwei | The Hong Kong University of Science and Technology |
| Li, Xiaocong | Eastern Institute of Technology, Ningbo |
| Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
| Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Industrial Robots, SLAM, Localization
Abstract: Reliable, drift-free global localization presents significant challenges yet remains crucial for autonomous navigation in large-scale dynamic environments. In this paper, we introduce a tightly-coupled Semantic-LiDAR-Inertial-Wheel Odometry fusion framework, which is specifically designed to provide high-precision state estimation and robust localization in large-scale dynamic environments. Our framework leverages an efficient semantic-voxel map representation and employs an improved scan matching algorithm, which utilizes global semantic information to significantly reduce long-term trajectory drift. Furthermore, it seamlessly fuses data from LiDAR, IMU, and wheel odometry using a tightly-coupled multi-sensor fusion Iterative Error-State Kalman Filter (iESKF). This ensures reliable localization without experiencing abnormal drift. Moreover, to tackle the challenges posed by terrain variations and dynamic movements, we introduce a 3D adaptive scaling strategy that allows for flexible adjustments to wheel odometry measurement weights, thereby enhancing localization precision. This study presents extensive real-world experiments conducted in a one-million-square-meter automated port, encompassing 3,575 hours of operational data from 35 Intelligent Guided Vehicles (IGVs). The results consistently demonstrate that our system outperforms state-of-the-art LiDAR-based localization methods in large-scale dynamic environments, highlighting the framework's reliability and practical value.
|
| |
| 09:00-10:30, Paper TuI1I.60 | Add to My Program |
| Collaborative Human-Robot Object Transportation Using a Deformable Sheet |
|
| Zhang, Weijian | University of Birmingham |
| Street, Charlie | University of Birmingham |
| Mansouri, Masoumeh | Birmingham University |
Keywords: Multi-Robot Systems, Optimization and Optimal Control, Human-Robot Collaboration
Abstract: In this paper, we tackle real-time formation trajectory planning for collaborative object transportation in complex environments using a team of nonholonomic robots and a human. The object is transported in a deformable sheet, and robots should follow the human’s lead while autonomously avoiding obstacles. By including a human in the formation, we leverage their adaptability and decision-making to improve transportation. However, it can be difficult for a human to predict how autonomous robots will behave in complex situations, such as when the formation must cross an obstacle, i.e. where the object is transported above it. This could cause human decisions that compromise safety. To overcome these challenges, we introduce a multi-modal formation planning framework. By default the human leads the formation, and the robots plan to remain in the same homotopy class as the human to avoid collisions. If obstacle crossing is necessary the robots take the lead of the formation, where human motion is constrained to a feasible region projected visually in front of them. We demonstrate the efficacy of our framework in simulation and on hardware.
|
| |
| 09:00-10:30, Paper TuI1I.61 | Add to My Program |
| Teamformer: Scalable Heterogeneous Multi-Robot Team Formation |
|
| Boehme, Noah | Oregon State University |
| Hollinger, Geoffrey | Oregon State University |
Keywords: Multi-Robot Systems, Reinforcement Learning
Abstract: Accounting for heterogeneity among robots and tasks adds additional complexity to multi-robot task allocation. While existing task allocation methods effectively handle heterogeneity among robots and tasks, they do not scale well in the number of different robots and tasks. To address this gap, we formulate the Team Formation Markov Decision Process (TF-MDP) for training Teamformer: a scalable, decentralized transformer policy for dynamically forming heterogeneous teams of robots to complete diverse tasks. Combining the TF-MDP with the autoregressive capability of transformers enables Teamformer to scale linearly in the number of robots, tasks, and combinations of different heterogeneous robots. Simulations demonstrate Teamformer generalizing to combinations of 100 different types of robots and tasks. Hardware experiments using Georgia Tech's Robotarium show Teamformer decentrally coordinating up to 20 heterogeneous robots for task completion.
|
| |
| 09:00-10:30, Paper TuI1I.62 | Add to My Program |
| FORM: Fixed-Lag Odometry with Reparative Mapping Utilizing Rotating LiDAR Sensors |
|
| Potokar, Easton | Carnegie Mellon University |
| Pool, Taylor | Carnegie Mellon University |
| McGann, Daniel | Carnegie Mellon University |
| Kaess, Michael | Carnegie Mellon University |
Keywords: Localization, Range Sensing, SLAM
Abstract: Light Detection and Ranging (LiDAR) sensors have become a de-facto sensor for many robot state estimation tasks, spurring development of many LiDAR Odometry (LO) methods in recent years. While some smoothing-based LO methods have been proposed, most require matching against multiple scans, resulting in sub-real-time performance. Due to this, most prior works estimate a single state at a time and are ``submap''-based. This architecture propagates any error in pose estimation to the fixed submap and can cause jittery trajectories and degrade future registrations. We propose Fixed-Lag Odometry with Reparative Mapping (FORM), a LO method that performs smoothing over a densely connected factor graph while utilizing a single iterative map for matching. This allows for both real-time performance and active correction of the local map as pose estimates are further refined. We evaluate on a wide variety of datasets to show that FORM is robust, accurate, real-time, and provides smooth trajectory estimates when compared to prior state-of-the-art LO methods.
|
| |
| 09:00-10:30, Paper TuI1I.63 | Add to My Program |
| Latent Action Diffusion for Cross-Embodiment Manipulation |
|
| Bauer, Erik | ETH Zürich |
| Nava, Elvis | ETH Zurich, Mimic Robotics |
| Katzschmann, Robert K. | ETH Zurich |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Representation Learning
Abstract: End-to-end learning is emerging as a powerful paradigm for robotic manipulation, but its effectiveness is limited by data scarcity and the heterogeneity of action spaces across robot embodiments. In particular, diverse action spaces across different end-effectors create barriers for cross-embodiment learning and skill transfer. We address this challenge through diffusion policies learned in a latent action space that unifies diverse end-effector actions. We first show that we can learn a semantically aligned latent action space for anthropomorphic robotic hands, a human hand, and a parallel jaw gripper using encoders trained with a contrastive loss. Second, we show that by using our proposed latent action space for co-training on manipulation data from different end-effectors, we can utilize a single policy for multi-robot control and obtain up to 25.3% improved manipulation success rates, indicating successful skill transfer despite a significant embodiment gap. Our approach using latent cross-embodiment policies presents a new method to unify different action spaces across embodiments, enabling efficient multi-robot control and data sharing across robot setups. This unified representation significantly reduces the need for extensive data collection for each new robot morphology, accelerates generalization across embodiments, and ultimately facilitates more scalable and efficient robotic learning.
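A minimal numpy sketch of a contrastive alignment loss of the kind the abstract describes for learning a shared latent action space; the `nt_xent` function, temperature, and toy embeddings are assumptions for illustration, not the paper's training code.

```python
import numpy as np

def nt_xent(za, zb, temperature=0.1):
    """Symmetric InfoNCE-style loss that pulls paired embeddings
    za[i] <-> zb[i] together and pushes mismatched pairs apart."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature          # pairwise cosine similarities

    def xent(l):
        # cross-entropy with the diagonal (true pair) as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
hand = rng.normal(size=(16, 8))                      # e.g. hand-action embeddings
gripper = hand + 0.01 * rng.normal(size=(16, 8))     # well-aligned gripper embeddings
loss_aligned = nt_xent(hand, gripper)
loss_shuffled = nt_xent(hand, gripper[::-1].copy())  # broken pairing
```

Well-aligned embedding pairs yield a much lower loss than shuffled pairs, which is the training signal that aligns the different end-effector action spaces.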
|
| |
| 09:00-10:30, Paper TuI1I.64 | Add to My Program |
| ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving Via Multi-Modal Joint Learning |
|
| Song, Qi | The Chinese University of Hong Kong, Shenzhen |
| Li, Chenghong | The Chinese University of Hong Kong, Shenzhen |
| Lin, Haotong | Zhejiang University |
| Peng, Sida | Zhejiang University |
| Huang, Rui | The Chinese University of Hong Kong, Shenzhen |
Keywords: RGB-D Perception, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from merely single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric clues. Furthermore, we propose a Multi-modal Feature Matching strategy coupled with a Multi-scale Gaussian Decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on Waymo and KITTI demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.
|
| |
| 09:00-10:30, Paper TuI1I.65 | Add to My Program |
| TUNI: Real-Time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion |
|
| Guo, Xiaodong | Beijing Institute of Technology |
| Liu, Tong | Beijing Institute of Technology |
| Li, Yike | Beijing Institute of Technology |
| Lin, Zi'ang | Beijing Institute of Technology |
| Deng, Zhihong | Beijing Institute of Technology |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, Semantic Scene Understanding
Abstract: RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model’s real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder’s capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Codes are available at https://github.com/xiaodonguo/TUNI.
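The adaptive cosine-similarity idea can be sketched as follows. This is one plausible reading, not the paper's RGB-T local module: the gating rule, the fusion formula, and all names here are assumptions, intended only to show how a per-location cosine score can arbitrate between consistent and modality-specific features.

```python
import numpy as np

def cosine_gate(rgb, thermal, eps=1e-8):
    """Per-pixel cosine similarity across the channel dim of (C, H, W) features.
    High similarity -> weight the consistent (averaged) features;
    low similarity  -> weight the modality-specific (element-wise max) ones."""
    num = (rgb * thermal).sum(axis=0)
    den = np.linalg.norm(rgb, axis=0) * np.linalg.norm(thermal, axis=0) + eps
    sim = num / den                        # (H, W), in [-1, 1]
    w = 0.5 * (sim + 1.0)                  # map to [0, 1]
    fused = w * 0.5 * (rgb + thermal) + (1.0 - w) * np.maximum(rgb, thermal)
    return fused, sim

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 3, 3))             # toy (C, H, W) feature map
fused_same, sim_same = cosine_gate(f, f)   # identical modalities -> sim = 1
```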
|
| |
| 09:00-10:30, Paper TuI1I.66 | Add to My Program |
| All-Onboard Relative Positioning and Control Framework for Autonomous Micro-UAV Swarms Based on Vision-Optoelectronic-UWB Fusion and Distributed Graph Optimization |
|
| Xiong, Chengsong | Tsinghua University |
| Wan, Jiaqi | Tsinghua University |
| Tong, Qifan | Tsinghua University |
| Lu, Wenshuai | Tsinghua University |
| He, Qingning | Tsinghua University |
| You, Zheng | Tsinghua University |
Keywords: Swarm Robotics, Localization, Sensor Fusion
Abstract: The autonomous cooperation of micro-Unmanned Aerial Vehicle (UAV) swarms remains a key challenge. Existing swarm relative positioning and control methods demand high sensing, computing, and communication resources and rely on external equipment like GPS and ground stations. To address these issues, this paper proposes an all-onboard and external-aiding-free swarm relative measurement, positioning and control framework. The framework utilizes an onboard Vision-Optoelectronic-Ultra-Wideband (UWB) coupled measurement system to acquire inter-UAV relative distance and direction. Subsequently, the swarm's relative positions are solved via a distributed graph optimization (DGO) approach. Based on the solved relative positions, swarm cooperative control is implemented through a distributed Voronoi diagram approach. Experimental results demonstrate that the proposed method enables 150 g micro-UAVs to achieve nearly 100-meter autonomous outdoor formation flight and collaborative tracking of dynamic targets, with swarm relative localization accuracy reaching approximately 0.262 m. This work pioneers fully autonomous measurement and control for 100-gram scale UAV swarms without external infrastructure, significantly advancing autonomy and enabling swarm intelligence emergence.
|
| |
| 09:00-10:30, Paper TuI1I.67 | Add to My Program |
| Design of an Adaptive Modular Anthropomorphic Dexterous Hand for Human-Like Manipulation |
|
| Zhou, Zelong | Hunan University |
| Chen, Wenrui | Hunan University |
| Hu, Zeyun | Hunan University |
| Diao, Qiang | Hunan University |
| Gao, Qixin | Hunan University |
| Yan, Cuo | Hunan University |
| Wang, Yaonan | Hunan University |
Keywords: Multifingered Hands, Compliant Joints and Mechanisms
Abstract: Biological synergies have emerged as a widely adopted paradigm for dexterous hand design, enabling human-like manipulation with a small number of actuators. Nonetheless, excessive coupling tends to diminish the dexterity of hands. This paper tackles the trade-off between actuation complexity and dexterity by proposing an anthropomorphic finger topology with 4-DoFs driven by 2 actuators, and by developing an adaptive, modular dexterous hand based on this finger topology. We explore the biological basis of hand synergies and human gesture analysis, translating joint-level coordination and structural attributes into a modular finger architecture. Leveraging these biomimetic mappings, we design a five-finger modular hand and establish its kinematic model to analyze adaptive grasping and in-hand manipulation. Finally, we construct a physical prototype and conduct preliminary experiments, which validate the effectiveness of the proposed design and analysis.
|
| |
| 09:00-10:30, Paper TuI1I.68 | Add to My Program |
| Perception-Control Coupled Visual Servoing for Textureless Objects Using Keypoint-Based EKF |
|
| Tao, Allen | University of Toronto |
| Yang, Jun | Epson Canada Ltd |
| Oparnica, Stanko | Epson Canada Inc |
| Xue, Wenjie | Epson Canada |
Keywords: Visual Servoing, Perception for Grasping and Manipulation, Computer Vision for Automation
Abstract: Visual servoing is fundamental to robotic applications, enabling precise positioning and control. However, applying it to textureless objects remains a challenge due to the absence of reliable visual features. Moreover, adverse visual conditions, such as occlusions, often corrupt visual feedback, leading to reduced accuracy and instability in visual servoing. In this work, we build upon learning-based keypoint detection for textureless objects and propose a method that enhances robustness by tightly integrating perception and control in a closed loop. Specifically, we employ an Extended Kalman Filter (EKF) that integrates per-frame keypoint measurements to estimate 6D object pose, which drives pose-based visual servoing (PBVS) for control. The resulting camera motion, in turn, enhances the tracking of subsequent keypoints, effectively closing the perception-control loop. Additionally, unlike standard PBVS, we propose a probabilistic control law that computes both camera velocity and its associated uncertainty, enabling uncertainty-aware control for safe and reliable operation. We validate our approach on real-world robotic platforms using quantitative metrics and grasping experiments, demonstrating that our method outperforms traditional visual servoing techniques in both accuracy and practical application.
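The EKF loop at the heart of such a pipeline can be sketched generically. The toy constant-velocity model and the position-only "keypoint" measurement below are illustrative assumptions, not the paper's 6D object-pose filter, but the predict/linearise/update structure is the same.

```python
import numpy as np

def ekf_step(x, P, z, F, Q, h, H_jac, R):
    """One EKF predict/update cycle with a (possibly nonlinear) measurement h."""
    x_pred = F @ x                        # predict state
    P_pred = F @ P @ F.T + Q              # predict covariance
    H = H_jac(x_pred)                     # linearise measurement model
    y = z - h(x_pred)                     # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy constant-velocity target, observed through noisy position measurements.
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
Q = 1e-4 * np.eye(2)
R = np.array([[1e-2]])
h = lambda x: x[:1]                       # measure position only
H_jac = lambda x: np.array([[1.0, 0.0]])

x, P = np.zeros(2), np.eye(2)
truth, rng = 0.0, np.random.default_rng(1)
for _ in range(100):
    truth += 0.5 * dt                     # true velocity is 0.5
    z = np.array([truth]) + 0.05 * rng.normal(size=1)
    x, P = ekf_step(x, P, z, F, Q, h, H_jac, R)
```

In the paper's setting the state would be a 6D pose, `h` would project keypoints through the camera model, and the covariance `P` would feed the uncertainty-aware control law.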
|
| |
| 09:00-10:30, Paper TuI1I.69 | Add to My Program |
| Hallucinating 360°: Panoramic Street-View Generation Via Local Scenes Diffusion and Probabilistic Prompting |
|
| Teng, Fei | Hunan University |
| Luo, Kai | Hunan University |
| Wu, Sheng | Hunan University |
| Li, Siyu | Hunan University |
| Guo, PuJun | Hunan University |
| Wei, Jiale | Karlsruhe Institute of Technology |
| Zhang, Jiaming | Hunan University |
| Peng, Kunyu | Karlsruhe Institute of Technology |
| Yang, Kailun | Hunan University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Semantic Scene Understanding
Abstract: Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose the first panoramic generation method, Percep360, for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with-reference), controllability, and their utility in real-world Bird’s Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models, leading to an improvement of 2.5% in mIoU for panoramic BEV segmentation. The source code will be publicly available.
|
| |
| 09:00-10:30, Paper TuI1I.70 | Add to My Program |
| Prepare for Warp Speed: Sub-Millisecond Visual Place Recognition Using Event Cameras |
|
| Ramanathan, Vignesh | Queensland University of Technology |
| Milford, Michael J | Queensland University of Technology |
| Fischer, Tobias | Queensland University of Technology |
Keywords: Localization
Abstract: Visual Place Recognition (VPR) enables systems to identify previously visited locations within a map, a fundamental task for autonomous navigation. Prior works have developed VPR solutions using event cameras, which asynchronously measure per-pixel brightness changes with microsecond temporal resolution. However, these approaches rely on dense representations of the inherently sparse camera output and require tens to hundreds of milliseconds of event data to predict a place. Here, we break this paradigm with Flash, a lightweight VPR system that predicts places using sub-millisecond slices of event data. Our method is based on the observation that active pixel locations provide strong discriminative features for VPR. Flash encodes these active pixel locations using efficient binary frames and computes similarities via fast bitwise operations, which are then normalized based on the relative event activity in the query and reference frames. Flash improves Recall@1 for sub-millisecond VPR over existing baselines by 11.33x on the indoor QCR-Event-Dataset and 5.92x on the 8 km Brisbane-Event-VPR dataset. Moreover, our approach reduces the duration for which the robot must operate without awareness of its position, as evidenced by a localization latency metric we term Time to Correct Match (TCM). To the best of our knowledge, this is the first work to demonstrate sub-millisecond VPR using event cameras.
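The core trick — binary active-pixel frames compared with fast bitwise operations and normalised by event activity — can be illustrated in a few lines. The encoding and the exact normalisation below are assumptions for illustration, not Flash's published formulation.

```python
def encode(active_pixels, width):
    """Pack active (x, y) pixel coordinates into a single integer bitmask."""
    frame = 0
    for x, y in active_pixels:
        frame |= 1 << (y * width + x)
    return frame

def similarity(query, ref):
    """Count shared active pixels via bitwise AND (a popcount),
    normalised by the total activity of both frames."""
    inter = bin(query & ref).count("1")
    total = bin(query).count("1") + bin(ref).count("1")
    return 2.0 * inter / total if total else 0.0

W = 8                                              # toy sensor width
q = encode([(1, 1), (2, 3), (5, 6)], W)            # query slice
r_match = encode([(1, 1), (2, 3), (5, 6), (7, 7)], W)
r_other = encode([(0, 0), (4, 4)], W)
```

Because frames are plain integers, matching a query against a reference database reduces to AND + popcount per candidate, which is what makes sub-millisecond slices practical.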
|
| |
| 09:00-10:30, Paper TuI1I.71 | Add to My Program |
| CD-FKD: Cross-Domain Feature Knowledge Distillation for Robust Single-Domain Generalization in Object Detection |
|
| Lee, Junseok | LG Electronics |
| Shin, Sungho | Naver |
| Lee, Seongju | Gwangju Institute of Science and Technology (GIST) |
| Lee, Kyoobin | Gwangju Institute of Science and Technology |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Transportation
Abstract: Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method uses diversified data through downscaling and corruption to train the student network, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving object detection robustness to domain shifts. This approach is valuable in real-world applications, like autonomous driving and surveillance, where robust object detection in diverse environments is crucial.
|
| |
| 09:00-10:30, Paper TuI1I.72 | Add to My Program |
| T2S: Tokenized Skill Scaling for Lifelong Imitation Learning |
|
| Zhang, Hongquan | East China Normal University |
| Gong, Jingyu | East China Normal University |
| Zhizhong, Zhang | East China Normal University |
| Tan, Xin | East China Normal University |
| Xie, Yuan | East China Normal University |
Keywords: Continual Learning, Incremental Learning, Imitation Learning
Abstract: The main challenge in lifelong imitation learning lies in the balance between mitigating catastrophic forgetting of previous skills while maintaining sufficient capacity for acquiring new ones. However, current approaches typically address these aspects in isolation, overlooking their internal correlation in lifelong skill acquisition. We address this limitation with a unified framework named Tokenized Skill Scaling (T2S). Specifically, by tokenizing the model parameters, the linear parameter mapping of the traditional transformer is transformed into cross-attention between input and learnable tokens, thereby enhancing model scalability through the easy extension of new tokens. Additionally, we introduce language-guided skill scaling to transfer knowledge across tasks efficiently and avoid linearly growing parameters. Extensive experiments across diverse tasks demonstrate that T2S: 1) effectively prevents catastrophic forgetting (achieving an average NBT of 1.0% across the three LIBERO task suites), 2) excels in new skill scaling with minimal increases in trainable parameters (needing only 8.0% trainable tokens on average across lifelong tasks), and 3) enables efficient knowledge transfer between tasks (achieving an average FWT of 77.7% across the three LIBERO task suites), offering a promising solution for lifelong imitation learning.
|
| |
| 09:00-10:30, Paper TuI1I.73 | Add to My Program |
| Automated Coral Spawn Monitoring for Reef Restoration: The Coral Spawn and Larvae Imaging Camera System (CSLICS) |
|
| Tsai, Dorian | Queensland University of Technology |
| Brunner, Christopher A. | Australian Institute of Marine Science |
| Lamont, Riki | Australian Institute of Marine Science |
| Nordborg, F. Mikaela | Australian Institute of Marine Science |
| Severati, Andrea | AIMS |
| Terry, Java | Queensland University of Technology |
| Jackel, Karen E | Queensland University of Technology |
| Dunbabin, Matthew | Queensland University of Technology |
| Fischer, Tobias | Queensland University of Technology |
| Raine, Scarlett | Queensland University of Technology |
Keywords: Environment Monitoring and Management, Robotics and Automation in Life Sciences
Abstract: Coral aquaculture for reef restoration requires accurate and continuous spawn counting for resource distribution and larval health monitoring, but current methods are labor-intensive and represent a critical bottleneck in the coral production pipeline. We propose the Coral Spawn and Larvae Imaging Camera System (CSLICS), which uses low-cost modular cameras and object detectors trained using human-in-the-loop labeling approaches for automated spawn counting in larval rearing tanks. This paper details the system engineering, dataset collection, and computer vision techniques to detect, classify and count coral spawn. Experimental results from mass spawning events demonstrate an F1 score of 82.4% for surface spawn detection at different embryogenesis stages, 65.3% F1 score for sub-surface spawn detection, and a saving of 5,720 hours of labor per spawning event compared to manual sampling methods at the same frequency. Comparison of manual counts with CSLICS monitoring during a mass coral spawning event on the Great Barrier Reef demonstrates CSLICS' accurate measurement of fertilization success and sub-surface spawn counts. These findings enhance the coral aquaculture process and enable upscaling of coral reef restoration efforts to address climate change threats facing ecosystems like the Great Barrier Reef.
|
| |
| 09:00-10:30, Paper TuI1I.74 | Add to My Program |
| Asymptotically Stable Gait Generation and Instantaneous Walkability Determination for Planar Almost Linear Biped with Knees |
|
| Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
| Lei, Ning | Japan Advanced Institute of Science and Technology |
| Sedoguchi, Taiki | Japan Advanced Institute of Science and Technology |
Keywords: Dynamics, Humanoid and Bipedal Locomotion, Motion Control
Abstract: A class of planar bipedal robots with unique mechanical properties has been proposed, where all links are balanced around the hip joint, preventing natural swinging motion due to gravity. A common property of their equations of motion is that the inertia matrix is a constant matrix, there are no nonlinear velocity terms, and the gravity term contains simple nonlinear terms. By performing a Taylor expansion of the gravity term and making a linear approximation, it is easy to derive a linearized model, and calculations for future states or walkability determination can be performed instantaneously without the need for numerical integration. This paper extends the method to a planar biped robot model with knees. First, we derive the equations of motion, constraint conditions, and inelastic collisions for a planar 6-DOF biped robot, design its control system, and numerically generate a stable bipedal gait on a horizontal plane. Next, we reduce the equations of motion to a 3-DOF model, and derive a linearized model by approximating the gravity term as linear around the expansion point for the thigh frame angle. Through numerical simulations, we demonstrate that calculations for future states and walkability determination can be completed in negligible time. By applying control inputs to the obtained model, performing state-space realization, and then discretizing it, instantaneous walkability determination through iterative calculation becomes possible. Through detailed gait analysis, we discuss how the knee joint flexion angle and the expansion point affect the accuracy of the linear approximation, and the issues that arise when descending a small step.
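The Taylor-expansion step can be illustrated on a scalar stand-in for the gravity term; `sin` here is a generic placeholder for the simple nonlinear terms the abstract mentions, and the expansion point `theta0` is an arbitrary assumption, not a value from the paper.

```python
import numpy as np

theta0 = 0.3                          # expansion point (illustrative)

def grav(theta):
    return np.sin(theta)              # stand-in for a nonlinear gravity term

def grav_lin(theta):
    # First-order Taylor expansion of grav() around theta0
    return np.sin(theta0) + np.cos(theta0) * (theta - theta0)

err_near = abs(grav(0.35) - grav_lin(0.35))   # close to the expansion point
err_far = abs(grav(1.3) - grav_lin(1.3))      # far from the expansion point
```

Near the expansion point the linear model is accurate to well under a millionth of a radian's worth of error, while far from it the error grows large — the same trade-off the paper analyses when choosing the expansion point for the thigh frame angle.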
|
| |
| 09:00-10:30, Paper TuI1I.75 | Add to My Program |
| CoPlanner: An Interactive Motion Planner with Contingency-Aware Diffusion for Autonomous Driving |
|
| Zhong, Ruiguo | The Hong Kong University of Science and Technology (Guangzhou) |
| Yao, Ruoyu | The Hong Kong University of Science and Technology (Guangzhou) |
| Liu, Pei | The Hong Kong University of Science and Technology(GuangZhou) |
| Chen, Xiaolong | The Hong Kong University of Science and Technology (Guangzhou) |
| Yang, Rui | The Hong Kong University of Science and Technology |
| Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation, Imitation Learning
Abstract: Accurate trajectory prediction and motion planning are crucial for autonomous driving systems to navigate safely in complex, interactive environments characterized by multimodal uncertainties. However, current generation-then-evaluation frameworks typically construct multiple plausible trajectory hypotheses but ultimately adopt a single most likely outcome, leading to overconfident decisions and a lack of fallback strategies that are vital for safety in rare but critical scenarios. Moreover, the usual decoupling of prediction and planning modules could result in socially inconsistent or unrealistic joint trajectories, especially in highly interactive traffic. To address these challenges, we propose a contingency-aware diffusion planner (CoPlanner), a unified framework that jointly models multi-agent interactive trajectory generation and contingency-aware motion planning. Specifically, the pivot-conditioned diffusion mechanism anchors trajectory sampling on a validated, shared short-term segment to preserve temporal consistency, while stochastically generating diverse long-horizon branches that capture multimodal motion evolutions. In parallel, we design a contingency-aware multi-scenario scoring strategy that evaluates candidate ego trajectories across multiple plausible long-horizon evolution scenarios, balancing safety, progress, and comfort. This integrated design preserves feasible fallback options and enhances robustness under uncertainty, leading to more realistic interaction-aware planning. Extensive closed-loop experiments on the nuPlan benchmark demonstrate that CoPlanner consistently surpasses state-of-the-art methods on both Val14 and Test14 datasets, achieving significant improvements in safety and comfort under both reactive and non-reactive settings. The source code is available on GitHub.
|
| |
| 09:00-10:30, Paper TuI1I.76 |
| ExBody2: Advanced Expressive Humanoid Whole-Body Control |
|
| Ji, Mazeyu | UCSD |
| Peng, Xuanbin | University of California, San Diego |
| Liu, Fangchen | University of California, Berkeley |
| Li, Jialong | UCSD |
| Yang, Ge | Massachusetts Institute of Technology |
| Cheng, Xuxin | University of California, San Diego |
| Wang, Xiaolong | UC San Diego |
Keywords: Humanoid and Bipedal Locomotion, Humanoid Robot Systems, Reinforcement Learning
Abstract: This paper tackles the challenge of enabling real-world humanoid robots to perform expressive and dynamic whole-body motions while maintaining stability. We propose ExBody2, a whole-body tracking framework trained in simulation with Reinforcement Learning and then transferred to the real world. The framework decouples keypoint tracking from velocity control and leverages a privileged teacher policy to distill precise mimic skills into the student policy, enabling robust, high-fidelity reproduction of complex motions such as walking, crouching, and dancing. A significant contribution is the identification of an empirical trade-off between feasibility and diversity in motion datasets, which guides the development of an automatic dataset curation method. This principle facilitates pretraining a versatile model generalizing well across diverse motions and can be fine-tuned for specific tasks to achieve superior tracking accuracy. Extensive experiments show that ExBody2 achieves consistently better performance than strong baselines and provides insights that may inform future work on whole-body humanoid control.
|
| |
| 09:00-10:30, Paper TuI1I.77 |
| Vision-Based Panoptic Occupancy Prediction in Urban Environments |
|
| Marcuzzi, Rodrigo | University of Bonn |
| Nunes, Lucas | University of Bonn |
| Marks, Elias Ariel | University of Bonn |
| Zhong, Xingguang | University of Bonn |
| Behley, Jens | University of Bonn |
| Stachniss, Cyrill | University of Bonn |
Keywords: Semantic Scene Understanding, Deep Learning Methods
Abstract: Understanding the surrounding scene geometrically and semantically is a key requirement for autonomously navigating systems. Vision-based 3D panoptic occupancy prediction aims to provide a 3D representation of the surroundings, including semantic meaning and the identification of individual objects such as traffic participants in the context of urban navigation. The majority of vision-based approaches to occupancy prediction require 3D voxel labels or segmented LiDAR scans as a supervision signal. While other vision-based approaches use only a few consecutive images for supervision, these approaches typically do not provide instance-level information, which is crucial for achieving a holistic understanding of the scene. In this paper, we propose a novel method for 3D panoptic occupancy prediction that relies solely on image data for both training and inference. We use bundle adjustment to align all available images in the training set to obtain depth information. We further use a pre-trained open-vocabulary image model to obtain panoptic segmentation of the RGB images and generate occupancy pseudo labels to directly optimize for the 3D panoptic occupancy prediction task. Furthermore, we use a 3D foundation model to obtain depth predictions for individual images to add dynamic objects into the pseudo labels. Without any manual or LiDAR-based annotations, our approach outputs occupancy, semantic class, and instance ID for each 3D voxel in the full voxel grid. We achieve state-of-the-art results on 3D semantic occupancy prediction among label-free methods, and we propose the first method for 3D panoptic occupancy without any LiDAR supervision.
|
| |
| 09:00-10:30, Paper TuI1I.78 |
| Seeing Motion, Generating Action: Explicit Motion-Aware Policy for Robotic Action Generation |
|
| Li, Yixiong | Sun Yat-Sen University |
| Zhang, Ye | Sun Yat-Sen University |
| Pei, Yun | Sun Yat-Sen University |
| Zhang, Yongjian | Sun Yat-Sen University |
| Zhang, Ruimao | The Chinese University of Hong Kong (Shenzhen) |
| Guo, Yulan | Sun Yat-Sen University |
Keywords: Imitation Learning, Learning from Demonstration
Abstract: Imitation learning (IL) offers a scalable framework for teaching robots complex manipulation skills from human demonstrations. However, conventional end-to-end visuomotor IL models often suffer from poor performance and robustness due to the significant modality mismatch between high-dimensional visual inputs and low-dimensional motor actions. Redundant information in RGB images, such as the color of ambient light, leads models to depend on strong yet brittle task-irrelevant priors that ultimately degrade performance across diverse visual environments. To address these limitations, we propose the Motion-Aware Two-Stream Policy (MTP), a novel imitation learning architecture that explicitly incorporates motion priors via optical flow alongside RGB observations. MTP employs a two-stream perception module that separately encodes spatial (RGB) and temporal (optical flow) information. These spatial-temporal features are fused and fed into a conditional flow matching module to generate actions. We evaluate MTP extensively in both simulation and real-world robot tasks. Results show that MTP significantly outperforms state-of-the-art baselines in terms of success rate and robustness to visual perturbations, demonstrating its effectiveness in generalizable robotic manipulation. To benefit the community, our code will be released.
|
| |
| 09:00-10:30, Paper TuI1I.79 |
| Multi-Modal Affordance Planner with Temporal-Context Action Policy for Long-Horizon Bimanual Robot Manipulation |
|
| Oh, Ji-Heon | Kyung Hee University |
| Jung, Danbi | Kyung Hee University |
| Espinoza, Ismael | Kyung Hee University |
| Choi, Yong-Hyeok | Kyung Hee University |
| Kim, YoungOuk | Korea Electronics Technology Institute |
| Shin, Dongin | Korea Electronics Technology Institute |
| Moon, JongSul | Korea Electronics Technology Institute |
| Kim, Wonha | Kyung Hee University |
| Kim, Tae-Seong | Kyung Hee University |
Keywords: Dual Arm Manipulation, Reinforcement Learning, Dexterous Manipulation
Abstract: Bimanual robot manipulation for long-horizon (LH) tasks is crucial for the practical use of humanoids, but it struggles with robust planning and generalization. Approaches based on Task and Motion Planning (TAMP), transformers, and Large Language Models (LLMs) suffer from critical limitations, including costly human demonstrations, task-planner hallucination, and unsatisfactory generalization performance. To address these challenges, this paper introduces the Multi-modal Affordance Planner with Temporal-Context Action Policy (MAP-TCA), a novel hierarchical framework that learns and performs diverse bimanual LH tasks by generating action plans with the MAP. The MAP-TCA consists of a planner based on an LLM enhanced with Bimanual Robot Manipulation Retrieval-Augmented Generation (Bi-RAG) and a low-level Temporal-Context Action Policy (TCA). With multimodal inputs including vision, language, and affordance for primitive-action demonstration, Bi-RAG generates a Primitive Action (PA)-specific embedded space. Then, the MAP generates LH plans, LH demonstrations, and reward functions within the PA-specific embedded space, thereby mitigating hallucinations and reducing training cost. The generated plans, demonstrations, and rewards then guide the TCA, which learns the LH tasks via behavior cloning (BC) and online fine-tuning. We demonstrate that the proposed MAP-TCA achieves an average success rate of 86.75%, comparable to a baseline model, TCA, which is trained extensively on direct human demonstrations and manually designed rewards. Our work presents a scalable and generalizable solution for complex bimanual LH manipulation, significantly reducing the dependency on human supervision.
|
| |
| 09:00-10:30, Paper TuI1I.80 |
| ModalPatch: A Plug-And-Play Module for Robust Multi-Modal 3D Object Detection under Modality Drop |
|
| Li, Shuangzhi | University of Alberta |
| Ma, Lei | The University of Tokyo & University of Alberta |
| Li, Xingyu | University of Alberta |
Keywords: Object Detection, Segmentation and Categorization, Sensor Fusion, Computer Vision for Automation
Abstract: Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions, where modalities can momentarily drop due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous modality drop, where the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions. Code will be available at https://github.com/Castiel-Lee/MM3Det_MD.
|
| |
| 09:00-10:30, Paper TuI1I.81 |
| LEAR: Learning Edge-Aware Representations for Event-To-LiDAR Localization |
|
| Chen, Kuangyi | Graz University of Technology |
| Zhang, Jun | Graz University of Technology |
| Hu, Yuxi | Graz University of Technology |
| Zhou, Yi | Hunan University |
| Fraundorfer, Friedrich | Graz University of Technology |
Keywords: Deep Learning for Visual Perception, Localization
Abstract: Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event–depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.
|
| |
| 09:00-10:30, Paper TuI1I.82 |
| UrbanHuRo: A Two-Layer Human-Robot Collaboration Framework for the Joint Optimization of Heterogeneous Urban Services |
|
| Dey, Tonmoy | Florida State University |
| Jiang, Lin | Florida State University |
| Dong, Zheng | Wayne State University |
| Wang, Guang | Florida State University |
Keywords: Automation Technologies for Smart Cities
Abstract: In the vision of smart cities, technologies are being developed to enhance the efficiency of urban services and improve residents' quality of life. However, most existing research focuses on optimizing individual services in isolation, without adequately considering the reciprocal interactions among heterogeneous urban services that could yield higher efficiency and improved resource utilization. For example, human couriers could collect traffic and air quality data along their delivery routes, while sensing robots could assist with on-demand delivery during peak hours, enhancing both sensing coverage and delivery efficiency. However, the joint optimization of different urban services is challenging due to their potentially conflicting objectives and real-time coordination in dynamic environments. In this paper, we propose UrbanHuRo, a two-layer human-robot collaboration framework for joint optimization of heterogeneous urban services, demonstrated through the examples of crowdsourced delivery and urban sensing. There are two innovative designs in UrbanHuRo, i.e., (i) a scalable distributed MapReduce-based K-Submodular maximization module for efficient order dispatch and (ii) a deep submodular reward reinforcement learning algorithm for sensing route planning. Experimental evaluations on real-world datasets from a food delivery platform demonstrate that our UrbanHuRo improves sensing coverage by 29.7% and courier income by 39.2% on average in most settings, while also significantly reducing the number of overdue orders.
|
| |
| 09:00-10:30, Paper TuI1I.83 |
| A Multimodal Stochastic Planning Approach for Navigation and Multi-Robot Coordination |
|
| Gonzales, Mark | Johns Hopkins University |
| Oh, Ethan | Johns Hopkins University |
| Moore, Joseph | Johns Hopkins University |
Keywords: Multi-Robot Systems, Swarm Robotics
Abstract: In this paper, we present a receding-horizon, sampling-based planner capable of reasoning over multimodal policy distributions. By using the cross-entropy method to optimize a multimodal policy under a common cost function, our approach increases robustness against local minima and promotes effective exploration of the solution space. We show that our approach naturally extends to multi-robot collision-free planning, enables agents to share diverse candidate policies to avoid deadlocks, and allows teams to minimize a global objective without incurring the computational complexity of centralized optimization. Numerical simulations demonstrate that employing multiple modes significantly improves success rates in trap environments and in multi-robot collision avoidance. Hardware experiments further validate the approach's real-time feasibility and practical performance.
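The cross-entropy method at the core of this planner can be sketched in its basic single-mode form (the paper maintains a multimodal policy distribution; this minimal 1D version with a toy cost is an illustration under assumed parameters, not the authors' implementation): repeatedly sample from a Gaussian, keep the lowest-cost elite samples, and refit the Gaussian to the elites.

```python
import random
import statistics

random.seed(0)  # make the sketch reproducible

def cem_minimize(cost, mu=0.0, sigma=5.0, n=200, n_elite=20, iters=30):
    """Minimal cross-entropy method: sample, keep elites, refit the Gaussian."""
    for _ in range(iters):
        samples = [random.gauss(mu, sigma) for _ in range(n)]
        samples.sort(key=cost)
        elites = samples[:n_elite]
        mu = statistics.mean(elites)
        sigma = statistics.pstdev(elites) + 1e-6  # floor avoids total collapse
    return mu

# A toy cost with two basins: a local minimum at x = -2 (cost 0.5)
# and the global minimum at x = 3 (cost 0).
cost = lambda x: min((x + 2) ** 2 + 0.5, (x - 3) ** 2)
best = cem_minimize(cost)
```

A single Gaussian commits to one basin; the abstract's point is that optimizing several such modes under a common cost keeps alternatives alive and avoids exactly this kind of commitment to a local minimum.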
|
| |
| 09:00-10:30, Paper TuI1I.84 |
| EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation |
|
| Chopra, Samarth | University of Maryland, College Park |
| McMoil, Alexander | University of Pittsburgh |
| Carnovale, Benjamin | University of Pittsburgh |
| Sokolson, Evan | University of Pittsburgh |
| Kubendran, Rajkumar | University of Pittsburgh |
| Dickerson, Samuel | University of Pittsburgh |
Keywords: AI-Enabled Robotics, Deep Learning Methods, Learning from Demonstration
Abstract: While Vision–Language–Action (VLA) models map visual inputs and language instructions directly to robot actions, they often rely on costly hardware and struggle in novel or cluttered scenes. We introduce EverydayVLA, a 6-DOF manipulator that can be assembled for $300, capable of modest payloads and workspaces. A single unified model jointly outputs discrete and continuous actions, and our adaptive-horizon ensembler monitors motion uncertainty to trigger on-the-fly replanning for safe, reliable operation. On LIBERO, EverydayVLA matches state-of-the-art success rates, and in real-world tests it outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution. By combining a state-of-the-art VLA with cost-effective hardware, EverydayVLA democratizes access to a robotic foundation model, and paves the way for economical use in homes and research labs alike.
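The adaptive-horizon idea, monitoring motion uncertainty to decide when to replan, can be caricatured as follows (a generic sketch under assumed names and thresholds, not the EverydayVLA ensembler): commit to the predicted action chunk only while the ensemble members agree, and cut the horizon short where they diverge.

```python
import statistics

def committed_horizon(ensemble_chunks, max_disagreement=0.05):
    """Return how many steps of an action chunk to execute before replanning.
    `ensemble_chunks` holds one predicted action sequence per ensemble member
    (1-D actions here for simplicity); we commit up to, but not including,
    the first step where the members disagree by more than the threshold."""
    horizon = 0
    for step_actions in zip(*ensemble_chunks):
        if statistics.pstdev(step_actions) > max_disagreement:
            break
        horizon += 1
    return max(horizon, 1)  # always execute at least one step

# Three ensemble members agree for two steps, then diverge at step three.
chunks = [[0.10, 0.20, 0.90],
          [0.11, 0.21, 0.10],
          [0.09, 0.19, 0.50]]
```

Here `committed_horizon(chunks)` commits to the first two steps and forces a replan before the third, which is the "on-the-fly replanning under motion uncertainty" behavior the abstract describes.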
|
| |
| 09:00-10:30, Paper TuI1I.85 |
| Estimating Force Interactions of Deformable Linear Objects from Their Shapes |
|
| Chen, Qi Jing | Nanyang Technological University |
| Shan, Shilin | Nanyang Technological University |
| Bretl, Timothy | University of Illinois at Urbana-Champaign |
| Pham, Quang-Cuong | NTU Singapore |
Keywords: Tendon/Wire Mechanism, Flexible Robotics, Force and Tactile Sensing
Abstract: This work introduces an analytical approach for detecting and estimating external forces acting on deformable linear objects (DLOs) using only their observed shapes. In many robot-wire interaction tasks, contact occurs not at the end-effector but at other points along the robot’s body. Such scenarios arise when robots manipulate wires indirectly (e.g., by nudging) or when wires act as passive obstacles in the environment. Accurately identifying these interactions is crucial for safe and efficient trajectory planning, helping to prevent wire damage, avoid restricted robot motions, and mitigate potential hazards. Existing approaches often rely on expensive external force-torque sensors or assume that contact occurs at the end-effector for accurate force estimation. Using wire-shape information acquired from a depth camera and under the assumption that the wire is at or near static equilibrium, our method estimates both the location and magnitude of external forces without additional prior knowledge. This is achieved by exploiting derived consistency conditions and solving a system of linear equations based on force-torque balance along the wire. The approach was validated through simulation, where it achieved high accuracy, and through real-world experiments, where accurate estimation was demonstrated in selected interaction scenarios.
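The linear force-torque balance idea can be illustrated in miniature (a 2D least-squares toy with assumed names, not the paper's derived consistency conditions): once a candidate contact location is fixed, moment balance about several points along the wire is linear in the unknown force components, so the force follows from a small least-squares solve.

```python
def estimate_point_force(contact, refs, torques):
    """Least-squares fit of a 2D point force F = (Fx, Fy) applied at `contact`,
    given moment-balance measurements tau_j about reference points refs[j]:
        tau_j = (cx - rx) * Fy - (cy - ry) * Fx
    Solves the 2x2 normal equations directly (no external libraries)."""
    cx, cy = contact
    a00 = a01 = a11 = b0 = b1 = 0.0
    for (rx, ry), tau in zip(refs, torques):
        # Row j of the design matrix acting on (Fx, Fy).
        ax, ay = -(cy - ry), (cx - rx)
        a00 += ax * ax; a01 += ax * ay; a11 += ay * ay
        b0 += ax * tau; b1 += ay * tau
    det = a00 * a11 - a01 * a01
    return ((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)

# Synthetic check: recover F = (1.0, -2.0) applied at contact (0.5, 0.2)
# from noiseless torques measured about three points along the wire.
contact, F = (0.5, 0.2), (1.0, -2.0)
refs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
torques = [(contact[0] - rx) * F[1] - (contact[1] - ry) * F[0] for rx, ry in refs]
Fx, Fy = estimate_point_force(contact, refs, torques)
```

With noiseless measurements the fit recovers the applied force exactly; with shape noise, the overdetermined balance equations average the error out, which is what makes the shape-only formulation viable.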
|
| |
| 09:00-10:30, Paper TuI1I.86 |
| AdapGrasp: A Stiffness and Grasp Affordance Dataset with a Transformer-Based Adaptive Grasp Model |
|
| Pu, Menghao | Huazhong University of Science and Technology |
| Han, Chaoqun | Huazhong University of Science and Technology |
| Chai, Zhiping | Huazhong University of Science and Technology |
| Zhao, Tiyong | Huazhong University of Science and Technology |
| Wu, Dunxuan | Huazhong University of Science and Technology |
| Wen, Pu | Huazhong University of Science and Technology |
| Ke, Xingxing | Fuzhou University |
| Ding, Han | Huazhong University of Science and Technology |
| Wu, Zhigang | Huazhong University of Science and Technology |
Keywords: Data Sets for Robot Learning, Deep Learning in Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: Robotic grasping has been employed in various industrial, household, and medical applications. However, prevailing strategies predominantly emphasize the gripper’s initial state upon reaching the grasp position while neglecting the final grasping state and the object's stiffness and affordance, and they often fail due to damage or grasp slippage. Here, we propose an AdapGrasp strategy with a dataset named AdapGraspDataset and a corresponding model named AdapGraspNet. The dataset focuses on object stiffness and grasp affordance. Specifically, for objects with different stiffness properties, the corresponding final grasp width (FGW) is annotated to ensure the object's intactness. For objects’ grasp-affordance properties, a higher grasp affordance weight (GAW) is typically annotated closer to the centroid, increasing grasp stability. Meanwhile, to output the set of grasping configurations (initial and final grasp states) more accurately, a denoising principle is introduced to build a corresponding transformer-based model. It enables more accurate convergence of FGW and GAW, achieving a precision of 98.04% and a mean absolute final-width error of 2.71 pixels. Finally, extensive real-world experiments are conducted, where the AdapGrasp strategy ensures the intactness of fragile objects and thus enhances grasping stability without any additional sensors. It achieves a grasping accuracy of 95% and yields a 19.5% improvement compared with those without FGW and GAW. The AdapGrasp strategy is publicly available at https://embodied-soft-intelligence.github.io/AdapGrasp/.
|
| |
| 09:00-10:30, Paper TuI1I.87 |
| Shell-Type Soft Jig for Holding Objects During Disassembly |
|
| Kiyokawa, Takuya | The University of Osaka |
| Takebayashi, Ryunosuke | The University of Osaka |
| Harada, Kensuke | The University of Osaka |
Keywords: Soft Robot Applications, Intelligent and Flexible Manufacturing, Disassembly
Abstract: This study addresses a flexible holding tool for robotic disassembly. We propose a shell-type soft jig that securely and universally holds objects, mitigating the risk of component damage and adapting to diverse shapes while enabling soft fixation that is robust to recognition, planning, and control errors. The balloon-based holding mechanism ensures proper alignment and stable holding performance, thereby reducing the need for dedicated jig design, highly accurate perception, precise grasping, and finely tuned trajectory planning that are typically required with conventional fixtures. Our experimental results demonstrate the practical feasibility of the proposed jig through performance comparisons with a vise and a jamming-gripper-inspired soft jig. Tests on ten different objects further showed representative successes and failures, clarifying the jig's limitations and outlook.
|
| |
| 09:00-10:30, Paper TuI1I.88 |
| A Teleoperated Control for Robot-Aided Percutaneous Surgery: An Application to Needle Insertion in Nephrolithotomy |
|
| Lauretti, Clemente | Università Campus Bio-Medico di Roma |
| Morfino, Rosaura | Campus Bio-medico University of Rome |
| Cocco, Francesco | Campus Bio-Medico University of Rome |
| Prata, Francesco | Department of Urology, Fondazione Policlinico Universitario Campus Bio-Medico |
| Papalia, Rocco | Department of Urology, Fondazione Policlinico Universitario Campus Bio-Medico |
| Zollo, Loredana | Università Campus Bio-Medico di Roma |
|
|
| |
| 09:00-10:30, Paper TuI1I.89 |
| LSADS-Gaussian: Gaussian Splatting for Large-Scale Autonomous Driving Scene Reconstruction |
|
| Wang, Ping | Tongji University |
| Li, Ben | Tongji University |
| Qian, Bo | The University of Tokyo |
| Jin, Chuan | Tongji University |
| Tian, Can | Geely Auto Research Institute (Ningbo) Co., LTD |
| Ji, Yusheng | National Institute of Informatics |
Keywords: Computer Vision for Automation, Computer Vision for Transportation, Visual Learning
Abstract: The rapid advancement of 3D scene understanding techniques presents a significant opportunity for enhancing autonomous driving simulation systems. As these systems are increasingly required to operate in complex, large-scale, and unbounded real-world environments, efficient and high-fidelity 3D reconstruction of common outdoor scenes has become a critical prerequisite for realistic and extensible autonomous driving simulation. 3D Gaussian Splatting has achieved state-of-the-art performance in novel view synthesis, coupled with real-time rendering efficiency. However, large-scale reconstruction for autonomous driving scenarios faces several challenges as scenes grow in complexity: (1) limited views with insufficient pose diversity, (2) inadequate representation of geometric structural details, and (3) complex lighting conditions involving saturation and shadow variations. To cope with these challenges, we propose LSADS-Gaussian, a novel model for large-scale autonomous driving scene reconstruction. The model consists of a Multimodal Gaussian Network (MGN) module composed of two Gaussian sub-networks, designed to perform Gaussian aggregation and optimization from multi-sensor data; a Geometric Representation Guidance (GRG) module that refines and enhances geometric consistency; and a Lighting Enhancement (LE) module that introduces learnable illumination coefficients to maintain illumination consistency. Extensive experiments show that LSADS-Gaussian outperforms the state-of-the-art methods.
|
| |
| 09:00-10:30, Paper TuI1I.90 |
| Control of Humanoid Robots with Parallel Mechanisms Using Differential Actuation Models |
|
| Lutz, Victor | LAAS-CNRS |
| De Matteïs, Ludovic | LAAS-CNRS |
| Batto, Virgile | LAAS-CNRS |
| Mansard, Nicolas | CNRS |
Keywords: Humanoid and Bipedal Locomotion, Parallel Robots, Reinforcement Learning
Abstract: Several recently released humanoid robots, inspired by the mechanical design of Cassie, employ actuator configurations in which the motors are displaced from the joints to reduce leg inertia. While studies accounting for the full kinematic complexity have demonstrated the benefits of these designs, the associated loop-closure constraints greatly increase computational cost and limit their use in control and learning. As a result, the non-linear transmission is often approximated by a constant reduction ratio, preventing exploitation of the mechanism’s full capabilities. This paper introduces a compact analytical formulation for the two standard knee and ankle mechanisms that captures the exact non-linear transmission while remaining computationally efficient. The model is fully differentiable up to second order with a minimal formulation, enabling low-cost evaluation of dynamic derivatives for trajectory optimization and of the apparent transmission impedance for reinforcement learning. We integrate this formulation into trajectory optimization and locomotion policy learning, and compare it against simplified constant-ratio approaches. Hardware experiments demonstrate improved accuracy and robustness, showing that the proposed method provides a practical means to incorporate parallel actuation into modern control algorithms.
|
| |
| 09:00-10:30, Paper TuI1I.91 |
| Augmenting the Reach: Visualizing Robotic Working Volume at the Tool Tip for Intuitive Retinal Access in Eye Surgery |
|
| Yang, Junjie | TUM |
| Inagaki, Satoshi | NSK.Ltd |
| Zhao, Zhihao | Technische Universität München |
| Zapp, Daniel | Klinikum Rechts Der Isar Der TU München |
| Huang, Kai | Sun Yat-Sen University |
| Nasseri, M. Ali | Technische Universitaet Muenchen |
Keywords: Medical Robots and Systems, Vision-Based Navigation, Surgical Robotics: Planning
Abstract: Retinal Surgery Robotics is a rapidly emerging field that offers enhanced precision by overcoming human tremors. A key trend of these robotic designs is toward more compact and lightweight structures for improved positioning accuracy and precise force delivery. However, this compactness sacrifices the robot's working volume, making it difficult for ophthalmic surgeons to intuitively assess if retinal targets are accessible by the surgical tool tip. This paper proposes a methodology for visualizing the actual accessible area in the microscopic view to provide surgeons with an intuitive visual guide of the tool's reach, reducing uncertainty and streamlining extraocular robotic maneuvers. We validated this method on a commercial phantom with a surgical robot system, achieving <=1.0 deg error for 83.3% of tested points across four retinal subareas and demonstrating its clinical potential.
|
| |
| 09:00-10:30, Paper TuI1I.92 |
| CDV-SLAM: Compact Deep Visual SLAM with Unified Semantic and Geometric Perception |
|
| Fan, Ya | Beihang University |
| Lang, Rongling | Beihang University |
Keywords: Deep Learning Methods, SLAM, Localization
Abstract: Robust monocular visual Simultaneous Localization and Mapping (SLAM) serves as a cornerstone for various applications. However, its performance frequently suffers degradation in challenging scenarios including fast motion, dynamic objects, and scale ambiguity. This paper proposes CDV-SLAM, a compact deep visual SLAM framework that unifies geometric and semantic perception through a shared visual foundation model. A tight semantic-geometric fusion network is devised to predict optical flow in fast motion. Semantic features are efficiently reused to obtain segmentation and monocular depth for dynamic objects exclusion and scale acquisition. To further address scale drift, we introduce local scale correction in bundle adjustment. Experimental results demonstrate a 42% decrease in average Absolute Trajectory Error (ATE) on the KITTI dataset over the state-of-the-art. Furthermore, our flow-only visual odometry surpasses geometric-only methods on the TartanAir and EuRoC datasets, with a marginal speed reduction of 6%. Our code is publicly available at https://github.com/FrankYard/CDV-SLAM.
|
| |
| 09:00-10:30, Paper TuI1I.93 |
| Diffusion-Guided Generalizable Enhancer for Urban Scene Reconstruction |
|
| Che, Henry | University of Illinois, Urbana-Champaign |
| Wang, Jingkang | Waabi, University of Toronto |
| Chen, Yun | University of Toronto |
| Yang, Ze | Waabi, University of Toronto |
| Manivasagam, Siva | Waabi, University of Toronto |
| Urtasun, Raquel | Waabi, University of Toronto |
Keywords: Computer Vision for Automation, Autonomous Vehicle Navigation, Simulation and Animation
Abstract: Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting the applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe efficiently produces robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints (e.g., lane changes). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.
|
| |
| 09:00-10:30, Paper TuI1I.94 |
| Action Sequence Transfer Via LLMs for Heterogeneous Environments |
|
| Chung, Choong Ho | Korea Advanced Institute of Science and Technology |
| Shin, DongHwan | Korea Advanced Institute of Science and Technology |
| Lee, Sung-Hee | Korea Advanced Institute of Science and Technology (KAIST) |
Keywords: AI-Based Methods, Agent-Based Systems, Autonomous Agents
Abstract: We present an action sequence transfer system that adaptively transfers user action sequences across different target spaces. Given an input action sequence from a source space and scene graph representations of both the source and target environments, our system predicts a corresponding action sequence in the target space by adapting to the spatial and object constraints of the new environment. To achieve this, we leverage multi-level representations of user activity to generalize actions at varying levels of abstraction. To demonstrate our system, we collect a new scene graph-based dataset derived from the Ego4D GoalStep dataset for evaluation. Results indicate that our system can generate valid action sequences even between spaces with drastically different object configurations.
|
| |
| 09:00-10:30, Paper TuI1I.95 |
| ClearDepth: Efficient Stereo Perception of Transparent Objects for Robotic Manipulation |
|
| Bai, Kaixin | University of Hamburg |
| Zeng, Huajian | MBZUAI |
| Zhang, Lei | University of Hamburg |
| Liu, Yiwen | Technical University of Munich, Agile Robots |
| Xu, Hongli | TU Munich |
| Chen, Zhaopeng | University of Hamburg |
| Zhang, Jianwei | University of Hamburg |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Data Sets for Robotic Vision
Abstract: Transparent object depth perception remains a major challenge in robotics and logistics due to the limitations of standard 3D sensors in capturing accurate depth on transparent and reflective surfaces. This affects applications relying on depth maps and point clouds, particularly in robotic manipulation. To address this, we propose ClearDepth, a vision transformer-based algorithm for stereo depth recovery of transparent objects, enhanced by a novel feature post-fusion module that refines depth estimation using structural visual features. To mitigate the high costs of stereo dataset collection, we introduce a physically realistic, domain-adaptive Sim2Real framework for efficient data generation. Our method outperforms state-of-the-art stereo matching approaches on transparent depth recovery. Furthermore, in transparent object grasping experiments, ClearDepth improves transparent-scene perception and achieves at least an 18% higher grasp success rate compared to the state-of-the-art methods for transparent object manipulation. Our method demonstrates strong Sim2Real generalization, enabling precise depth perception of transparent objects for robotic applications in the real world. Dataset and project details are available at https://sites.google.com/view/cleardepth-anonymous.
|
| |
| 09:00-10:30, Paper TuI1I.96 | Add to My Program |
| TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning |
|
| Pham, Phu | Purdue University |
| Conover, Damon | DEVCOM Army Research Laboratory |
| Bera, Aniket | Purdue University |
Keywords: Aerial Systems: Perception and Autonomy, Localization, Deep Learning for Visual Perception
Abstract: Aerial–ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird’s-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial–ground localization in both synthetic and real-world settings.
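The contrastive module described above enforces a shared LiDAR–aerial embedding space. Purely as an illustrative sketch (the paper's exact loss is not given in the abstract, and the data below are hypothetical), a standard InfoNCE-style objective over paired embeddings looks like:

```python
import numpy as np

def info_nce(z_lidar, z_aerial, tau=0.07):
    """One direction of a standard InfoNCE contrastive loss between
    paired embeddings; row i of z_lidar and row i of z_aerial form a
    positive pair. Illustrative only, not the paper's exact loss."""
    z1 = z_lidar / np.linalg.norm(z_lidar, axis=1, keepdims=True)
    z2 = z_aerial / np.linalg.norm(z_aerial, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))               # pull positives together

# Perfectly aligned embeddings give a near-zero loss (hypothetical data)
loss = info_nce(np.eye(4), np.eye(4))
```

Minimizing this loss pushes each LiDAR embedding toward its paired aerial embedding and away from the other rows in the batch, which is the alignment effect the abstract attributes to its contrastive module.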
|
| |
| 09:00-10:30, Paper TuI1I.97 | Add to My Program |
| RoboHitch: Learning Visual Affordance from Disordered Keypoints for Hitch Knots Tying |
|
| Zuo, Jiahui | The Hong Kong University of Science and Technology |
| Zhang, Boyang | The Hong Kong University of Science and Technology |
| Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Manipulation Planning
Abstract: Robotic manipulation of deformable linear objects (DLOs) presents significant challenges due to complex dynamics and frequent self-occlusions. Existing robotic knot tying methods typically rely on precise topological state tracking with ordered keypoints and explicit edge connectivity. This reliance makes them prone to failures due to tracking drift and topology mismatch caused by repeated bending and crossings during knot formation. To address these limitations, we introduce RoboHitch, a novel framework that learns to perform hitch knot tying from human demonstrations using only disordered 3D keypoints and RGB images. This eliminates the need for explicit topological order, allowing for more flexible manipulation. Our method employs a dynamic Graph Autoencoder to extract geometric features from untracked keypoints, complemented by a Convolutional Autoencoder that captures essential visual context. A bidirectional cross-attention mechanism then fuses these modalities to jointly predict pick and place affordances, facilitating implicit reasoning about the rope's state and enabling knot tying under occlusion. Real-world experiments demonstrate the effectiveness and generalizability of our approach, successfully completing hitch knots in scenarios with self-occlusions.
|
| |
| 09:00-10:30, Paper TuI1I.98 | Add to My Program |
| A Passivity-Based Framework for Dynamic Arbitration between Trajectory and Force Tracking Using Human Demonstration |
|
| Yun, Yeoil | Sungkyunkwan University |
| Kim, Youngwuk | Sungkyunkwan University |
| Gwak, Junchul | Sungkyunkwan University |
| Moon, Hyungpil | Sungkyunkwan University |
| Choi, Hyouk Ryeol | Sungkyunkwan University |
| Koo, Ja Choon | Sungkyunkwan University |
Keywords: Physical Human-Robot Interaction, Learning from Demonstration, Intention Recognition
Abstract: Learning from Demonstration (LfD) for contact-rich tasks faces a fundamental challenge: arbitrating between tracking a demonstrated trajectory and reproducing an interaction force. This paper introduces a novel one-shot LfD framework that resolves this conflict by leveraging the operator's grip force as an intuitive, continuous signal for arbitration. This signal allows the controller to seamlessly transition between a trajectory-tracking impedance controller and a force-tracking admittance controller, prioritizing path accuracy when the demonstrated grip was light and interaction force fidelity when it was firm. To ensure verifiably safe interaction, the adaptive control law is integrated within a dual-layer passivity assurance framework. This mechanism intelligently distributes potentially non-passive energy between an energy tank and adaptive null-space dissipation to guarantee energetic stability. The proposed framework was experimentally validated on a 7-DOF manipulator, demonstrating that the controller autonomously reproduces interaction forces and shows significant robustness against environmental position uncertainties, a scenario where conventional impedance controllers can fail.
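The energy-tank idea referenced above can be sketched generically: a tank stores dissipated energy, and a requested non-passive power injection is granted only while the tank stays above a floor. This is a minimal illustration under assumed names and thresholds, not the paper's dual-layer passivity scheme:

```python
def update_tank(E, power_in, power_out, dt, E_min=0.1, E_max=10.0):
    """One step of a generic energy-tank passivity layer (illustrative
    sketch; all parameter values are hypothetical).
    power_in:  dissipated power refilling the tank (>= 0)
    power_out: non-passive power the controller wants to inject (>= 0)"""
    E_next = E + (power_in - power_out) * dt
    if E_next < E_min:                 # tank would be depleted: deny injection
        power_out = 0.0
        E_next = E + power_in * dt
    return min(E_next, E_max), power_out  # cap stored energy at E_max

# A large injection request from a nearly empty tank is denied
E, allowed = update_tank(E=0.2, power_in=0.0, power_out=5.0, dt=0.1)
```

Denying the injection when the tank hits its floor is what guarantees that the closed loop never generates energy, i.e., stays passive.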
|
| |
| 09:00-10:30, Paper TuI1I.99 | Add to My Program |
| Self-Supervised Point Cloud Single Object Tracking |
|
| Liu, Yuheng | Nankai University |
| Hui, Le | Northwestern Polytechnical University |
| Zhu, Ziyue | Nankai University |
| Mei, Shaohui | Northwestern Polytechnical University |
| Zhang, Yigong | Nankai University |
| Xie, Jin | Nanjing University |
| Yang, Jian | Nankai University |
Keywords: Visual Tracking, Deep Learning Methods
Abstract: Point cloud single object tracking is critical in autonomous driving. However, current methods heavily rely on frame-by-frame human annotations, which do not scale well with the growing amount of unlabeled LiDAR data. In this paper, we propose the first self-supervised point cloud single object tracking framework, eliminating the need for any manual labels. Our method integrates motion, geometry, and semantic cues to generate plausible object proposals and tracks the target using a predictive filter. Specifically, we generate pseudo labels by clustering local motion patterns from scene flow, while pre-training a proposal network using point cloud forecasting as a proxy task to learn global motion patterns and geometric shape priors. Then, we train the proposal network using the initial pseudo labels and iteratively refine them by treating semantic features as evolving prototypes in each training round. Finally, a simple motion filter is employed to predict the target’s current state based on its past dynamics. Evaluated on KITTI, nuScenes, and Waymo, our self-supervised point cloud single object tracking approach is on par with—and in some cases outperforms—fully supervised trackers, demonstrating that self-supervision is a scalable path forward for 3D single object tracking.
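The "simple motion filter" is not specified in the abstract; a minimal constant-velocity predictor over past box centers, purely as an illustration (data and names hypothetical), could look like:

```python
import numpy as np

def predict_state(positions, dt=0.1):
    """Predict the next object position from past 3D box centers under a
    constant-velocity assumption (illustrative only; the paper's actual
    motion filter is not specified in the abstract)."""
    positions = np.asarray(positions, dtype=float)
    if len(positions) < 2:
        return positions[-1]          # no history: hold the last position
    velocity = (positions[-1] - positions[-2]) / dt
    return positions[-1] + velocity * dt

# Past box centers of the tracked object (hypothetical data)
history = [[0.0, 0.0, 0.0], [0.5, 0.1, 0.0], [1.0, 0.2, 0.0]]
pred = predict_state(history)         # extrapolates to [1.5, 0.3, 0.0]
```

Such a filter needs no labels at all, which is consistent with the self-supervised setting: it only extrapolates the target's own past dynamics.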
|
| |
| 09:00-10:30, Paper TuI1I.100 | Add to My Program |
| DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-And-Language Navigation |
|
| Wang, Yunheng | The Hong Kong University of Science and Technology (Guangzhou) |
| Fang, Yuetong | The Hong Kong University of Science and Technology (Guangzhou) |
| Wang, Taowen | The Hongkong University of Science and Technology (gz) |
| Feng, Yixiao | The Hong Kong University of Science and Technology (Guangzhou) |
| Tan, Yawen | Zhejiang Normal University |
| Zhang, Shuning | The Hong Kong University of Science and Technology (Guangzhou) |
| Liu, Peiran | Hong Kong University of Science and Technology (GuangZhou) |
| Ji, Yiding | Hong Kong University of Science and Technology (Guangzhou) |
| Xu, Renjing | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Vision-Based Navigation, AI-Enabled Robotics, Task and Motion Planning
Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE), which links language instructions to perception and control in the real world, is a core capability of embodied robots. Recently, large-scale pretrained foundation models have been leveraged as shared priors for perception, reasoning, and action, enabling zero-shot VLN without task-specific training. However, existing zero-shot VLN methods depend on costly perception and passive scene understanding, collapsing control to point-level choices. As a result, they are expensive to deploy, misaligned in action semantics, and short-sighted in planning. To address these issues, we present DreamNav, which focuses on the following three aspects: (1) to reduce sensory cost, our EgoView Corrector aligns viewpoints and stabilizes egocentric perception; (2) instead of point-level actions, our Trajectory Predictor favors global trajectory-level planning to better align with instruction semantics; and (3) to enable anticipatory and long-horizon planning, we propose an Imagination Predictor to endow the agent with proactive thinking capability. On VLN-CE and real-world tests, DreamNav sets a new zero-shot state-of-the-art (SOTA), outperforming the strongest egocentric baseline that uses extra information by up to 7.49% and 18.15% in terms of SR and SPL metrics. To our knowledge, this is the first zero-shot VLN method to unify trajectory-level planning and active imagination while using only egocentric inputs.
|
| |
| 09:00-10:30, Paper TuI1I.101 | Add to My Program |
| Curriculum Reinforcement Learning for Quadrotor Racing with Random Obstacles |
|
| Sun, Fangyu | Shanghai Jiaotong University |
| Li, Fanxing | Shanghai Jiao Tong University |
| Hu, Yu | Shanghai Jiao Tong University |
| Zhang, Linzuo | Shanghai Jiao Tong University |
| Liu, Yueqian | Hypershell Tech |
| Yu, Wenxian | Shanghai Jiao Tong University |
| Zou, Danping | Shanghai Jiao Tong University |
Keywords: Aerial Systems: Applications, Reinforcement Learning, Vision-Based Navigation
Abstract: Autonomous drone racing has attracted increasing interest as a research topic for exploring the limits of agile flight. However, existing studies primarily focus on obstacle-free racetracks, while the perception and dynamic challenges introduced by obstacles remain underexplored, often resulting in low success rates and limited robustness in real-world flight. To this end, we propose a novel vision-based curriculum reinforcement learning framework for training a robust controller capable of addressing unseen obstacles in drone racing. We combine multi-stage curriculum learning, domain randomization, and a multi-scene updating strategy to address the conflicting challenges of obstacle avoidance and gate traversal. Our end-to-end control policy is implemented as a single network, allowing high-speed flight of quadrotors in environments with variable obstacles. Both hardware-in-the-loop and real-world experiments demonstrate that our method achieves faster lap times and higher success rates than existing approaches, effectively advancing drone racing in obstacle-rich environments. The video and code are available at: https://github.com/SJTU-ViSYS-team/CRL-Drone-Racing.
|
| |
| 09:00-10:30, Paper TuI1I.102 | Add to My Program |
| A Real-Time Multi-Model Parametric Representation of Point Clouds |
|
| Gao, Yuan | Shanghai Jiao Tong University |
| Dong, Wei | Shanghai Jiao Tong University |
Keywords: Mapping, Object Detection, Segmentation and Categorization, Computational Geometry
Abstract: In recent years, parametric representations of point clouds have been widely applied in tasks such as memory-efficient mapping and multi-robot collaboration. Highly adaptive models, like spline surfaces or quadrics, are computationally expensive to detect and fit. In contrast, real-time methods, such as Gaussian mixture models or planes, have low degrees of freedom, making high accuracy with few primitives difficult. To tackle this problem, a multi-model parametric representation with real-time surface detection and fitting is proposed. Specifically, the Gaussian mixture model is first employed to segment the point cloud into multiple clusters. Then, flat clusters are selected and merged into planes or curved surfaces. Planes can be easily fitted and delimited by a 2D voxel-based boundary description method. Surfaces with curvature are fitted by B-spline surfaces, and the same boundary description method is employed. Through evaluations on multiple public datasets, the proposed surface detection exhibits greater robustness than the state-of-the-art approach, with a 3.78-fold improvement in efficiency. Meanwhile, this representation achieves a 2-fold gain in accuracy over Gaussian mixture models, operating at 36.4 fps on a low-power onboard computer.
|
| |
| 09:00-10:30, Paper TuI1I.103 | Add to My Program |
| Autoregressive End-To-End Planning with Time-Invariant Spatial Alignment and Multi-Objective Policy Refinement |
|
| Zhao, Jianbo | University of Science and Technology of China |
| Ban, Taiyu | University of Science and Technology of China |
| Li, Xiangjie | Chongqing Afari Intelligent Drive Co., Ltd |
| Gui, Xingtai | Chongqing Afari Intelligent Drive Co., Ltd |
| Zhou, Hangning | Chongqing Afari Intelligent Drive Co., Ltd |
| Liu, Lei | University of Science and Technology of China |
| Hongwei, Zhao | University of Science and Technology of China |
| Li, Bin | University of Science and Technology of China |
Keywords: Intelligent Transportation Systems, Deep Learning Methods
Abstract: The inherent sequential modeling capabilities of autoregressive models make them a formidable baseline for end-to-end planning in autonomous driving. Nevertheless, their performance is constrained by a spatio-temporal misalignment, as the planner must condition future actions on past sensory data. This creates an inconsistent worldview, limiting the upper bound of performance for an otherwise powerful approach. To address this, we propose a Time-Invariant Spatial Alignment (TISA) module that learns to project initial environmental features into a consistent ego-centric frame for each future time step, effectively correcting the agent's worldview without explicit future scene prediction. In addition, we employ a kinematic action prediction head (i.e., acceleration and yaw rate) to ensure physically feasible trajectories. Finally, we introduce a multi-objective post-training stage using Direct Preference Optimization (DPO) to move beyond pure imitation. Our approach provides targeted feedback on specific driving behaviors, offering a more fine-grained learning signal than the single, overall objective used in standard DPO. Our model achieves a state-of-the-art 89.8 PDMS on the NAVSIM dataset among autoregressive models.
|
| |
| 09:00-10:30, Paper TuI1I.104 | Add to My Program |
| Scaling Single Human Demonstrations for Imitation Learning Using Generative Foundational Models |
|
| Heppert, Nick | University of Freiburg |
| Nguyen, Minh Quang | University of Freiburg |
| Valada, Abhinav | University of Freiburg |
Keywords: Learning from Demonstration, Deep Learning in Grasping and Manipulation, Simulation and Animation
Abstract: Imitation learning is a popular paradigm to teach robots new tasks, but collecting robot demonstrations through teleoperation or kinesthetic teaching is tedious and time-consuming. In contrast, directly demonstrating a task using our human embodiment is much easier and data is available in abundance, yet transfer to the robot can be non-trivial. In this work, we propose Real2Gen to train a manipulation policy from a single human demonstration. Real2Gen extracts required information from the demonstration and transfers it to a simulation environment, where a programmable expert agent can demonstrate the task arbitrarily many times, generating an unlimited amount of data to train a flow matching policy. We evaluate Real2Gen on human demonstrations from three different real-world tasks and compare it to a recent baseline. Real2Gen shows an average increase in the success rate of 26.6% and better generalization of the trained policy due to the abundance and diversity of training data. We further deploy our purely simulation-trained policy zero-shot in the real world. We make the data, code, and trained models publicly available at real2gen.cs.uni-freiburg.de.
|
| |
| 09:00-10:30, Paper TuI1I.105 | Add to My Program |
| DepthMesh: A Dual-End Complementary Online Depth Estimation and Mesh Reconstruction |
|
| Yang, Jiaqi | Information Engineering University |
| Fan, Dazhao | Information Engineering University |
| Yang, Xingbin | Vivo Central Research Institute |
| Yang, Jiabin | ByteDance Inc |
| Ji, Song | Information Engineering University |
| Dong, Yang | Information Engineering University |
| Li, Ming | Information Engineering University |
| Wang, Aosheng | Information Engineering University |
Keywords: Computational Geometry, Computer Vision for Automation, Collision Avoidance
Abstract: We present a novel dual-end complementary method for online depth estimation and mesh reconstruction, termed DepthMesh. Unlike most existing state-of-the-art methods, which produce either depth maps online or surface meshes offline, our method tightly couples online multiview depth estimation and Truncated Signed Distance Function (TSDF) reconstruction to achieve fast online mesh reconstruction. For each keyframe from 6DoF tracking, we first obtain the prior depth and normal maps via ultra-fast raycasting from the TSDF, which is incrementally fused from historical keyframe depths. Then, these priors, combined with segmentation results, are used to generate local planar hypotheses that optimize both depth accuracy and computational efficiency. Finally, the optimized depth estimates further enhance the accuracy of mesh reconstruction. Through this dual-end complementary mechanism, our system achieves high accuracy and efficiency. Experiments with qualitative and quantitative evaluations on the ScanNetV2 and self-collected datasets demonstrate the effectiveness of our method. Our method can generate depth and mesh online with sub-3 cm accuracy on mobile devices, which is useful for robotic autonomous navigation and mixed reality applications such as real-time occlusion and collision handling.
|
| |
| 09:00-10:30, Paper TuI1I.106 | Add to My Program |
| Fast Exploration Planning with Learning-Based Motion Time Prediction for Aerial Robots |
|
| Wang, Ziyu | Nankai University |
| Dong, Qianli | Nankai University |
| Zhang, Xuebo | Nankai University |
| Zhang, Shiyong | Nankai University |
| Xi, Haobo | Nankai University |
| Ma, Zhe | Nankai University |
| Yuan, Mingxing | Nankai University |
Keywords: Motion and Path Planning, Robotics in Hazardous Fields
Abstract: Unmanned aerial vehicles (UAVs) have been widely employed to achieve autonomous exploration of 3D unknown environments. However, most existing algorithms suffer from low exploration efficiency caused by inaccurate motion time cost evaluation, which typically leads to motion inconsistency during UAV flight. In this work, we propose a learning-based motion time prediction method that evaluates accurate motion time costs to candidate viewpoints in real time. Specifically, the prediction method takes the current state of the UAV and its surrounding environment features as input to predict the arrival time to each viewpoint. Based on the motion time cost prediction, the UAV can minimize the time wasted by unnecessary acceleration and deceleration during exploration. To further improve the efficiency, we also develop an optimal exploration target decision algorithm that benefits from the predicted motion time costs and the adaptive upper-bound constraints. Simulation and real-world experiments demonstrate that our method can significantly improve exploration efficiency and increase the average flight speed of the UAV.
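As context for the motion time costs being predicted: the classical analytic baseline, a rest-to-rest trapezoidal velocity profile, has a closed-form travel time, and the paper replaces such static estimates with a learned predictor conditioned on the UAV's state and surroundings. A sketch of the baseline (parameter values hypothetical):

```python
import math

def min_travel_time(d, v_max, a_max):
    """Closed-form minimum time to cover distance d, starting and ending
    at rest, with a symmetric trapezoidal velocity profile. A standard
    analytic baseline, not the paper's learned predictor."""
    if v_max * v_max / a_max >= d:       # v_max never reached: triangular profile
        return 2.0 * math.sqrt(d / a_max)
    return d / v_max + v_max / a_max     # accelerate, cruise, decelerate

# e.g. d = 10 m, v_max = 2 m/s, a_max = 1 m/s^2 -> 10/2 + 2/1 = 7.0 s
t = min_travel_time(10.0, 2.0, 1.0)
```

This formula ignores the vehicle's current velocity and the surrounding obstacles, which is precisely the inaccuracy the learned predictor is meant to remove.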
|
| |
| 09:00-10:30, Paper TuI1I.107 | Add to My Program |
| ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting |
|
| Yan, Xiaoyang | Hong Kong University of Science and Technology |
| Pei, Muleilan | Hong Kong University of Science and Technology |
| Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Semantic Scene Understanding
Abstract: 3D occupancy prediction is critical for comprehensive scene understanding in vision-centric autonomous driving. Recent advances have explored utilizing 3D semantic Gaussians to model occupancy while reducing computational overhead, but they remain constrained by insufficient multi-view spatial interaction and limited multi-frame temporal consistency. To overcome these issues, in this paper, we propose a novel Spatial-Temporal Gaussian Splatting (ST-GS) framework to enhance both spatial and temporal modeling in existing Gaussian-based pipelines. Specifically, we develop a guidance-informed spatial aggregation strategy within a dual-mode attention mechanism to strengthen spatial interaction in Gaussian representations. Furthermore, we introduce a geometry-aware temporal fusion scheme that effectively leverages historical context to improve temporal continuity in scene completion. Extensive experiments on the large-scale nuScenes occupancy prediction benchmark showcase that our proposed approach not only achieves state-of-the-art performance but also delivers markedly better temporal consistency compared to existing Gaussian-based methods.
|
| |
| 09:00-10:30, Paper TuI1I.108 | Add to My Program |
| Training Humans to Teach Robots: Large and Lasting Skill Gains |
|
| Zhu, Yuqing | King's College London |
| Sun, Endong | King's College London |
| Howard, Matthew | King's College London |
Keywords: Human Factors and Human-in-the-Loop, Human-Robot Collaboration, Learning from Demonstration
Abstract: Recent evidence has shown that, contrary to expectations, it is difficult for novices to teach robots tasks through learning from demonstration (LfD). Novices often struggle with understanding the relationship between robot states and actions, leading to suboptimal demonstrations. This paper introduces a framework that leverages machine teaching algorithms to train novices in a controlled, ideal environment where optimal control parameters are predefined. The training enables participants to internalise fundamental control principles, preparing them to adapt to new skills that share similar properties. The study evaluates whether such teaching ability is (i) retained beyond the training period (including a long-term follow-up) and (ii) generalised so that novices teach robots more effectively in environments where control parameters are not predefined. It reports a series of between-subjects studies that demonstrate that trained novice teachers achieve a 75% improvement in teaching ability, with these gains retained even after guidance is removed, and exhibit a 71% enhancement in applying skills beyond the training content.
|
| |
| 09:00-10:30, Paper TuI1I.109 | Add to My Program |
| Graph Neural Planning and Predictive Control for Multi-Robot Communication-Constrained Unlabeled Motion Planning |
|
| Goarin, Manohari | New York University, Tandon School of Engineering |
| Zhou, Yang | New York University |
| Loianno, Giuseppe | UC Berkeley |
Keywords: Multi-Robot Systems, Integrated Planning and Learning, Task and Motion Planning
Abstract: The multi-robot unlabeled motion planning problem of concurrently assigning robots to goals and generating safe trajectories is central in many collaborative tasks. Recent Graph Neural Network methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking key challenges of real-world deployment such as dynamic feasibility and communication constraints. To address these gaps, we propose a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC). GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. We evaluate our framework in both simulation and real-world quadrotor experiments. Thanks to attention mechanisms and minimal communication requirements, we demonstrate improved generalization to larger teams, robustness to communication delays of up to 200 ms, and practical feasibility with decentralized on-board inference.
|
| |
| 09:00-10:30, Paper TuI1I.110 | Add to My Program |
| Perception-Driven Estimation of Terrain Motion Resistance for UGVs |
|
| Bourbon, Tom | CNRS |
| Aravecchia, Stephanie | Georgia Tech Europe - IRL 2958 GT-CNRS |
| Pradalier, Cedric | GeorgiaTech Lorraine |
Keywords: Field Robots, Probabilistic Inference, Vision-Based Navigation
Abstract: Accurate estimation of wheel–terrain interaction parameters is important for efficient navigation of Unmanned Ground Vehicles in unstructured outdoor environments. In this paper, we propose a hybrid data-driven and model-based method to estimate a priori the motion resistance, a terrain-specific parameter representing the force opposing wheel motion, which is largely influenced by terrain class and geometry. The proposed method relies on learning motion resistance from proprioceptive feedback collected on reference terrains. This learned model is then transferred to new environments, where motion resistance is inferred from exteroceptive observations, including LiDAR and cameras, leveraging terrain geometry and class information. To capture uncertainty from terrain roughness and sensor noise, we evaluate two probabilistic models predicting motion resistance distributions: a Gaussian-MLP and a Gaussian Process Regressor. Their robustness to domain shifts is assessed by measuring performance degradation as the target diverges from the source domain. Extensive off-road field experiments validate the method’s effectiveness, demonstrating accurate prediction of motion resistance and its potential for deployment.
|
| |
| 09:00-10:30, Paper TuI1I.111 | Add to My Program |
| GRATE: A Graph Transformer-Based Deep Reinforcement Learning Approach for Time-Efficient Autonomous Robot Exploration |
|
| Ni, Haozhan | National University of Singapore |
| Liang, Jingsong | Nanyang Technological University |
| He, Chenyu | Tongji University |
| Cao, Yuhong | National University of Singapore |
| Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Keywords: View Planning for SLAM, Reinforcement Learning
Abstract: Autonomous robot exploration (ARE) is the process of a robot autonomously navigating and mapping an unknown environment. Recent Reinforcement Learning (RL)-based approaches typically formulate ARE as a sequential decision-making problem defined on a collision-free informative graph. However, these methods often demonstrate limited reasoning ability over graph-structured data. Moreover, due to the insufficient consideration of robot motion, the resulting RL policies are generally optimized to minimize travel distance, while neglecting time efficiency. To overcome these limitations, we propose GRATE, a Deep Reinforcement Learning (DRL)-based approach that leverages a Graph Transformer to effectively capture both local structure patterns and global contextual dependencies of the informative graph, thereby enhancing the model’s reasoning capability across the entire environment. In addition, we deploy a Kalman filter to smooth the waypoint outputs, ensuring that the resulting path is kinodynamically feasible for the robot to follow. Experimental results demonstrate that our method exhibits better exploration efficiency (up to 21.5% in distance and 21.3% in time to complete exploration) than state-of-the-art conventional and learning-based baselines in various simulation benchmarks. We also validate our planner in real-world scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.112 | Add to My Program |
| DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds |
|
| Pei, Siqi | Delft University of Technology |
| Palffy, Andras | Perciv AI |
| Gavrila, Dariu | Delft University of Technology |
Keywords: Intelligent Transportation Systems, Object Detection, Segmentation and Categorization, Deep Learning Methods
Abstract: 4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide a significantly lower point cloud density than LiDAR sensors. This makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset and a proprietary internal dataset. It outperforms the baselines on the tasks of object detection and/or free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% (compared to 45.4% for CenterPoint) on the VoD dataset.
|
| |
| 09:00-10:30, Paper TuI1I.113 | Add to My Program |
| Taxonomy-Aware Dynamic Motion Generation on Hyperbolic Manifolds |
|
| Augenstein, Luis | TNG Technology Consulting GmbH |
| Jaquier, Noémie | KTH Royal Institute of Technology |
| Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
| Rozo, Leonel | Italian Institute of Artificial Intelligence for Industry |
Keywords: Representation Learning, Motion and Path Planning, Grasping
Abstract: Human-like motion generation for robots often draws inspiration from biomechanical studies, which categorize complex human motions into hierarchical taxonomies. While these taxonomies provide rich structural information about how movements relate to one another, this information is frequently overlooked in motion generation models, leading to a disconnect between the generated motions and their underlying hierarchical structure. This paper introduces the Gaussian Process Hyperbolic Dynamical Model (GPHDM), a novel approach that learns latent representations preserving both the hierarchical structure of motions and their temporal dynamics to ensure physical consistency. Our model achieves this by extending the dynamics prior of the Gaussian Process Dynamical Model (GPDM) to the hyperbolic manifold and integrating it with taxonomy-aware inductive biases. Building on this geometry- and taxonomy-aware framework, we propose three novel mechanisms for generating motions that are both taxonomically structured and physically consistent: two probabilistic recursive approaches and a method based on pullback-metric geodesics. Experiments on generating realistic motion sequences on the hand grasping taxonomy show that the proposed GPHDM faithfully encodes the underlying taxonomy and temporal dynamics, and generates novel physically consistent trajectories.
|
| |
| 09:00-10:30, Paper TuI1I.114 | Add to My Program |
| From CAD to POMDP: Probabilistic Planning for Robotic Disassembly of End-Of-Life Products |
|
| Baumgärtner, Jan | Karlsruhe Institute of Technology |
| Hansjosten, Malte | Karlsruhe Institute of Technology (KIT) |
| Hald, David | Karlsruhe Institute of Technology |
| Hauptmannl, Adrian | Karlsruhe Institute of Technology |
| Puchta, Alexander | Karlsruhe Institute of Technology |
| Fleischer, Jürgen | Karlsruhe Institute of Technology (KIT) |
Keywords: Disassembly, Task and Motion Planning, Planning under Uncertainty
Abstract: To support the circular economy, robotic systems must not only assemble new products but also disassemble end-of-life (EOL) ones for reuse, recycling, or safe disposal. Existing approaches to disassembly sequence planning often assume deterministic and fully observable product models, yet real EOL products frequently deviate from their initial designs due to wear, corrosion, or undocumented repairs. We argue that disassembly should therefore be formulated as a Partially Observable Markov Decision Process (POMDP), which naturally captures uncertainty about the product's internal state. We present a mathematical formulation of disassembly as a POMDP, in which hidden variables represent uncertain structural or physical properties. Building on this formulation, we propose a task and motion planning framework that automatically derives specific POMDP models from CAD data, robot capabilities, and inspection results. To obtain tractable policies, we approximate this formulation with a reinforcement-learning approach that operates on stochastic action outcomes informed by inspection priors, while a Bayesian filter continuously maintains beliefs over latent EOL conditions during execution. Using three products on two robotic systems, we demonstrate that this probabilistic planning framework outperforms deterministic baselines in terms of average disassembly time and variance, generalizes across different robot setups, and successfully adapts to deviations from the CAD model, such as missing or stuck parts.
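The Bayesian filtering step the abstract describes, maintaining a belief over latent EOL conditions during execution, can be illustrated with a toy two-state example. The hidden states, observation model, and probabilities below are hypothetical, not taken from the paper:

```python
def bayes_update(belief, likelihood):
    """One discrete Bayes-filter step over a latent EOL condition.

    belief: prior probability per hidden state.
    likelihood: probability of the observation under each hidden state.
    """
    posterior = {s: belief[s] * likelihood[s] for s in belief}
    z = sum(posterior.values())  # normalization constant
    return {s: p / z for s, p in posterior.items()}

# Hypothetical latent condition of a screw: intact vs. stuck (e.g., corroded)
belief = {"ok": 0.7, "stuck": 0.3}
# Inspection measures unusually high unscrewing torque, which is far more
# likely when the screw is stuck; the belief shifts accordingly
belief = bayes_update(belief, {"ok": 0.2, "stuck": 0.9})
```

After the update the "stuck" hypothesis dominates, so a planner acting on this belief would switch to a recovery action rather than repeat the nominal unscrewing step.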
|
| |
| 09:00-10:30, Paper TuI1I.115 | Add to My Program |
| TUN3D: Towards Real-World Scene Understanding from Unposed Images |
|
| Konushin, Anton | Lomonosov Moscow State University |
| Drozdov, Nikita | Lomonosov Moscow State University |
| Gabdullin, Bulat | MSU |
| Zakharov, Alexey | MSU |
| Vorontsova, Anna | Samsung AI Center, Moscow |
| Rukhovich, Danila | Institute of Mechanics, Armenia |
| Kolodiazhnyi, Maksim | MSU |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Computer Vision for Automation
Abstract: Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding.
|
| |
| 09:00-10:30, Paper TuI1I.116 | Add to My Program |
| Multi-Agent Monte Carlo Tree Search for Makespan-Efficient Object Rearrangement in Cluttered Spaces |
|
| Ren, Hanwen | Purdue University |
| Kim, Junyoung | Purdue University |
| Tharmasanthiran, Aathman | Purdue University |
| Qureshi, Ahmed H. | Purdue University |
Keywords: Task Planning, Task and Motion Planning, Multi-Robot Systems
Abstract: Object rearrangement planning in complex, cluttered environments is a common challenge in warehouses, households, and rescue sites. Prior studies largely address monotone instances, whereas real-world tasks are often non-monotone—objects block one another and must be temporarily relocated to intermediate positions before reaching their final goals. In such settings, effective multi-agent collaboration can substantially reduce the time required to complete tasks. This paper introduces Centralized, Asynchronous, Multi-agent Monte Carlo Tree Search (CAM-MCTS), a novel framework for general-purpose makespan-efficient object rearrangement planning in challenging environments. CAM-MCTS combines centralized task assignment—where agents remain aware of each other’s intended actions to facilitate globally optimized planning—with an asynchronous task execution strategy that enables agents to take on new tasks at appropriate time steps, rather than waiting for others, guided by a one-step look-ahead cost estimate. This design minimizes idle time, prevents unnecessary synchronization delays, and enhances overall system efficiency. We evaluate CAM-MCTS across a diverse set of monotone and non-monotone tasks in cluttered environments, demonstrating consistent reductions in makespan compared to strong baselines. Finally, we validate our approach on a real-world multi-agent system under different configurations, further confirming its effectiveness and robustness. Videos can be found at https://www.youtube.com/watch?v=kNRg2kNnFxg.
|
| |
| 09:00-10:30, Paper TuI1I.117 | Add to My Program |
| Zero-Shot Exocentric Viewpoint-Robust Imitation Learning (VIL): Bridging Handheld Gripper and Exocentric Views |
|
| Li, Boyan | Shanghai Jiao Tong University |
| Meng, Peilin | Shanghai Jiao Tong University |
| Liu, Chang | Shanghai Jiao Tong University |
| Chen, Yulin | Shanghai Jiao Tong University |
| Zhou, Qi | Shanghai Jiaotong University |
| Bi, Youyi | Shanghai Jiao Tong University |
Keywords: Imitation Learning, Learning from Demonstration, Grippers and Other End-Effectors
Abstract: Recent advances in robot learning have motivated integrated pipelines that combine hardware for data collection with imitation learning algorithms. Existing data collection methods like leader–follower, VR/AR, and exoskeletons rely on costly hardware and exhibit limited scalability, while imitation learning algorithms built on them remain highly sensitive to viewpoint shifts, further constraining generalizability. Handheld grippers provide a low-cost, robot-agnostic alternative, but prior systems bypass exocentric view alignment by relying solely on wrist-mounted cameras, resulting in narrowed observation and reduced policy robustness. We propose VIL, a framework pairing a customized handheld gripper with a zero-shot, exocentric viewpoint-robust imitation learning algorithm, bridging the handheld gripper with exocentric views. Our approach employs adapters for appearance alignment and a hybrid encoder design to extract view-consistent representations for an ACT-style policy, enabling robust execution across diverse perspectives. We further optimize the data collection pipeline and validate the system in both simulation and real-world tasks. Experiments show that VIL achieves stable performance under viewpoint shifts, challenging low-horizon scenarios, and dynamic perspectives, outperforming SOTA methods and demonstrating a scalable pipeline for manipulator-independent, viewpoint-robust policy learning. The project repository containing code and hardware is available at https://github.com/liboyan233/VIL.git.
|
| |
| 09:00-10:30, Paper TuI1I.118 | Add to My Program |
| Autonomous Distributionally Robust Virtual Energy Storage Services Based on Parked Electric Vehicles |
|
| Mignoni, Nicola | Politecnico Di Bari |
| Pantazis, Georgios | Eindhoven University of Technology |
| Carli, Raffaele | Politecnico Di Bari |
| Grammatico, Sergio | Delft University of Technology |
| Dotoli, Mariagrazia | Politecnico Di Bari |
Keywords: Energy and Environment-Aware Automation, Optimization and Optimal Control, Probability and Statistical Methods
Abstract: We propose a novel model of a virtual energy storage system (ESS) that leverages the aggregate battery capacity of parked and idling electric vehicles (EVs). Such an energy service is offered to a community of prosumers as a temporary energy buffer and managed by a parking lot manager (PLM), which absorbs the risks arising from the unreliability of the EV-based ESS due to the arrival and departure of EVs. Hence, from the prosumers' perspective, such a virtual storage service behaves deterministically. Both the PLM and the prosumers act as self-interested agents that optimize their own objectives, subject to operational constraints, leading to a non-cooperative game. To deal with the uncertainty of prosumers' renewable net generation and EVs' arrivals/departures, we use a data-driven distributionally robust approach, showing that a tractable reformulation can be obtained, where the equilibrium solutions can be computed by solving a variational inequality. Numerical simulations based on real data illustrate the behavior of the proposed model.
|
| |
| 09:00-10:30, Paper TuI1I.119 | Add to My Program |
| Closing the Communication Loop for Robotic Failures: Multi-Turn, Behavior-Tree-Grounded Explanations with Large Language Models |
|
| Khanna, Parag | KTH Royal Institute of Technology |
| Zhou, Haoyun | KTH Royal Institute of Technology |
| Yadollahi, Elmira | Lancaster University |
| Leite, Iolanda | KTH Royal Institute of Technology |
| Smith, Claes Christian | KTH Royal Institute of Technology |
Keywords: Human-Robot Collaboration, Social HRI, Natural Dialog for HRI
Abstract: Robot failures during collaborative tasks can frustrate users and reduce trust. To address this, we developed a failure communication module that combines large language models (LLMs) with Behavior Trees (BTs) to generate interactive, context-aware explanations for task failures. The module supports three key processes: (1) initial explanations at high, medium, or low levels of detail, (2) interactive clarifications for user follow-up questions, and (3) explicit verification of user actions to close the recovery loop. By leveraging the BT structure and persistent interaction history, it generates responsive, multi-turn explanations and reduces redundancy for repeated failures. We implemented and evaluated this module in real-time robotic pick-and-place tasks in a user study with 33 participants across the three explanation conditions. The user study showed that the module improved resolution rates for challenging failures and reduced resolution times for simpler failures, demonstrating the effectiveness of LLM-powered, BT-grounded explanations in human-robot collaboration (HRC).
|
| |
| 09:00-10:30, Paper TuI1I.120 | Add to My Program |
| Behavior Foundation Model for Humanoid Robots |
|
| Zeng, Weishuai | Peking University |
| Lu, Shunlin | The Chinese University of Hong Kong, Shenzhen |
| Yin, Kangning | Shanghai Jiao Tong University |
| Niu, Xiaojie | Shanghai Artificial Intelligence Laboratory |
| Dai, Minyue | Fudan University |
| Wang, Jingbo | Shanghai Artificial Intelligence Laboratory |
| Pang, Jiangmiao | Shanghai Artificial Intelligence Laboratory |
Keywords: Whole-Body Motion Planning and Control, Imitation Learning, Reinforcement Learning
Abstract: Whole-body control (WBC) of humanoid robots has witnessed remarkable progress in skill versatility, enabling a wide range of applications such as locomotion, teleoperation, and motion tracking. Despite these achievements, existing WBC frameworks remain largely task-specific, relying heavily on labor-intensive reward engineering and demonstrating limited generalization across tasks and skills. These limitations hinder their response to arbitrary control modes and restrict their deployment in complex, real-world scenarios. To address these challenges, we revisit existing WBC systems and identify a shared objective across diverse tasks: the generation of appropriate behaviors that guide the robot toward desired goal states. Building on this insight, we propose the Behavior Foundation Model (BFM), a generative model pretrained on large-scale behavioral datasets to capture broad, reusable behavioral knowledge for humanoid robots. BFM integrates a masked online distillation framework with a Conditional Variational Autoencoder (CVAE) to model behavioral distributions, thereby enabling flexible operation across diverse control modes and efficient acquisition of novel behaviors without retraining from scratch. Extensive experiments in both simulation and on a physical humanoid platform demonstrate that BFM generalizes robustly across diverse WBC tasks while rapidly adapting to new behaviors. These results establish BFM as a promising step toward a foundation model for general-purpose humanoid control.
|
| |
| 09:00-10:30, Paper TuI1I.121 | Add to My Program |
| SPLC: Social Preference Learning for Crowd Robot Navigation |
|
| Chen, Zixuan | Wuhan University of Science and Technology |
| Fu, Hao | Wuhan University of Science and Technology |
| Hu, Haiwen | Wuhan University of Science and Technology |
| Zheng, Shiquan | Wuhan University of Science and Technology |
Keywords: Motion and Path Planning, Reinforcement Learning, Collision Avoidance
Abstract: Offline reinforcement learning (RL) holds significant potential for crowd robot navigation in human-robot coexistence applications. However, the inherent complexity of pedestrian motion renders the design of effective reward functions for promoting socially compliant robot behaviors a persistent challenge. This paper proposes a Social Preference Learning for Crowd Robot Navigation (SPLC) algorithm to eliminate the need for detailed reward design. Its core innovation lies in the introduction of a social preference feedback mechanism to automatically generate preference data through principled preference evaluation criteria. By explicitly accounting for the intricacies of pedestrian dynamics, the pipeline mitigates the reward bias and facilitates the systematic quantification of broad social norms, thereby fostering socially compliant behaviors. Extensive experiments integrating SPLC with offline RL methods demonstrate consistent improvements over state-of-the-art baselines across standard performance metrics. Furthermore, real-world experiments on the TurtleBot4 further validate the effectiveness of SPLC in practical human–robot coexistence settings. Our code and video demos are available at https://github.com/sklus949/SPLC.
|
| |
| 09:00-10:30, Paper TuI1I.122 | Add to My Program |
| Bending Perception-Based Variable Stiffness Control for Snake Robots in Pipe Navigation |
|
| Meng, Shiyong | Central South University |
| Yang, Huizhuo | Tiangong University |
| Shen, Kai | Tiangong University |
| Xu, Honglu | Tiangong University |
| Chen, Gen | Tiangong University |
| Wang, Jianming | Tiangong University |
| Xiao, Xuan | Tiangong University |
|
|
| |
| 09:00-10:30, Paper TuI1I.123 | Add to My Program |
| DexTele: A Dual-Arm Dexterous Teleoperation System Based on Motion Retargeting and Adaptive Force Control |
|
| Lai, Yuanchuan | Sun Yat-Sen University |
| Gao, Qing | Sun Yat-Sen University |
| Liang, Ziyan | Sun Yat-Sen University |
| Cheng, Xianfeng | Sun Yat-Sen University |
| Hu, Junjie | The Chinese University of Hong Kong, Shenzhen |
| Ju, Zhaojie | University of Portsmouth |
Keywords: Telerobotics and Teleoperation, Dual Arm Manipulation
Abstract: In dual-arm dexterous teleoperation, cross-platform generalization of motion retargeting and interactivity of grasping are crucial. However, the heterogeneity of robotic architectures and the wide variety of grasping objects pose significant challenges to achieving precise motion retargeting and compliant grasping in dual-arm dexterous teleoperation. To address these challenges, a dual-arm dexterous teleoperation system (DexTele) is proposed based on motion retargeting and adaptive force control. First, a vision-based motion retargeting module is designed to generate preliminary robot motions from human images. In this module, a motion-graph encoder and latent optimization are proposed for precise and convenient cross-platform motion retargeting. Second, an adaptive grasping module is designed to achieve compliant grasping. This module combines a vision-language model (VLM) with model predictive control (MPC), allowing the system to predict the required grasping force for a target object and perform gradient-based online optimization. Finally, extensive experiments demonstrate that the DexTele achieves precise motion retargeting and compliant grasping with generalization across multiple robot platforms. The source code will be released upon paper acceptance.
|
| |
| 09:00-10:30, Paper TuI1I.124 | Add to My Program |
| Dynamic Scoop-And-Flick Manipulation for Rapid Non-Prehensile High-Arc Object Transfer |
|
| Ahn, Gijae | Pusan National University |
| Lee, Junwoo | Pusan National University |
| Oh, Seung Hwa | Pusan National University |
| Shin, Mujin | Pusan National University |
| Yi, Seung-Joon | Pusan National University |
| Seo, Jungwon | Pusan National University |
Keywords: Grasping, Dexterous Manipulation, Grippers and Other End-Effectors
Abstract: This study presents dynamic scoop-and-flick manipulation, a robotic technique that achieves desired projectile motions of target objects through rapid, non-prehensile physical interactions. The method allows a robot to scoop objects resting on a surface and quickly launch them into projectile trajectories. We formulate a theoretical model of the technique and realize it through a hybrid approach that combines model-based reasoning and data-driven learning. The advantages, namely rapid and accurate pick-and-place with reduced planning complexity, are validated in experiments conducted with a particularly challenging class of objects: low-profile items with small thickness.
|
| |
| 09:00-10:30, Paper TuI1I.126 | Add to My Program |
| Planning Using Belief Summaries for Goal-Directed Manipulation of Articulated Objects with Force and Proprioception |
|
| Illandara, Thavishi | MIT |
| Hagenow, Michael | University of Wisconsin - Madison |
| Shah, Julie A. | MIT |
Keywords: Manipulation Planning, Planning under Uncertainty, Sensor-based Control
Abstract: Enabling robots to manipulate articulated objects is essential for their successful integration into human-centric environments. Such manipulation is often part of a larger multistep task, where achieving a specific joint configuration is necessary for subsequent actions; for example, in a cluttered environment, a cabinet door must be rotated to a precise angle that creates just enough clearance to retrieve an object, beyond which it would collide with surrounding obstacles. In this work, we present an approach to learning goal-directed policies for articulated object manipulation using force and proprioceptive feedback. We formulate the manipulation problem as a Partially Observable Markov Decision Process (POMDP) with a continuous state space and a set of low-level control actions. Due to the limitations of standard POMDP solvers in this setting, we introduce Planning using Belief Summaries (PuBS), which approximates the POMDP as a Markov Decision Process (MDP) over compact particle-filter belief summaries encoding estimated state and uncertainty. This approximate MDP is then solved using reinforcement learning techniques to learn goal-directed policies that enable safe exploration while efficiently guiding the object toward the goal. We evaluate our approach through simulation and real-world robotic experiments, demonstrating reliable goal-reaching performance.
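The belief-summary idea can be sketched in a few lines: a particle belief over a joint configuration is compressed into an (estimate, uncertainty) pair that serves as the MDP state. The choice of mean and standard deviation below is an assumed minimal summary, not the paper's actual feature set:

```python
import numpy as np

def belief_summary(particles):
    """Compress a particle-filter belief over a 1-D joint angle into a
    compact summary vector (estimated state, uncertainty).
    Mean/std is an assumed minimal choice of summary features."""
    particles = np.asarray(particles, dtype=float)
    return np.array([particles.mean(), particles.std()])

rng = np.random.default_rng(0)
# A tight and a diffuse belief over a cabinet-door angle (radians)
tight = belief_summary(rng.normal(0.8, 0.01, size=500))
diffuse = belief_summary(rng.normal(0.8, 0.30, size=500))
# Both beliefs share the same estimate, but the summary exposes their
# very different uncertainties to the policy
```

A policy trained on such summaries can behave cautiously when the uncertainty component is large (safe exploration) and move directly to the goal when it is small.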
|
| |
| 09:00-10:30, Paper TuI1I.127 | Add to My Program |
| Voronoi-Based Second-Order Descriptor with Whitened Metric in LiDAR Place Recognition |
|
| Kim, Jaein | Seoul National University |
| Yoo, Hee Bin | École Normale Supérieure |
| Han, Dong-Sig | Imperial College London |
| Zhang, Byoung-Tak | Seoul National University |
Keywords: Localization, Deep Learning Methods, Deep Learning for Visual Perception
Abstract: The pooling layer plays a vital role in aggregating local descriptors into a metrizable global descriptor in LiDAR Place Recognition (LPR). In particular, second-order pooling is capable of capturing higher-order interactions among local descriptors. However, existing second-order methods in LPR adhere to conventional implementations and post-normalization, which render the descriptor unsuitable for Euclidean distance comparisons. Based on the recent interpretation that associates NetVLAD with second-order statistics, we propose to integrate second-order pooling with the inductive bias from Voronoi cells. Our novel pooling method aggregates local descriptors to form the second-order matrix and whitens the global descriptor to implicitly measure the Mahalanobis distance while conserving the cluster property from Voronoi cells, and addresses its numerical instability during learning with several stabilization techniques. We demonstrate its performance gains through experiments conducted on the Oxford RobotCar and Wild-Places benchmarks and analyze the numerical effect of the proposed whitening algorithm.
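The whitening trick referenced here, applying Σ^{-1/2} so that plain Euclidean distance between whitened descriptors equals the Mahalanobis distance between the originals, can be verified numerically in a toy setting (random data standing in for descriptors, not the paper's features):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy global descriptors with correlated dimensions
D = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
cov = np.cov(D, rowvar=False)

# Whitening transform W = Sigma^{-1/2} via eigendecomposition
w, V = np.linalg.eigh(cov)
W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

a, b = D[0], D[1]
# Mahalanobis distance in the original space ...
d_mahal = float(np.sqrt((a - b) @ np.linalg.inv(cov) @ (a - b)))
# ... equals plain Euclidean distance after whitening
d_eucl = float(np.linalg.norm(W @ (a - b)))
```

This is why a whitened global descriptor supports ordinary nearest-neighbor retrieval: the metric is baked into the representation.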
|
| |
| 09:00-10:30, Paper TuI1I.128 | Add to My Program |
| World Model Failure Classification and Anomaly Detection for Autonomous Inspection |
|
| Ho, Michelle | Stanford University |
| Ginting, Muhammad Fadhil | Stanford University |
| Ward, Isaac Ronald | Stanford University |
| Reinke, Andrzej | University of Bonn |
| Kochenderfer, Mykel J. | Stanford University |
| Agha-mohammadi, Ali-akbar | NASA-JPL, Caltech |
| Omidshafiei, Shayegan | Field AI, Inc |
Keywords: Failure Detection and Recovery, Field Robots
Abstract: Autonomous inspection robots for monitoring industrial sites can reduce costs and risks associated with human-led inspection. However, accurate readings can be challenging due to occlusions, limited viewpoints, or unexpected environmental conditions. We propose a hybrid framework that combines supervised failure classification with anomaly detection, enabling classification of inspection tasks as a success, known failure, or anomaly (i.e., out-of-distribution) case. Our approach uses a world model backbone with compressed video inputs. This policy-agnostic, distribution-free framework determines classifications based on two decision functions with thresholds set by conformal prediction (CP), issuing decisions ahead of a human observer. We evaluate the framework on gauge inspection feeds collected from office and industrial sites and demonstrate real-time deployment on a Boston Dynamics Spot. Experiments show over 90% accuracy in distinguishing between successes, failures, and OOD cases, with classifications occurring earlier than a human observer's. These results highlight the potential for robust, anticipatory failure detection in autonomous inspection tasks or as a feedback signal to assess and improve the quality of training data. Project website: https://autoinspection-classification.github.io/
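The conformal-prediction thresholding behind the two decision functions can be sketched as follows. The quantile formula is the standard split-conformal one; the score definitions and the two-stage decision rule are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def cp_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the (1 - alpha)-adjusted empirical
    quantile of calibration nonconformity scores. Distribution-free:
    no assumption on the score distribution is needed."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

rng = np.random.default_rng(1)
# Hypothetical OOD-ness scores measured on in-distribution calibration runs
tau_ood = cp_threshold(rng.uniform(0.0, 1.0, 100), alpha=0.1)

def classify(ood_score, failure_prob, tau_ood, tau_fail=0.5):
    """Two-stage decision: flag anomalies first, then success vs. known
    failure (tau_fail is an arbitrary illustrative threshold)."""
    if ood_score > tau_ood:
        return "anomaly"
    return "known_failure" if failure_prob > tau_fail else "success"
```

By construction, at most roughly an alpha fraction of in-distribution cases exceed `tau_ood`, which bounds the false-anomaly rate without any model of the score distribution.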
|
| |
| 09:00-10:30, Paper TuI1I.129 | Add to My Program |
| Cooperative-Competitive Team Play of Real-World Craft Robots |
|
| Zhao, Rui | Tencent |
| Li, Xihui | Tsinghua University |
| Zhang, Yizheng | Tencent |
| Liu, Yuzhen | Tencent |
| Zhang, Zhong | City University of Hong Kong |
| Zhang, Yufeng | Tencent |
| Zhou, Cheng | Tencent |
| Zhang, Zhengyou | Tencent |
| Han, Lei | Tencent Robotics X |
Keywords: Multi-Robot Systems, Cooperating Robots, Reinforcement Learning
Abstract: Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a comprehensive robotic system, including simulation, distributed learning framework, and physical robot components. We then propose and evaluate reinforcement learning techniques designed for efficient training of cooperative and competitive policies on this platform. To address the challenges of multi-agent sim-to-real transfer, we introduce Out of Distribution State Initialization (OODSI) to mitigate the impact of the sim-to-real gap. In the experiments, OODSI improves sim-to-real performance by 20%. We demonstrate the effectiveness of our approach through experiments with a multi-robot car competitive game and a cooperative task in real-world settings.
|
| |
| 09:00-10:30, Paper TuI1I.130 | Add to My Program |
| Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection |
|
| Peng, Muyao | Huazhong University of Science and Technology |
| Zou, Shun | Huazhong University of Science and Technology |
| An, Pei | Huazhong University of Science and Technology |
| Yang, You | Huazhong University of Science and Technology |
| Liu, Qiong | Huazhong University of Science and Technology |
Keywords: Computer Vision for Automation, Localization, Mapping
Abstract: Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation, grasping and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, due to issues such as scale ambiguity, many geometrically inconsistent outlier correspondences persist in the feature space. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, cross-modality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-to-local hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.
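The angular-consistency constraint exploits the fact that angles between displacement vectors are invariant to uniform scaling, so a correspondence that breaks angle preservation is an outlier even when scale is ambiguous. A toy version of such a test follows; the paper embeds the constraint in a learned hierarchical-attention network rather than applying it directly like this:

```python
import numpy as np

def angle(u, v):
    """Angle between two vectors; invariant to uniform scaling."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def angle_consistent(P, Q, i, j, k, tol=1e-3):
    """Does the correspondence triplet (i, j, k) preserve the angle at i?"""
    return abs(angle(P[j] - P[i], P[k] - P[i])
               - angle(Q[j] - Q[i], Q[k] - Q[i])) < tol

# Q is a uniformly scaled copy of P: all angles survive the scale change
P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
Q = 2.5 * P
# Corrupting one correspondence breaks angular consistency
Q_bad = Q.copy()
Q_bad[2] = np.array([2.5, 0.5, 0.0])
```

A distance-based check would flag every pair in `Q` as inconsistent because of the 2.5x scale, whereas the angle test correctly accepts `Q` and rejects only the corrupted match in `Q_bad`.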
|
| |
| 09:00-10:30, Paper TuI1I.131 | Add to My Program |
| Simulation-Driven Evolutionary Motion Parameterization for Contact-Rich Granular Scooping with a Soft Conical Robotic Hand |
|
| Wang, Yongliang | University of Groningen |
| Beltran-Hernandez, Cristian Camilo | OMRON SINIC X Corporation |
| Takahashi, Tomoya | OMRON SINIC X Corporation |
| Hamaya, Masashi | OMRON SINIC X Corporation |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Simulation and Animation
Abstract: Tool-based scooping is vital in robot-assisted tasks, enabling interaction with objects of varying sizes, shapes, and material states. Recent studies have shown that flexible, reconfigurable soft robotic end-effectors can adapt their shape to maintain consistent contact with container surfaces during scooping, improving efficiency compared to rigid tools. These soft tools can adjust to varying container sizes and materials without requiring complex sensing or control. However, the inherent compliance and complex deformation behavior of soft robotics introduce significant control complexity that limits practical applications. To address this challenge, this paper presents the development of a physics-based simulation model of a deformable soft conical robotic hand that captures its passive reconfiguration dynamics and enables systematic trajectory optimization for scooping tasks. We propose a novel physics-based simulation approach that accurately models the soft tool’s morphing behavior from flat sheets to adaptive conical structures, combined with an evolutionary strategy framework that automatically optimizes scooping trajectories without manual parameter tuning. We validate the optimized trajectories through both simulation and real-robot experiments. The results demonstrate strong generalization and successfully address a range of challenging tasks previously beyond the reach of existing approaches. Videos of our experiments are available online: https://sites.google.com/view/scoopsh
|
| |
| 09:00-10:30, Paper TuI1I.132 | Add to My Program |
| Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models |
|
| Ni, Weicong | East China Normal University |
| Jiang, Tianbao | East China Normal University |
| Wang, Linlin | East China Normal University |
Keywords: Multi-Modal Perception for HRI, Hybrid Logical/Dynamical Planning and Verification, Acceptability and Trust
Abstract: Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies, enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.
|
| |
| 09:00-10:30, Paper TuI1I.133 | Add to My Program |
| Efficient Collision-Avoidance for Multi-Robot System with Superquadric Models and Sum-Of-Squares Approximation |
|
| Lu, Siyi | Beihang University |
| Ruan, Sipu | Beihang University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Collision Avoidance, Distributed Robot Systems
Abstract: Multi-robot motion planning and crowd simulations are crucial in social navigation, enabling agents to avoid collisions with one another in dynamic environments. While existing methods typically use simple circular models for robot and pedestrian boundaries, superquadric models offer greater flexibility in accurately representing non-circular objects. This paper addresses the challenges of employing superquadric models to avoid dynamic obstacles and other moving robots. We tackle three primary challenges: (i) approximating the complex parametric boundary surface of the Minkowski sum for easier differentiation; (ii) computing the boundary of velocity obstacles; and (iii) rapidly calculating velocity changes. The approximation of the differentiable Minkowski sum boundary is formulated as a semidefinite programming problem using convex sum-of-squares polynomials. We then develop a tangency point-finding algorithm with superlinear convergence speed, and introduce a rule-based collision-avoidance approach, named SSCA (Superquadric-based Sum-of-Squares Collision Avoidance for Multi-Robot Systems), for efficient velocity change calculation. Our proposed method is evaluated through extensive experiments, demonstrating millisecond-level computational efficiency and scalability to dozens of robots. This work provides a more effective collision-avoidance solution using superquadric models, enhancing the safety and performance of robots in dynamic shared environments.
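Superquadrics admit a standard inside-outside function F, with F = 1 exactly on the surface, F < 1 inside, and F > 1 outside, which is what makes them convenient analytic shape models for collision reasoning. A minimal sketch of that function (the semi-axes and shape exponents below are arbitrary illustrative values, unrelated to the paper's experiments):

```python
import numpy as np

def superquadric_F(p, a=(1.0, 0.5, 0.3), e=(0.8, 1.2)):
    """Superquadric inside-outside function: F < 1 inside the body,
    F = 1 on the surface, F > 1 outside.
    a = semi-axes, e = (e1, e2) shape exponents, both chosen arbitrarily."""
    x, y, z = np.abs(np.asarray(p, dtype=float)) / np.asarray(a)
    e1, e2 = e
    return (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1) + z ** (2.0 / e1)

# The tip of the first semi-axis lies exactly on the surface
on_surface = superquadric_F([1.0, 0.0, 0.0])
inside = superquadric_F([0.1, 0.0, 0.0])
outside = superquadric_F([2.0, 0.0, 0.0])
```

Varying the exponents morphs the shape from ellipsoid-like (e near 1) toward box-like (e near 0), covering the non-circular agent footprints the abstract targets with a single parametric family.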
|
| |
| 09:00-10:30, Paper TuI1I.134 | Add to My Program |
| SOE: Sample-Efficient Robot Policy Self-Improvement Via On-Manifold Exploration |
|
| Jin, Yang | Shanghai Jiao Tong University |
| Lv, Jun | Shanghai Jiao Tong University |
| Xue, Han | Shanghai Jiao Tong University |
| Chen, Wendi | Shanghai Jiao Tong University |
| Wen, Chuan | Shanghai Jiao Tong University |
| Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Imitation Learning, Learning from Experience, Deep Learning in Grasping and Manipulation
Abstract: Intelligent agents progress by continually refining their capabilities through actively exploring environments. Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement.
|
| |
| 09:00-10:30, Paper TuI1I.135 | Add to My Program |
| InBi-RRT: Incremental Bidirectional Tree Based Real-Time Path Planning/Replanning in Unknown Non-Convex Environments |
|
| Cui, Bo | Northwestern Polytechnical University |
| Li, Yang | Northwestern Polytechnical University |
| Yan, Weisheng | Northwestern Polytechnical University |
| Feng, Ao | Northwestern Polytechnical University |
| Yang, Zhanwei | Northwestern Polytechnical University |
| Cui, Rongxin | Northwestern Polytechnical University |
Keywords: Collision Avoidance, Motion and Path Planning, Task and Motion Planning
Abstract: Real-time path planning in unknown non-convex environments is challenging, as obstacle updates can invalidate existing paths while narrow passages restrict feasible connectivity. This paper presents InBi-RRT, an incremental bidirectional tree-based framework that grows a reverse tree from the goal and maintains a reusable forward tree from the start. When the current path becomes invalid, a cost-guided expansion selectively extends the forward tree to establish collision-free connections with the reverse tree, followed by backtracking and lightweight path optimization for efficient repair. Simulation results in unknown and non-convex scenarios demonstrate that InBi-RRT achieves significantly faster replanning than baseline methods, being up to 5.5× faster than RT-RRT and 22× faster than RRT^X, with paths up to 19.8% shorter than RRT^X under the same sample count. Furthermore, real-world experiments in an indoor maze-like environment verify the practicality and robustness of the proposed planner in unknown non-convex scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.136 | Add to My Program |
| Contact-Robust Trajectory Planning Via Parametric Sensitivity Analysis for Hybrid Robotic Systems |
|
| Belvedere, Tommaso | CNRS |
| Zhu, James | LAAS-CNRS |
| Cognetti, Marco | LAAS-CNRS and Université De Toulouse |
| Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Keywords: Planning under Uncertainty, Optimization and Optimal Control, Robot Safety
Abstract: In this paper, we combine first-order approximations of hybrid systems (i.e., the so-called saltation matrix) with previous works on parametric sensitivity for continuous systems to propose a general framework for robust trajectory optimization of hybrid systems subject to parametric uncertainties. A method for computing parametric sensitivities of both continuous dynamics and hybrid events is presented. The obtained "hybrid parametric sensitivity" is then combined with sensitivity-based tubes that encapsulate all possible perturbed states and control trajectories given a known bounded range for the uncertain parameters. The proposed method is then applied to the problem of planning robust trajectories for legged robot systems, which allows obtaining trajectories that remain feasible w.r.t. the contact constraints even in the presence of uncertainties in the dynamics, guard conditions, and reset maps. We also illustrate one of the fundamental limitations of first-order approximations, that is, the fact that the sensitivity reset time is fixed, and propose an extension to the sensitivity analysis that can form the basis for future developments.
|
| |
| 09:00-10:30, Paper TuI1I.137 | Add to My Program |
| One-Step Model Predictive Path Integral for Manipulator Motion Planning Using Configuration Space Distance Fields |
|
| Li, Yulin | The University of Tokyo |
| Miyazaki, Tetsuro | The University of Tokyo |
| Kawashima, Kenji | The University of Tokyo |
Keywords: Motion and Path Planning, Collision Avoidance, Simulation and Animation
Abstract: Motion planning for robotic manipulators is a fundamental problem in robotics. Classical optimization-based methods typically rely on the gradients of signed distance fields (SDF) to impose collision-avoidance constraints. However, these methods are susceptible to local minima and may fail when the SDF gradients vanish. Recently, Configuration Space Distance Fields (CDF) have been introduced, which directly model distances in the robot’s configuration space. Unlike workspace SDF, CDF are differentiable almost everywhere and thus provide reliable gradient information. On the other hand, gradient-free approaches such as Model Predictive Path Integral (MPPI) control leverage long-horizon rollouts to achieve collision avoidance. While effective, these methods are computationally expensive due to the large number of trajectory samples, repeated collision checks, and the difficulty of designing cost functions with heterogeneous physical units. In this paper, we propose a framework that integrates CDF with MPPI to enable direct navigation in the robot’s configuration space. Leveraging CDF gradients, we unify the MPPI cost in joint-space and reduce the horizon to one step, substantially cutting computation while preserving collision avoidance in practice. We demonstrate that our approach achieves nearly 100% success rates in 2D environments and consistently high success rates in challenging 7-DOF Franka manipulator simulations with complex obstacles. Furthermore, our method attains control frequencies exceeding 750 Hz, substantially outperforming both optimization-based and standard MPPI baselines. These results highlight the effectiveness and efficiency of the proposed CDF-MPPI framework for high-dimensional motion planning.
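The one-step reduction described in the abstract has a compact core: sample single-step joint perturbations, score them with a unified joint-space cost (goal distance plus a penalty from the configuration-space distance field), and take the exponentially weighted average. The sketch below is a generic one-step MPPI update under assumed names and weights (`one_step_mppi`, a `dist_fn` callable standing in for the CDF), not the paper's implementation:

```python
import numpy as np

def one_step_mppi(q, q_goal, dist_fn, n_samples=256, sigma=0.1,
                  lam=1.0, w_coll=10.0, rng=None):
    """One-step MPPI update in configuration space.

    q       -- current joint configuration, shape (d,)
    dist_fn -- configuration-space distance field: dist_fn(q) -> float
    Returns the softmax-weighted next configuration.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = q.shape[0]
    # Sample candidate joint perturbations for a single step.
    dq = rng.normal(0.0, sigma, size=(n_samples, d))
    q_next = q + dq
    # Unified joint-space cost: goal attraction plus a collision penalty
    # that activates when the distance field drops below a margin. Both
    # terms live in configuration space, so no heterogeneous physical
    # units need to be balanced.
    dists = np.array([dist_fn(qn) for qn in q_next])
    cost = (np.linalg.norm(q_next - q_goal, axis=1)
            + w_coll * np.maximum(0.0, 0.05 - dists))
    # MPPI exponential weighting (softmax over negative cost).
    w = np.exp(-(cost - cost.min()) / lam)
    w /= w.sum()
    return q + w @ dq
```

Because the horizon is one step, each control update is a single batched cost evaluation, which is consistent with the high control rates the abstract reports.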
|
| |
| 09:00-10:30, Paper TuI1I.138 | Add to My Program |
| ForSim: Stepwise Forward Simulation for Traffic Policy Fine-Tuning |
|
| Chen, Keyu | Tsinghua University |
| Sun, Wenchao | Tsinghua University |
| Cheng, Hao | Tsinghua University |
| Fu, Zheng | Tsinghua University |
| Zheng, Sifa | Tsinghua University |
Keywords: Autonomous Agents, Motion and Path Planning, Reinforcement Learning
Abstract: As the foundation of closed-loop training and evaluation in autonomous driving, traffic simulation still faces two fundamental challenges: covariate shift introduced by open-loop imitation learning and limited capacity to reflect the multimodal behaviors observed in real-world traffic. Although recent frameworks such as RIFT have partially addressed these issues through group-relative optimization, their forward simulation procedures remain largely non-reactive, leading to unrealistic agent interactions within the virtual domain and ultimately limiting simulation fidelity. To address these issues, we propose ForSim, a stepwise closed-loop forward simulation paradigm. At each virtual timestep, the traffic agent propagates the virtual candidate trajectory that best spatiotemporally matches the reference trajectory through physically grounded motion dynamics, thereby preserving multimodal behavioral diversity while ensuring intra-modality consistency. Other agents are updated with stepwise predictions, yielding coherent and interaction-aware evolution. When incorporated into the RIFT traffic simulation framework, ForSim operates in conjunction with group-relative optimization to fine-tune traffic policy. Extensive experiments confirm that this integration consistently improves safety while maintaining efficiency, realism, and comfort. These results underscore the importance of modeling closed-loop multimodal interactions within forward simulation and enhance the fidelity and reliability of traffic simulation for autonomous driving.
|
| |
| 09:00-10:30, Paper TuI1I.139 | Add to My Program |
| Bioinspired Origami Exosuit for Sequential Lifting Assistance with Energy-Aware Compliance and Event-Triggered Impedance |
|
| Yang, Qunting | University of Science and Technology of China |
| Wu, Xiaoyu | Tongji University |
| Jian, Bingcong | Tongji University |
| Xia, Haisheng | Tongji University |
| Li, Zhijun | Tongji University |
Keywords: Wearable Robotics, Safety in HRI, Biologically-Inspired Robots
Abstract: Back injuries resulting from manual material handling have long constituted a prominent threat to occupational safety. While back-support exosuits offer the potential to augment human strength, their practical implementation is hindered by persistent challenges pertaining to comfort and safety. Drawing inspiration from human biomechanics and muscle behavior, we develop a lightweight assistive exosuit that synchronizes with natural load-handling rhythms. By integrating a deployable Kresling origami structure with a two-stage transmission mechanism, a single motor can sequentially assist both the waist and arms, achieving motion-conforming support with minimal complexity. An energy-aware compliance control strategy allows the system to yield passively during unassisted motion, avoiding interference with voluntary human behavior. We propose an event-triggered impedance control strategy based on an energy tank framework, which adaptively intervenes only when interaction energy exceeds safety thresholds. Experimental results demonstrate substantial reductions in muscle activation during load-handling tasks, with decreases of up to 22.8%, 15.4%, and 14.8% in the biceps, triceps, and erector spinae (MVC%), respectively.
|
| |
| 09:00-10:30, Paper TuI1I.140 | Add to My Program |
| ARTEMIS: Active Real-Time Textured Environment Meshing with Interactive Semantics |
|
| Ge, Yigu | Beijing Institute of Technology |
| Ma, Zhenhuan | Beijing Institute of Technology |
| Tang, Shihao | Beijing Institute of Technology |
| Shi, Yangxi | Beijing Institute of Technology |
| Liang, Xinkai | Beijing Institute of Technology |
| Fang, Hao | Beijing Institute of Technology |
Keywords: Mapping, SLAM, Semantic Scene Understanding
Abstract: To advance 3D reconstruction from static digital replicas towards semantically interactive Living Maps responsive to an agent's queries, we propose ARTEMIS, a system for Active Real-time Textured Environment Meshing with Interactive Semantics. At its core, our Semantic Brush is a methodology comprising tightly coupled modules for segmentation, constraint, and refinement that operate in a two-stage, coarse-to-fine pipeline. Initially, its segmentation and constraint modules translate natural language into a semantically aware mesh, enforcing sharp object boundaries with a unified energy function. Subsequently, its refinement module computes a unified reliability metric from color and depth consistency to guide the joint optimization of the texture map and semantic labels. This holistic process inherently filters unreliable measurements, establishing a complete interactive workflow from language input to real-time highlighting on a high-fidelity textured mesh. We evaluated ARTEMIS on public datasets and in real-world scenarios. The results demonstrate its state-of-the-art accuracy in mesh reconstruction, while simultaneously attaining high fidelity in both texture and semantics. To share our findings and make contributions to the community, our code will be made publicly available.
|
| |
| 09:00-10:30, Paper TuI1I.141 | Add to My Program |
| RL-Based Coverage Path Planning for Deformable Objects on 3D Surfaces |
|
| Zhang, Yuhang | University of Science and Technology of China |
| Ma, Jinming | Xiaomi Robotics Lab |
| Wu, Feng | University of Science and Technology of China |
Keywords: Reinforcement Learning, Motion and Path Planning
Abstract: Currently, many manipulation tasks for deformable objects focus on activities like folding clothes, handling ropes, and manipulating bags. However, research on contact-rich tasks involving deformable objects remains relatively underdeveloped. When humans use cloth or sponges to wipe surfaces, they rely on both vision and tactile feedback. Yet, current algorithms still face challenges with issues like occlusion, while research on tactile perception for manipulation is still evolving and requires further development. Tasks such as covering surfaces with deformable objects demand not only perception but also precise robotic manipulation. To address this, we propose a method that leverages efficient and accessible simulators for task execution. Specifically, we train a reinforcement learning agent in a simulator to manipulate deformable objects for surface wiping tasks. We simplify the state representation of object surfaces using UV mapping, process contact feedback from the simulator on 2D feature maps, and use scaled grouped convolutions to extract features from these maps. The agent then outputs actions in a reduced-dimensional action space to generate coverage paths. Experiments demonstrate that our method outperforms previous approaches in key metrics, including total path length and coverage area. We deploy these paths on the Kinova Gen3 manipulator to perform wiping experiments on the back of a torso model, validating the feasibility of our approach.
|
| |
| 09:00-10:30, Paper TuI1I.142 | Add to My Program |
| Hierarchical Reactive Grasping Via Task-Space Velocity Fields and Joint-Space Quadratic Programming |
|
| Lee, Yonghyeon | Massachusetts Institute of Technology |
| Lin, Tzu-Yuan | Massachusetts Institute of Technology |
| Alexiev, Alexander | Massachusetts Institute of Technology |
| Kim, Sangbae | Massachusetts Institute of Technology |
Keywords: Reactive and Sensor-Based Planning, Grasping, Manipulation Planning
Abstract: We present a fast and reactive grasping framework that combines task-space velocity fields with a joint-space Quadratic Program (QP) in a hierarchical structure. Reactive, collision-free global motion planning is particularly challenging for high-DoF systems, as simultaneous increases in state dimensionality and planning horizon trigger a combinatorial explosion of the search space, making real-time planning intractable. To address this, we plan globally in a lower-dimensional task space – such as fingertip positions – and track locally in the full joint space while enforcing all constraints. This approach is realized by constructing velocity fields in multiple task-space coordinates (or, in some cases, a subset of joint coordinates) and solving a weighted joint-space QP to compute joint velocities that track these fields with appropriately assigned priorities. Through simulation experiments and real-world tests using the recent pose-tracking algorithm FoundationPose, we verify that our method enables high-DoF arm–hand systems to perform real-time, collision-free reaching motions while adapting to dynamic environments and external disturbances.
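The unconstrained core of a weighted joint-space QP of the kind the abstract describes is a stacked weighted least-squares problem: each task contributes rows J_i qdot = v_i, scaled by its priority weight. A minimal sketch under assumed names follows; the paper's full QP also carries inequality constraints (joint limits, collision bounds) that a generic QP solver would handle:

```python
import numpy as np

def weighted_velocity_qp(J_list, v_list, w_list, damping=1e-6):
    """Joint velocities tracking several task-space velocity fields
    v_i = J_i @ qdot, with soft priorities encoded as weights w_i.
    Solves min sum_i w_i * ||J_i qdot - v_i||^2 + damping * ||qdot||^2,
    i.e. damped (Tikhonov-regularized) weighted least squares."""
    rows_A, rows_b = [], []
    for J, v, w in zip(J_list, v_list, w_list):
        rows_A.append(np.sqrt(w) * J)   # scale each task by sqrt(weight)
        rows_b.append(np.sqrt(w) * v)
    A = np.vstack(rows_A)
    b = np.concatenate(rows_b)
    H = A.T @ A + damping * np.eye(A.shape[1])
    return np.linalg.solve(H, A.T @ b)
```

With two conflicting tasks of weights 10 and 1 on the same coordinate, the solution lands at the weight-proportional compromise, which is the soft-priority behavior the hierarchical formulation relies on.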
|
| |
| 09:00-10:30, Paper TuI1I.143 | Add to My Program |
| LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction |
|
| Resino Solis, Mario | Universidad Carlos III De Madrid |
| Pérez López, Borja | Universidad Carlos III of Madrid |
| Godoy Calvo, Jaime | Universidad Carlos III De Madrid |
| Al-Kaff, Abdulla | University Carlos III of Madrid |
| Garcia, Fernando | Carlos III University |
Keywords: Deep Learning Methods, Intelligent Transportation Systems
Abstract: This work proposes a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments using only LiDAR point clouds. For this purpose, we used a real semi-autonomous vehicle equipped with a Velodyne LiDAR. The system leverages skip connections to improve performance without the extensive resources required by state-of-the-art architectures. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space that allows the original point cloud to be accurately reconstructed. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model successfully reconstructs objects unrelated to the original traffic environment.
|
| |
| 09:00-10:30, Paper TuI1I.144 | Add to My Program |
| NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints |
|
| Huo, Dongjie | Beijing University of Chemical Technology |
| Wang, Junhui | Macau University of Science and Technology |
| Gao, Chao | University of Cambridge |
| Qiao, Yan | Macau University of Science and Technology |
| Zhang, Dong | School of Information Science and Technology, Beijing University of Chemical Technology |
| Zhou, Guyue | Tsinghua University |
|
|
| |
| 09:00-10:30, Paper TuI1I.145 | Add to My Program |
| Design of an Untethered Multi-Mode Swimming Robot Driven by Electromagnetic Actuators |
|
| Yan, Jinchun | The Hong Kong University of Science and Technology (Guangzhou) |
| Lu, Yiyi | The Hong Kong University of Science and Technology (Guangzhou) |
| Li, Qifan | The Hong Kong University of Science and Technology (Guangzhou) |
| Yasa, Oncay | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Biologically-Inspired Robots
Abstract: Underwater robots have significant potential for a wide range of applications, including deep-sea exploration, hydrocarbon extraction, marine biodiversity observation, and waste retrieval. A hybrid actuation system that combines electromagnets and permanent magnets preserves the main benefits of magnetic-driven robots, addressing the issues of bulky coil systems and limited mobility. However, most electromagnetically actuated underwater robots are limited to a fixed swimming mode due to their relatively simple designs, which restrict their adaptability to unpredictable and unstructured aquatic environments. In this work, we present an untethered multi-mode swimming robot driven by four 2-degree-of-freedom (DoF) electromagnetic actuators, each with a rigid shell interconnected by flexible connectors and covered with silicone membranes. Initially, we conducted tests to determine the optimal hardness of the flexible connector by validating the module’s range of motion across different activation times. Next, we demonstrated that the robot can swim forward and backward in a water tank, exhibiting snake-inspired motion, front- and rear-undulation, and wave-shaped motion, and reaching a maximum speed of 87.8 mm/s. Finally, we showed the lateral translation and steering motions achieved with different control signals, resulting in an average turning speed of 3°/s. This approach enables a novel robot design strategy based on compact multi-DoF electromagnetic modules, facilitating potential applications in search-and-rescue missions and environmental inspections.
|
| |
| 09:00-10:30, Paper TuI1I.146 | Add to My Program |
| From Dream to Action: Hierarchical Policy Learning with 3D World Imagination for Robotic Manipulation |
|
| Wang, Wenshuo | National University of Singapore |
| Zhao, Ruiteng | National University of Singapore |
| Teo, Tat Joo | Singapore Institute of Manufacturing Technology |
| Ang Jr, Marcelo H | National University of Singapore |
| Zhu, Haiyue | Agency for Science, Technology and Research (A*STAR) |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Recent advancements in robotics have focused on developing foundation models capable of generating both actions and future states. Typically, these policies leverage world models to depict human-like imagination. However, most methods remain confined to the 2D domain, where they forecast only the final outcome state rather than the evolving interaction process, thereby offering limited guidance for step-by-step control. To address these limitations, we propose a hierarchical framework that couples 3D imagination, 3D perception, and action generation. A triplane-based world model captures future scene dynamics in a computationally efficient manner, providing predictive cues for decision-making. Based on these representations, the action expert, implemented with a flow-based policy network, converts the outputs of 3D imagination and perception into executable commands. We further introduce an adaptive Classifier-Free Guidance strategy to balance action quality with condition adherence. On Adroit, Meta-World, and real-world tasks, our method achieves a 92% voxel IoU in future state prediction and up to 8% higher success rates than state-of-the-art baselines. The performance gains highlight the effectiveness and generalizability of our method in complex robotic manipulation.
|
| |
| 09:00-10:30, Paper TuI1I.147 | Add to My Program |
| Autonomous Robot-Assisted Ureteroscopy (ARA-URS) for the Treatment of Kidney Stones |
|
| Saini, Sarvesh | University of Miami |
| Ojalvo, Julio | University of Miami |
| Katz, Jonathan | University of Miami |
| Visser, Ubbo | University of Miami |
Keywords: Surgical Robotics: Planning, Surgical Robotics: Laparoscopy, Medical Robots and Systems
Abstract: We present a supervised autonomous robot-assisted ureteroscopy (ARA-URS) system for the treatment of kidney stones, integrated with a digital twin (DT). A three-degrees-of-freedom (3-DOF) robotic system was developed to actuate a Wolf disposable ureteroscope, enabling the ARA-URS to autonomously position the ureteroscope for laser lithotripsy procedures. The DT of the robotic system was developed to enable the generation of diverse synthetic intraoperative scenarios, which are used to train vision and control models for improved precision in ARA-URS. By mapping joint motion to virtual counterparts via precise physics simulation, the system ensures realistic representation and reliable validation. This framework’s performance was assessed with particular focus on endoscopic tip positioning. Initial in-air simulation experiments demonstrated root mean square error (RMSE) values of (1.871, 1.725, 1.194) mm in the x-, y-, and z-directions, respectively, computed as the deviation between the desired laser target position and the achieved ureteroscope tip position. Corresponding real-world experiments yielded RMSE values of (5.029, 3.919, 6.681) mm. The comparison between simulated and physical experiments indicates that the DT is able to reproduce the motion behavior of the physical system with good agreement. Further benchtop and simulation experiments demonstrated the system’s capacity for stone targeting (quantified by the percentage of the image occluded by stone). In digital simulation, the ureteroscope achieved 88.77% average stone area coverage, while in the benchtop model, coverage averaged 59.04%. Together, this proof of concept highlights the potential of DT technology in robotic-assisted URS, offering a scalable and interactive platform for refining surgical techniques.
|
| |
| 09:00-10:30, Paper TuI1I.148 | Add to My Program |
| Efficient Multi-Objective Planning with Weighted Maximization Using Large Neighbourhood Search |
|
| Kalavadia, Krishna | University of Waterloo |
| Dutta, Shamak | University of Waterloo |
| Pant, Yash Vardhan | University of Waterloo |
| Smith, Stephen L. | University of Waterloo |
Keywords: Motion and Path Planning, Optimization and Optimal Control, Task and Motion Planning
Abstract: Autonomous navigation often requires the simultaneous optimization of multiple objectives. The most common approach scalarizes these into a single cost function using a weighted sum, but this method is unable to find all possible trade-offs and can therefore miss critical solutions. An alternative, the weighted maximum of objectives, can find all Pareto-optimal solutions, including those in non-convex regions of the trade-off space that weighted sum methods cannot find. However, the increased computational complexity of finding weighted maximum solutions in the discrete domain has limited its practical use. To address this challenge, we propose a novel search algorithm based on the Large Neighbourhood Search framework that efficiently solves the weighted maximum planning problem. Through extensive simulations, we demonstrate that our algorithm achieves comparable solution quality to existing weighted maximum planners with a runtime improvement of 1-2 orders of magnitude, making it a viable option for autonomous navigation.
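The abstract's central claim — that weighted-sum scalarization misses Pareto points in non-convex regions while weighted max does not — can be checked on toy numbers. The cost vectors below are hypothetical, chosen so that point B lies above the chord joining the two extremes:

```python
# Three Pareto-optimal cost vectors for two objectives (lower is better).
# B sits in a non-convex region of the trade-off curve: it lies above the
# line segment joining A and C, so no linear scalarization selects it.
points = {"A": (1.0, 9.0), "B": (6.0, 6.0), "C": (9.0, 1.0)}

def best(weights, scalarize):
    """Point minimizing the scalarized cost for the given weights."""
    return min(points, key=lambda k: scalarize(weights, points[k]))

w_sum = lambda w, f: w[0] * f[0] + w[1] * f[1]        # weighted sum
w_max = lambda w, f: max(w[0] * f[0], w[1] * f[1])    # weighted maximum

# Sweep weights for the weighted sum: B never wins.
sum_winners = {best((i / 20, 1 - i / 20), w_sum) for i in range(21)}
# Weighted max with equal weights selects B, the balanced trade-off.
max_winner = best((0.5, 0.5), w_max)
```

Algebraically, B wins under the weighted sum only if 6 < 9 - 8w and 6 < 1 + 8w simultaneously, which is impossible; under the weighted max with equal weights its cost is 3 versus 4.5 for A and C.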
|
| |
| 09:00-10:30, Paper TuI1I.149 | Add to My Program |
| Guiding Vector Field Generation Via Score-Based Diffusion Model |
|
| Chen, Zirui | Westlake University, Zhejiang University |
| Guo, Shiliang | Westlake University |
| Zhao, Shiyu | Westlake University |
Keywords: Motion and Path Planning, Machine Learning for Robot Control, Autonomous Vehicle Navigation
Abstract: Guiding Vector Fields (GVFs) are a powerful tool for robotic path following. However, classical methods assume smooth, ordered curves and fail when paths are unordered, multi-branch, or generated by probabilistic models. We propose a unified framework, termed the Score-Induced Guiding Vector Field (SGVF), which leverages score-based generative modeling to construct vector fields directly from data distributions. SGVF learns tangent fields from point clouds with unit-norm, orthogonality, and directional-consistency losses, ensuring geometric fidelity and control feasibility. This approach removes the reliance on ad-hoc path segmentation and enables guidance along complex topologies such as branching and pseudo-manifolds. The study establishes a correspondence between score vanishing in diffusion models and GVF singularities and highlights representational capacity near sharp path curvatures. Experiments on robotic navigation in planar environments demonstrate that SGVF achieves reliable path following in scenarios where classical GVFs fail, underscoring its potential as a bridge between generative modeling and geometric control. Code and experiment video are available at https://github.com/czr-gif/Guiding-Vector-Field-Generation-via-Score-based-Diffusion-Model.
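For contrast with the learned score field the abstract proposes, the classical GVF it generalizes has a simple closed form for a planar path given implicitly as the zero level set of a smooth function: a tangent term (the rotated gradient) plus a convergence term that drives the level-set error to zero. A minimal sketch with an analytic circle path (function names are illustrative):

```python
import numpy as np

def gvf(p, phi, grad_phi, k=1.0):
    """Classical guiding vector field for a planar path {phi(p) = 0}:
    rotated-gradient tangent term plus a term pulling phi to zero.
    SGVF in the paper replaces this analytic construction with a
    learned score field; here phi and its gradient are given."""
    E = np.array([[0.0, -1.0], [1.0, 0.0]])  # 90-degree rotation
    g = grad_phi(p)
    return E @ g - k * phi(p) * g

# Unit circle as the desired path: phi = x^2 + y^2 - 1.
phi = lambda p: p[0] ** 2 + p[1] ** 2 - 1.0
grad_phi = lambda p: np.array([2.0 * p[0], 2.0 * p[1]])
```

On the path the field is purely tangential; off the path it gains an inward component. The construction presupposes a single smooth implicit description, which is exactly what breaks for the unordered, multi-branch paths motivating SGVF.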
|
| |
| 09:00-10:30, Paper TuI1I.150 | Add to My Program |
| CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment |
|
| Cui, Wenbo | Institute of Automation, Chinese Academy of Sciences |
| Zhao, Chengyang | Carnegie Mellon University |
| Chen, Yuhui | Chinese Academic of Science, Institute of Automation |
| Li, Haoran | Institute of Automation, Chinese Academy of Sciences |
| Zhang, Zhizheng | Beijing Galbot Co., Ltd |
| Zhao, Dongbin | Chinese Academy of Sciences |
| Wang, He | Peking University |
Keywords: Imitation Learning, Representation Learning
Abstract: The spatial information inherent in 3D point clouds is crucial for robotic manipulation. However, existing 3D pre-training methods face a fundamental trade-off: Masked Autoencoding (MAE) excels at capturing spatial-geometric features but lacks semantics, whereas contrastive learning, while able to distill semantics from 2D foundation models, is ill-suited for the fine-grained details required for manipulation tasks. To address these challenges, we propose CLAR, a novel 3D pre-training framework that synergizes global understanding with fine-grained local alignment. Our framework unifies MAE with global cross-modal contrastive learning to integrate robust spatial awareness with rich semantic understanding. To enhance its focus on fine-grained details, at the local level, we introduce an adaptive alignment mechanism that leverages deformable attention to force precise correspondences between local 3D geometry and 2D visual features, thereby overcoming the limitations of conventional global alignment in manipulation tasks. Extensive experiments in simulation and the real world demonstrate that CLAR achieves state-of-the-art performance, significantly outperforming existing methods in visuomotor policy learning.
|
| |
| 09:00-10:30, Paper TuI1I.151 | Add to My Program |
| Turning Disturbances into Actuation: Hierarchical Environment-Assisted MPC for USV Fault Recovery |
|
| Hu, Yang | University College London |
| Aldhaheri, Sara | University College London; TII |
| Wang, Yanchao | University College London |
| Wu, Peng | University College London |
| Liu, Yuanchang | University College London |
Keywords: Marine Robotics, Robust/Adaptive Control, Failure Detection and Recovery
Abstract: Thruster failures in unmanned surface vehicles (USVs) can critically compromise mission completion, particularly when severe degradation eliminates controllability in essential degrees of freedom. While traditional fault-tolerant control treats environmental disturbances as impediments to be rejected, this paper presents a novel approach: strategically exploiting wind and wave forces as virtual actuators for emergency harbor return. The proposed environment-assisted model predictive control (EAMPC) framework adaptively modulates environmental force utilization factors based on fault severity and the environmental force prediction confidence, transforming natural disturbances into environmental assistance. The hierarchical architecture integrates state estimation and prediction with physics-informed learning for short-term environmental forces, and reachability-based trajectory planning that exploits environmental forces to expand feasible zones. Theoretical analysis establishes practical input-to-state stability with explicit bounds quantifying degradation. Extensive validation across 320 trials demonstrates 91.25% mission success under 95% thruster degradation compared to 0% for baseline methods. This work demonstrates that strategic environmental exploitation fundamentally transforms fault recovery capabilities in marine robotics.
|
| |
| 09:00-10:30, Paper TuI1I.152 | Add to My Program |
| Explicit Analytical Derivation of LSL and RSR Dubins' Paths for Intercepting a Uniformly Moving Target |
|
| Belier, Titouan | ENSTA |
| Lapierre, Lionel | LIRMM |
| Viel, Christophe | CNRS, Lab-STICC |
| Degorre, Loick | Université De Toulon |
| Massé, Damien | Université De Bretagne Occidentale |
Keywords: Nonholonomic Motion Planning, Constrained Motion Planning, Autonomous Vehicle Navigation
Abstract: Autonomous underwater robotics faces significant challenges, particularly in the reliable recovery of Autonomous Underwater Vehicles (AUVs) after mission completion. To address this, small AUVs can dock onto a moving mothership for safe transport to recovery sites, reducing operational risks. This paper presents an explicit analytical derivation of the LSL (Left-Straight-Left) and RSR (Right-Straight-Right) Dubins' Paths for intercepting a uniformly moving target, a critical problem for robust rendezvous in dynamic marine and underwater environments. The proposed approach leverages the classical Dubins' Path model to generate time-optimal, real-time, curvature-constrained paths suitable for 2D AUVs. Experimental validation on an Unmanned Surface Vehicle (USV) demonstrates the effectiveness of the developed motion planning strategy.
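The building block behind the derivation is the classical closed-form LSL path between two fixed poses; the moving-target intercept adds a time-of-arrival condition on top of it. A sketch of the fixed-target LSL length (RSR is the mirror case with right-turn circle centers and clockwise arcs); names are illustrative:

```python
import math

def mod2pi(a):
    """Wrap an angle into [0, 2*pi)."""
    return a % (2.0 * math.pi)

def lsl_length(start, goal, r):
    """Length of the LSL Dubins path between poses (x, y, theta)
    with minimum turning radius r, for a FIXED goal pose."""
    x0, y0, th0 = start
    x1, y1, th1 = goal
    # Centers of the left-turn circles at start and goal.
    cx0, cy0 = x0 - r * math.sin(th0), y0 + r * math.cos(th0)
    cx1, cy1 = x1 - r * math.sin(th1), y1 + r * math.cos(th1)
    dx, dy = cx1 - cx0, cy1 - cy0
    d = math.hypot(dx, dy)          # straight-segment length
    alpha = math.atan2(dy, dx)      # heading of the straight segment
    # First left arc, straight segment, second left arc.
    return r * mod2pi(alpha - th0) + d + r * mod2pi(th1 - alpha)
```

For aligned poses on a line the two arcs vanish and the length reduces to the Euclidean distance, a quick sanity check on the geometry.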
|
| |
| 09:00-10:30, Paper TuI1I.153 | Add to My Program |
| CDC-SLAM: Switchable Centralized and Distributed Collaborative LiDAR SLAM Framework for Robotic Swarms |
|
| Liu, Xiangnan | Guangdong University of Technology |
| Huo, Xiang | Guangdong University of Technology |
| Zhu, Haifei | Guangdong University of Technology |
| Guan, Yisheng | Guangdong University of Technology |
| Zhang, Hong | SUSTech |
| Chen, Weinan | Guangdong University of Technology |
|
|
| |
| 09:00-10:30, Paper TuI1I.154 | Add to My Program |
| B^2F-Map: Crowd-Sourced Mapping with Bayesian B-Spline Fusion |
|
| Xie, Yiping | Linköping University |
| Xia, Yuxuan | Shanghai Jiao Tong University |
| Stenborg, Erik | Zenseact |
| Fu, Junsheng | Zenseact |
| Beauvisage, Axel | Zenseact AB |
| Garcia, Gabriel E. | Zenseact AB |
| Wu, Tianyu | Zenseact |
| Hendeby, Gustaf | Linköping University |
Keywords: Mapping, Intelligent Transportation Systems, Sensor Fusion
Abstract: Crowd-sourced mapping offers a scalable alternative to creating maps using traditional survey vehicles. Yet, existing methods either rely on prior high-definition (HD) maps or neglect uncertainties in map fusion. In this work, we present a complete pipeline for HD map generation using production vehicles equipped only with a monocular camera, consumer-grade GNSS, and IMU. Our approach includes on-cloud localization using lightweight standard-definition maps, on-vehicle mapping via an extended object trajectory (EOT) Poisson multi-Bernoulli (PMB) filter with Gibbs sampling, and on-cloud multi-drive optimization and Bayesian map fusion. We represent the lane lines using B-splines, where each B-spline is parameterized by a sequence of Gaussian distributed control points, and propose a novel Bayesian fusion framework for B-spline trajectories with differing density representations, enabling principled handling of uncertainties. We evaluate our proposed approach, B^2F-Map, on large-scale real-world datasets collected across diverse driving conditions and demonstrate that our method is able to produce geometrically consistent lane-level maps.
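The core ingredient behind uncertainty-aware fusion of Gaussian-distributed control points is the standard information-form product of Gaussians. This toy sketch shows that step only; the paper's framework additionally handles B-splines with differing control-point densities, which this sketch does not attempt.

```python
import numpy as np

def fuse_control_points(mu_a, cov_a, mu_b, cov_b):
    """Bayesian (information-form) fusion of two independent Gaussian
    estimates of the same B-spline control point:
      cov_f = (cov_a^-1 + cov_b^-1)^-1
      mu_f  = cov_f (cov_a^-1 mu_a + cov_b^-1 mu_b)
    Illustrative sketch of the uncertainty-handling idea only."""
    info_a = np.linalg.inv(cov_a)   # information (inverse-covariance) form
    info_b = np.linalg.inv(cov_b)
    cov_f = np.linalg.inv(info_a + info_b)
    mu_f = cov_f @ (info_a @ mu_a + info_b @ mu_b)
    return mu_f, cov_f
```

Two equally confident drives disagreeing on a control point average it and halve the covariance, while a less certain drive is down-weighted automatically.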
|
| |
| 09:00-10:30, Paper TuI1I.155 | Add to My Program |
| A Spatiotemporal Brain Activity Visualization and Assessment Framework for Human-Robot Cognitive Interaction Training |
|
| Huang, Zonghai | University of Electronic Science and Technology of China |
| Zhang, Lianchi | University of Electronic Science and Technology of China |
| Zhang, Jingting | University of Electronic Science and Technology of China |
| Mu, Fengjun | University of Electronic Science and Technology of China |
| Huang, Rui | University of Electronic Science and Technology of China |
| Cheng, Hong | University of Electronic Science and Technology |
Keywords: Brain-Machine Interfaces, Virtual Reality and Interfaces, Cognitive Modeling
Abstract: Accurately assessing brain activity to modulate training parameters online is crucial for improving human-robot cognitive interaction (HRCI) performance in closed-loop brain training. The major challenge for this technique lies in how to accurately model and characterize the intrinsic behavior of brain activity in the HRCI process, which typically evolves dynamically across spatial and temporal scales. In this study, we propose a dynamic perspective to visualize the spatiotemporal evolution of brain activity during the HRCI process, thus enabling assessment of brain states and adaptive modulation during rehabilitation. A novel framework is developed to model the spatiotemporal dynamics of brain activity by integrating deterministic learning with neural population theory. It demonstrates a remarkable capability to mine and visualize the complex nonlinear dynamics of brain activity, encompassing both temporal evolution and spatial connectivity patterns. The proposed model not only visualizes spatiotemporal brain dynamics but also enables online assessment of brain states, which can facilitate optimal modulation of the HRCI process and improve brain training efficiency. The method is validated using a panoramic virtual reality system. Results show that our method improves the accuracy of brain activity assessment by 8.86%, effectively demonstrating that it accurately visualizes spatiotemporal brain dynamics and enhances training outcomes when integrated with HRCI.
|
| |
| 09:00-10:30, Paper TuI1I.156 | Add to My Program |
| Thermal Image Refinement with Depth Estimation Using Recurrent Networks for Monocular ORB-SLAM3 |
|
| Şahin, Hürkan | Universität Paderborn |
| Pham, Huy | Aarhus University |
| Dang, Van Huyen | Paderborn University |
| Yegenoglu, Alper | Automatic Control Group, University of Paderborn, Paderborn, Germany |
| Kayacan, Erdal | Paderborn University |
|
|
| |
| 09:00-10:30, Paper TuI1I.157 | Add to My Program |
| DSPv2: Improved Dense Policy for Effective and Generalizable Whole-Body Mobile Manipulation |
|
| Su, Yue | Xidian University |
| Zhang, Chubin | Tsinghua University |
| Chen, Sijin | The University of Hong Kong |
| Tan, Liufan | Chongqing University |
| Tang, Yansong | Tsinghua University |
| Wang, Jianan | Astribot |
| Liu, Xihui | The University of Hong Kong |
Keywords: Imitation Learning, Bimanual Manipulation, Mobile Manipulation
Abstract: Learning whole-body mobile manipulation via imitation is essential for generalizing robotic skills to diverse environments and complex tasks. However, this goal is hindered by significant challenges, particularly in effectively processing complex observations, achieving robust generalization, and generating coherent actions. To address these issues, we propose DSPv2, a novel policy architecture. DSPv2 introduces an effective encoding scheme that aligns 3D spatial features with multi-view 2D semantic features. This fusion enables the policy to achieve broad generalization while retaining the fine-grained perception necessary for precise control. Furthermore, we extend the Dense Policy paradigm to the whole-body mobile manipulation domain, demonstrating its effectiveness in generating coherent and precise actions for the whole-body robotic platform. Extensive experiments show that our method significantly outperforms existing approaches in both task performance and generalization ability. Project page is available at: https://selen-suyue.github.io/DSPv2Net/.
|
| |
| 09:00-10:30, Paper TuI1I.158 | Add to My Program |
| StepNav: Structured Trajectory Priors for Efficient and Multimodal Visual Navigation |
|
| Luo, Xubo | University of Chinese Academy of Sciences |
| Wu, Aodi | University of Chinese Academy of Sciences |
| Han, Haodong | University of Chinese Academy of Sciences |
| Wan, Xue | Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences |
| Zhang, Wei | Chinese Academy of Sciences |
| Shu, Leizheng | Chinese Academy of Sciences |
| Wang, Ruisuo | Chinese Academy of Sciences Technology and Engineering Center for Space Utilization |
|
|
| |
| 09:00-10:30, Paper TuI1I.159 | Add to My Program |
| Leveraging Two Robotic Arms for Tight Assembly Performance Gains |
|
| Livnat, Dror | Tel Aviv University |
| Lavi, Yuval | Tel Aviv University |
| Bilevich, Michael M. | Tel Aviv University |
| Halperin, Dan | Tel Aviv University |
Keywords: Dual Arm Manipulation, Assembly, Motion and Path Planning
Abstract: We provide a novel end-to-end framework for the execution of an assembly operation by two robotic arms, given the digital CAD models of the parts and their desired relative placement in their assembled state. We analyze and demonstrate the advantages of using two robotic arms simultaneously in tight assembly operations, compared to single-arm systems. Our method is implemented in both simulation and using physical robots. It provides theoretical guarantees on execution time and trajectory accuracy, supported by empirical evidence. In particular, we show that coordinated movement of two arms reduces average execution time by more than 50% compared to using a single arm only, produces higher-quality trajectories, and accelerates the search for valid robot placements. Furthermore, we establish bounds on the required dimensions of the robotic cell. Our open-source software, together with real-life video demonstrations, is available on our project page.
|
| |
| 09:00-10:30, Paper TuI1I.160 | Add to My Program |
| NLiPsCalib: An Efficient Calibration Framework for High-Fidelity 3D Reconstruction of Curved Visuotactile Sensors |
|
| Qin, Xuhao | Shanghaitech University |
| Zhao, Feiyu | ShanghaiTech University |
| Leng, Yatao | ShanghaiTech University |
| Hu, Runze | ShanghaiTech University |
| Xiao, Chenxi | ShanghaiTech University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators
Abstract: Recent advances in visuotactile sensors increasingly employ biomimetic curved surfaces to enhance sensorimotor capabilities. Although such curved visuotactile sensors enable more conformal object contact, their perceptual quality is often degraded by non-uniform illumination, which reduces reconstruction accuracy and typically necessitates calibration. Existing calibration methods commonly rely on customized indenters and specialized devices to collect large-scale photometric data, but these processes are expensive and labor-intensive. To overcome these calibration challenges, we present NLiPsCalib, a physics-consistent and efficient calibration framework for curved visuotactile sensors. NLiPsCalib integrates controllable near-field light sources and leverages Near-Light Photometric Stereo (NLiPs) to estimate contact geometry, simplifying calibration to just a few simple contacts with everyday objects. We further introduce NLiPsTac, a controllable-light-source tactile sensor developed to validate our framework. Experimental results demonstrate that our approach enables high-fidelity 3D reconstruction across diverse curved form factors with a simple calibration procedure. We emphasize that our approach lowers the barrier to developing customized visuotactile sensors of diverse geometries, thereby making visuotactile sensing more accessible to the broader community.
|
| |
| 09:00-10:30, Paper TuI1I.161 | Add to My Program |
| Near-Field Driven Origami-Based Bio-Inspired Jellyfish Robot |
|
| Wang, Sen | The University of Hong Kong |
| Lu, Chengxiang | The University of Hong Kong |
| Chen, Lepeng | Northwestern Polytechnical University |
| Tang, Yi | University of Hong Kong |
| Chen, Mansen | City University of Hong Kong |
| Zhang, Shuai | The University of Hong Kong |
| Liu, Jun | The University of Hong Kong |
Keywords: Biologically-Inspired Robots, Soft Robot Applications
Abstract: The development of bio-inspired jellyfish robots holds significant benefits for autonomous aquatic systems due to jellyfish’s efficient water jet propulsion. However, the current design of jellyfish robots still faces challenges in balancing high biological fidelity with the demands of lightweight, compact design and high-speed locomotion. This study presents a bio-inspired jellyfish robot that emulates the shape and efficient water jet propulsion of natural jellyfish. The origami-based bell and supporting structure are designed to form an efficient water-jet cavity. As the primary components of the robot, they endow the robot with the ability to self-recover to a stable state during locomotion, thereby reducing energy consumption. Additionally, they replace traditional transmission mechanisms, thereby reducing the weight and complexity of the robot. An optimization model is established to determine the optimal parameters of the robot. Furthermore, a near-field magnetic actuation system is designed to drive the robot, enabling contactless and silent underwater driving without waterproofing requirements. The robot features a diameter of 101.6 mm, a height of 63.8 mm, and a weight of 12.5 g. Experimental results demonstrate a maximum locomotion speed of up to 96.2 mm/s.
|
| |
| 09:00-10:30, Paper TuI1I.162 | Add to My Program |
| As You Wish: Mission Planning with Formal Verification Using LLMs in Precision Agriculture |
|
| Zuzuarregui, Marcos | University of California, Merced |
| Carpin, Stefano | University of California, Merced |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, AI-Enabled Robotics
Abstract: Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, it has been proposed to use large language models (LLMs) to synthesize mission plans in precision agriculture and other domains based on mission descriptions provided in natural language (NL). While these systems demonstrate impressive performance, they also suffer from the inherent ambiguities of NL. In this paper, we address this issue by introducing a planning architecture that combines LLMs with linear temporal logic (LTL) to ensure that, through formal verification, the mission planning system meets the specifications formulated by the user while still using NL. In our proposed system, the mission plan is seen as the implementation and the LTL formalization is seen as the specification. Both are automatically extracted from mission descriptions provided in NL. To mitigate potential bias, two separate LLMs are tasked with the implementation and specification generation. Through feedback loops, the system self-corrects when syntax or verification errors are encountered, thus offering a fully hands-off solution. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.
|
| |
| 09:00-10:30, Paper TuI1I.163 | Add to My Program |
| DiSPo: Diffusion-SSM Based Policy Learning for Coarse-To-Fine Action Discretization |
|
| Oh, Nayoung | Korea Advanced Institute of Science and Technology, KAIST |
| Jang, Jaehyeong | Korea Advanced Institute of Science and Technology, KAIST |
| Jung, Moonkyeong | Korea Advanced Institute of Science and Technology, KAIST |
| Park, Daehyung | Korea Advanced Institute of Science and Technology, KAIST |
Keywords: Machine Learning for Robot Control, Industrial Robots, Assembly
Abstract: We aim to solve the problem of learning user-intended granular skills from multi-granularity demonstrations. Traditional learning-from-demonstration methods typically rely on extensive fine-grained data, interpolation techniques, or dynamics models, which are ineffective at encoding or decoding the diverse granularities inherent in skills. To overcome this, we introduce a novel diffusion-SSM based policy (DiSPo) that leverages a state-space model, Mamba, to learn from diverse coarse demonstrations and generate multi-scale actions. Our proposed step-scaling mechanism in Mamba is a key innovation, enabling memory-efficient learning, flexible granularity adjustment, and robust representation of multi-granularity data. DiSPo outperforms state-of-the-art baselines on coarse-to-fine benchmarks, achieving up to an 81% improvement in success rates while enhancing inference efficiency by generating inexpensive coarse motions where applicable. We validate DiSPo's scalability and effectiveness on real-world manipulation scenarios. Code and videos are available at https://robo-dispo.github.io.
|
| |
| 09:00-10:30, Paper TuI1I.164 | Add to My Program |
| CFEAR-Teach-And-Repeat: Fast and Accurate Radar-Only Localization |
|
| Hilger, Maximilian | Technical University of Munich |
| Adolfsson, Daniel | Örebro University |
| Becker, Ralf | Company Bosch Rexroth |
| Andreasson, Henrik | Örebro University |
| Lilienthal, Achim J. | TU Munich |
Keywords: Localization, Mapping, Field Robots
Abstract: Reliable localization in prior maps is essential for autonomous navigation, particularly under adverse weather, where optical sensors may fail. We present CFEAR-TR, a teach-and-repeat localization pipeline using a single spinning radar, which is designed for easily deployable, lightweight, and robust navigation in adverse conditions. Our method localizes by jointly aligning live scans both to stored scans from the teach mapping pass and to a sliding window of recent live keyframes. This ensures accurate and robust pose estimation across different seasons and weather phenomena. Radar scans are represented using a sparse set of oriented surface points, computed from Doppler-compensated measurements. The map is stored in a pose graph that is traversed during localization. Experiments on the held-out test sequences from the Boreas dataset show that CFEAR-TR can localize with an accuracy as low as 0.117 m and 0.096°, corresponding to improvements of up to 63% over the previous state of the art, while running efficiently at 29 Hz. These results substantially narrow the gap to lidar-level localization, particularly in heading estimation. We make the C++ implementation of our work available to the community.
|
| |
| 09:00-10:30, Paper TuI1I.165 | Add to My Program |
| UM3D: Towards a Unified Multimodal 3D Shape Generation Model |
|
| Han, Xian-Feng | Southwest University |
| Zhang, Zecheng | Southwest University |
| He, Xuran | SouthWest University |
| Wang, Ming-Jie | Zhejiang Sci-Tech University |
Keywords: Deep Learning for Visual Perception, Visual Learning
Abstract: Vision-Language Pre-training models (VLMs) have emerged as a highly promising solution to the generative problem, achieving remarkable success in the field of 2D image generation. However, extending these 2D paradigms to 3D domains is still unexplored due to the scarcity of text-3D pairs and shape ambiguity. To address this challenge, we introduce UM3D, a two-stage pre-training architecture towards unified multimodal 3D shape generation. Our approach first optimizes a Finite Scalar Quantization based Autoencoder (FSQ-AE) to learn a compact yet powerful implicit representation with improved codebook utilization. We then encode sketch features into CLIP's multimodal embedding space to incorporate additional geometric information. This unified space conditions our well-designed Instance-Normalized Glow model (Glow-IN) to model the distribution of 3D shape representations while mitigating distribution shift issues. During inference, UM3D can accept individual text, image, sketch, or combined inputs to generate corresponding 3D shapes. Quantitative and qualitative evaluations confirm our method's effectiveness in synthesizing high-fidelity, input-consistent 3D geometries.
|
| |
| 09:00-10:30, Paper TuI1I.166 | Add to My Program |
| Task-Aware and Structure-Knowledge-Guided Quantization for End-To-End YOLO Object Detection |
|
| Zhu, MingHua | National University of Defense Technology |
| Li, Liangwei | National University of Defense Technology |
| Zhou, Shunan | National University of Defense Technology |
| Jiang, Jingfei | National University of Defense Technology |
| Xu, Jinwei | National University of Defense Technology |
Keywords: Deep Learning for Visual Perception, Visual Learning, Computer Vision for Transportation
Abstract: The YOLO series of models are pivotal for real-time object detection, yet their deployment on resource-constrained edge devices necessitates effective model compression. Post-Training Quantization (PTQ) offers a promising, low-cost solution, but existing methods, primarily designed for classification tasks, often lead to significant performance degradation when applied to YOLO models. In this paper, we systematically analyze the key challenges in quantizing YOLO architectures. We identify three primary obstacles: (1) the high sensitivity of detection tasks to quantization errors, exacerbated by the non-linear IoU metric; (2) the pronounced long-tail distribution of activations, particularly with the SiLU function, which complicates low-bit quantization; and (3) the structural heterogeneity of the multi-scale, multi-task detection head, which renders conventional block-wise quantization strategies ineffective. To address these challenges, we propose a novel framework, Task-Aware and Structure-Knowledge-guided Quantization (TASKQ). Our framework introduces three key components: a sparse quantization strategy to mitigate the impact of long-tailed activations, a Detection-aware Task Regularization (DTR) mechanism that incorporates IoU-based loss to guide parameter fine-tuning, and a Scale-and-Task-Aware Head-wise Quantization (STAHQ) scheme that aligns quantization granularity with the head's functional structure. Extensive experiments on various YOLO models demonstrate that TASKQ significantly outperforms existing PTQ methods, especially in low-bit scenarios, establishing a new state-of-the-art for end-to-end YOLO quantization.
|
| |
| 09:00-10:30, Paper TuI1I.167 | Add to My Program |
| Developing Vision-Language-Action Model from Egocentric Videos |
|
| Yoshida, Tomoya | Kyoto University |
| Kurita, Shuhei | National Institute of Informatics |
| Nishimura, Taichi | Sony Interactive Entertainment |
| Mori, Shinsuke | Kyoto University |
Keywords: Data Sets for Robot Learning, Visual Learning, Learning from Demonstration
Abstract: Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such as detailed hand-pose recordings. Consequently, it remains unclear whether VLAs can be trained directly from raw egocentric videos. In this work, we address this challenge by leveraging EgoScaler, a framework that extracts 6DoF object manipulation trajectories from egocentric videos without requiring auxiliary recordings. We apply EgoScaler to four large-scale egocentric video datasets and automatically refine noisy or incomplete trajectories, thereby constructing a new large-scale dataset for VLA pre-training. Our experiments with a state-of-the-art pi_0 architecture in both simulated and real-robot environments yield three key findings: (i) pre-training on our dataset improves task success rates by over 20% compared to training from scratch, (ii) the performance is competitive with that achieved using real-robot datasets, and (iii) combining our dataset with real-robot data yields further improvements. These results demonstrate that egocentric videos constitute a promising and scalable resource for advancing VLA research.
|
| |
| 09:00-10:30, Paper TuI1I.168 | Add to My Program |
| PIPS: Planar Instance 3D Reconstruction Leveraging Planar Structural Priors |
|
| Wang, Jiahui | Beijing Institute of Technology |
| Chen, Ye | Beijing Institute of Technology |
| Deng, Yinan | Beijing Institute of Technology |
| Yang, Yi | Beijing Institute of Technology |
| Yue, Yufeng | Beijing Institute of Technology |
Keywords: Mapping
Abstract: Planar structures, ubiquitous in man-made indoor environments, enable compact and accurate scene abstraction for various downstream tasks. Recent methods distill planar features into learning-based MVS geometries to obtain coherent 3D plane estimation from multi-view inputs. However, the lack of explicit planar instance definitions hinders semantic–geometry alignment, leading to distorted geometry and mismatched semantics. To address this, we propose PIPS, a planar-instance 3D reconstruction method that leverages planar structural priors for both single-view planar segmentation (SGPS module) and multi-view instance association (MVPI module). The planar instance point clouds are regularized by planar distances and then converted into complete planar meshes via an instance-level planar meshing strategy. Extensive experiments on hundreds of indoor scenes demonstrate the superior performance of our method, which is less dependent on annotations and requires no feature optimization. The effectiveness of each component is further verified through comprehensive ablation studies. The project page of PIPS is available at https://pips325.github.io.
|
| |
| 09:00-10:30, Paper TuI1I.169 | Add to My Program |
| Hydrosoft: Non-Holonomic Hydroelastic Models for Compliant Tactile Manipulation |
|
| Oller, Miquel | University of Michigan |
| Dang, An | University of Michigan |
| Fazeli, Nima | University of Michigan |
Keywords: Contact Modeling, Dexterous Manipulation, Bimanual Manipulation
Abstract: Tactile sensors have long been valued for their perceptual capabilities, offering rich insights into the otherwise hidden interface between the robot and grasped objects. Yet their inherent compliance—a key driver of force-rich interactions—remains underexplored. The central challenge is to capture the complex, nonlinear dynamics introduced by these passive compliant elements. Here, we present a computationally efficient non-holonomic hydroelastic model that accurately models path-dependent contact force distributions and dynamic surface area variations. Our insight is to extend the object’s state space, explicitly incorporating the distributed forces generated by the compliant sensor. Our differentiable formulation not only accounts for path-dependent behavior but also enables gradient-based trajectory optimization, seamlessly integrating with high-resolution tactile feedback. We demonstrate the effectiveness of our approach across a range of simulated and real-world experiments and demonstrate the importance of modeling the path dependence of sensor dynamics.
|
| |
| 09:00-10:30, Paper TuI1I.170 | Add to My Program |
| Joint Flow Trajectory Optimization for Feasible Robot Motion Generation from Video Demonstrations |
|
| Dong, Xiaoxiang | Carnegie Mellon University |
| Johnson-Roberson, Matthew | Carnegie Mellon University; Vanderbilt University |
| Zhi, Weiming | The University of Sydney; Vanderbilt University |
Keywords: Learning from Demonstration, Probabilistic Inference
Abstract: Learning from human video demonstrations offers a scalable alternative to teleoperation or kinesthetic teaching, but poses challenges for robot manipulators due to embodiment differences and joint feasibility constraints. We address this problem by proposing the Joint Flow Trajectory Optimization (JFTO) framework for grasp pose generation and object trajectory imitation under the video-based Learning from Demonstration (LfD) paradigm. Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides, balancing three objectives: (i) selecting a feasible grasp pose, (ii) generating object trajectories consistent with demonstrated motions, and (iii) ensuring collision-free execution within robot kinematics. To capture the multimodal nature of demonstrations, we extend flow matching to SE(3) for probabilistic modeling of object trajectories, enabling density-aware imitation that avoids mode collapse. The resulting optimization integrates grasp similarity, trajectory likelihood, and collision penalties into a unified differentiable objective. We validate our approach in both simulation and real-world experiments across diverse real-world manipulation tasks.
|
| |
| 09:00-10:30, Paper TuI1I.171 | Add to My Program |
| CEDex: Cross-Embodiment Dexterous Grasp Generation at Scale from Human-Like Contact Representations |
|
| Wu, Zhiyuan | King's College London |
| Potamias, Rolandos Alexandros | Imperial College London |
| Zhang, Xuyang | King's College London |
| Zhang, Zhongqun | Nankai University |
| Deng, Jiankang | Imperial College London |
| Luo, Shan | King's College London |
Keywords: Grasping, Dexterous Manipulation
Abstract: Cross-embodiment dexterous grasp synthesis refers to adaptively generating and optimizing grasps for various robotic hands with different morphologies. This capability is crucial for achieving versatile robotic manipulation in diverse environments and requires substantial amounts of reliable and diverse grasp data for effective model training and robust generalization. However, existing approaches either rely on physics-based optimization that lacks human-like kinematic understanding or require extensive manual data collection processes that are limited to anthropomorphic structures. In this paper, we propose CEDex, a novel cross-embodiment dexterous grasp synthesis method at scale that bridges human grasping kinematics and robot kinematics by aligning robot kinematic models with generated human-like contact representations. Given an object's point cloud and an arbitrary robotic hand model, CEDex first generates human-like contact representations using a Conditional Variational Auto-encoder pretrained on human contact data. It then performs kinematic human contact alignment through topological merging to consolidate multiple human hand parts into unified robot components, followed by a signed distance field-based grasp optimization with physics-aware constraints. Using CEDex, we construct the largest cross-embodiment grasp dataset to date, comprising 500K objects across four gripper types with 20M total grasps. Extensive experiments show that CEDex outperforms state-of-the-art approaches and our dataset benefits cross-embodiment grasp learning with high-quality diverse grasps.
|
| |
| 09:00-10:30, Paper TuI1I.172 | Add to My Program |
| Biarticular Rigid Powered Lower Extremity Exoskeleton Robot |
|
| Chen, Tianchi | Chongqing University |
| Liu, Zhi | Chongqing University |
| Li, Chaoyang | Chongqing University |
| Chen, Xiaoan | Chongqing University |
| Hu, Jianjun | Chongqing University |
| Wu, Jiaxun | Chongqing University |
| He, Ye | Chongqing University |
Keywords: Prosthetics and Exoskeletons, Biologically-Inspired Robots, Hardware-Software Integration in Robotics
Abstract: Lower extremity exoskeletons designed for multi-joint assistance are increasingly explored for rehabilitation and human augmentation. However, conventional monoarticular designs often suffer from joint misalignment and actuator redundancy, limiting their efficiency and user comfort. This study presents a biarticular rigid powered lower extremity exoskeleton that simultaneously assists the knee and ankle joints through a single actuator, enabling coordinated torque generation across adjacent joints. A hierarchical control framework combining gait segmentation, impedance-based torque generation, and gravity/friction compensation is implemented to provide phase-specific assistance. Experimental results show that the proposed exoskeleton reduces gastrocnemius activation by up to 63.4% and metabolic cost by up to 11.6% during stair ascent, with corresponding reductions of 28.3% and 8.2% during level walking. These findings demonstrate the effectiveness of the biarticular and underactuated structure in enhancing locomotor efficiency, highlighting its potential as a compact and practical solution for dynamic and diverse mobility scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.173 | Add to My Program |
| Real-Time Online Learning for Model Predictive Control Using a Spatio-Temporal Gaussian Process Approximation |
|
| Bartels, Lars | ETH Zürich |
| Lahr, Amon | ETH Zürich |
| Carron, Andrea | ETH Zurich |
| Zeilinger, Melanie N. | ETH Zurich |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: Learning-based model predictive control (MPC) can enhance control performance by correcting for model inaccuracies, enabling more precise state trajectory predictions than traditional MPC. A common approach is to model unknown residual dynamics as a Gaussian process (GP), which leverages data and also provides an estimate of the associated uncertainty. However, the high computational cost of online learning poses a major challenge for real-time GP-MPC applications. This work presents an efficient implementation of an approximate spatio-temporal GP model, offering online learning at constant computational complexity. It is optimized for GP-MPC, where it enables improved control performance by learning more accurate system dynamics online in real-time, even for time-varying systems. The performance of the proposed method is demonstrated by simulations and hardware experiments in the exemplary application of autonomous miniature racing.
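To see why online learning is the bottleneck the paper targets: exact GP regression of residual dynamics costs O(N^3) in the number of data points N, which is what a constant-complexity recursive approximation avoids. The sketch below is the naive exact GP with an RBF kernel, not the paper's spatio-temporal approximation; function names and hyperparameters are illustrative.

```python
import numpy as np

def rbf(X1, X2, ls=1.0, var=1.0):
    """Squared-exponential (RBF) kernel matrix between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

def gp_predict(X, y, Xs, noise=1e-6):
    """Exact GP posterior mean and pointwise variance of a residual-dynamics
    term at query inputs Xs, given data (X, y). The O(N^3) solve below is
    the cost that online-learning GP-MPC approximations sidestep."""
    K = rbf(X, X) + noise * np.eye(len(X))   # noisy training covariance
    Ks = rbf(Xs, X)                          # query-to-train covariance
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha
    cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)
```

With near-zero noise the posterior mean interpolates the training targets and the predictive variance collapses at the training inputs, which is the uncertainty estimate a GP-MPC scheme can propagate through its predictions.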
|
| |
| 09:00-10:30, Paper TuI1I.174 | Add to My Program |
| R2-LIO: Real-Time and Robust LiDAR-Inertial Odometry in Dynamic Environments |
|
| Changjun, Gu | Chongqing University of Posts and Telecommunications |
| Huang, Ziyi | Chongqing University of Posts and Telecommunications |
| Sun, Gan | South China University of Technology |
| Dong, Jiahua | Mohamed Bin Zayed University of Artificial Intelligence |
| Leng, Jiaxu | Chongqing University of Posts and Telecommunication |
| Gao, Xinbo | Chongqing University of Posts and Telecommunications |
Keywords: Localization, SLAM, Mapping
Abstract: LiDAR-Inertial Odometry (LIO) is crucial for robot navigation and autonomous driving. Most existing methods rely on the assumption of a static environment, indiscriminately using all LiDAR measurements for localization. However, LiDAR data acquired in urban scenes often contain dynamic objects such as vehicles and pedestrians, which can adversely affect localization accuracy—particularly when using solid-state LiDAR with a relatively narrow field of view. To address this issue, we propose a novel Real-time and Robust solid-state LiDAR-Inertial Odometry (R2-LIO) framework that removes dynamic objects to improve localization accuracy and robustness. Specifically, we design a dynamic point removal mechanism based on voxel state changes, which removes dynamic points while preserving most static points to effectively reduce interference from dynamic objects. In addition, we introduce a line search mechanism into the Error State Iterated Kalman Filter (ESIKF) to improve localization accuracy. Experimental results on the challenging YULAN and HeLiPR datasets show that R2-LIO surpasses existing methods, verifying its effectiveness in improving localization accuracy and robustness.
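The voxel-state-change idea behind the dynamic point removal can be sketched in a few lines. The following is a purely illustrative approximation (the function name, voxel size, and hit threshold are invented here), not the R2-LIO implementation: a voxel occupied in only a few of the accumulated scans is treated as dynamic, and points of the latest scan falling into such voxels are dropped.

```python
import numpy as np

def filter_dynamic_points(scans, voxel_size=0.5, min_hits=3):
    """Illustrative voxel-state-change filter (hypothetical, not R2-LIO's code).

    A voxel occupied in fewer than `min_hits` of the accumulated scans is
    treated as dynamic; points of the latest scan in such voxels are dropped.
    """
    # Count, per voxel, in how many scans it was observed occupied.
    hit_counts = {}
    for scan in scans:
        voxels = {tuple(v) for v in np.floor(scan / voxel_size).astype(int)}
        for v in voxels:
            hit_counts[v] = hit_counts.get(v, 0) + 1

    # Keep points of the latest scan whose voxel was stably occupied.
    latest = scans[-1]
    keys = [tuple(v) for v in np.floor(latest / voxel_size).astype(int)]
    mask = np.array([hit_counts[k] >= min_hits for k in keys])
    return latest[mask]
```

A real system would additionally handle ray casting through voxels and sensor motion, which this sketch omits.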
|
| |
| 09:00-10:30, Paper TuI1I.175 | Add to My Program |
| MorphoBall: A Bio-Inspired Transformable Spherical Robot with Dual Terrestrial Gaits and Surface Swimming Capability |
|
| Liu, Jinyuan | Zhejiang University of Technology |
| Tian, Guangzhi | HKUST |
| Jin, Yuqiang | Zhejiang University of Technology |
| Shi, Ling | The Hong Kong University of Science and Technology |
| Fu, Minglei | Zhejiang University of Technology |
| Zhang, Wen-An | Zhejiang University of Technology, China |
| Chen, Bo | Zhejiang University of Technology |
Keywords: Field Robots, Marine Robotics, Biomimetics
Abstract: MorphoBall is a bio-inspired, deformable spherical robot designed for multimodal locomotion across terrestrial and aquatic environments. By integrating a dual-mode drive system (spherical rolling and differential-drive) with a morphology-mediated propulsion mechanism, MorphoBall achieves adaptive mobility in diverse terrains, including flat ground, slopes, and water surfaces. A key innovation lies in the dual-function ciliary band, which provides both passive damping during terrestrial rolling and active propulsion during aquatic navigation. Model-based controllers are developed to regulate forward velocity, trajectory curvature, and roll tilt angle, demonstrating superior stability and responsiveness compared to baseline PID implementations. Experimental results validate MorphoBall's ability to autonomously navigate structured indoor environments and traverse unstructured outdoor terrains, achieving seamless mode transitions and completing missions 34% faster than single-morphology strategies. This work highlights the potential of morphology adaptation as a tool for enhancing environmental adaptability in mobile robotics.
|
| |
| 09:00-10:30, Paper TuI1I.176 | Add to My Program |
| EZ-SP: Fast and Lightweight Superpoint-Based 3D Segmentation |
|
| Geist, Louis | ENPC |
| Landrieu, Loic | ENPC |
| Robert, Damien | University of Zurich |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Superpoint-based pipelines provide an efficient alternative to point- or voxel-based 3D semantic segmentation, but are often bottlenecked by their CPU-bound partition step. We propose a learnable, fully GPU partitioning algorithm that generates geometrically and semantically coherent superpoints 13× faster than prior methods. Our module is compact (under 60k parameters), trains in under 20 minutes with a differentiable surrogate loss, and requires no handcrafted features. Combined with a lightweight superpoint classifier, the full pipeline fits in < 2 MB of VRAM, scales to multi-million-point scenes, and supports real-time inference. With 72× faster inference and 120× fewer parameters, EZ-SP matches the accuracy of point-based SOTA models across three domains: indoor scans (S3DIS), autonomous driving (KITTI-360), and aerial LiDAR (DALES). Code and pretrained models are accessible at github.com/drprojects/superpoint_transformer.
|
| |
| 09:00-10:30, Paper TuI1I.177 | Add to My Program |
| ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking |
|
| Wang, Guangming | University of Cambridge |
| Ying, Qizhen | University of Oxford |
| Jing, Yixiong | University of Cambridge |
| Wysocki, Olaf | University of Cambridge |
| Sheil, Brian | University of Cambridge |
Keywords: Assembly, Compliant Assembly, Intelligent and Flexible Manufacturing
Abstract: Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting the scalability of embodied AI and general‑purpose robots. Recent data‑driven Vision‑Language‑Action (VLA) approaches aim to learn policies from large‑scale simulation and real‑world data. However, the continuous action space of the physical world significantly exceeds the representational capacity of linguistic tokens, making it unclear if scaling data alone can yield general robotic intelligence. To address this gap, we propose ActionReasoning, an LLM-driven framework that performs explicit action reasoning to produce physics-consistent, prior-guided decisions for robotic manipulation. ActionReasoning leverages the physical priors and real-world knowledge already encoded in Large Language Models (LLMs) and structures them within a multi-agent architecture. We instantiate this framework on a tractable case study of brick stacking, where the environment states are assumed to be already accurately measured. The environmental states are then serialized and passed to a multi-agent LLM framework that generates physics-aware action plans. The experiments demonstrate that the proposed multi-agent LLM framework enables stable brick placement while shifting effort from low-level domain-specific coding to high-level tool invocation and prompting, highlighting its potential for broader generalization. This work introduces a promising approach to bridging perception and execution in robotic manipulation by integrating physical reasoning with LLMs.
|
| |
| 09:00-10:30, Paper TuI1I.178 | Add to My Program |
| DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-Time Optical Flow and Stereo Estimation |
|
| Anand, Tushar | BITS Hyderabad |
| Bora, Maheswar | Birla Institute of Technology and Science, Pilani - Hyderabad Campus |
| Dantcheva, Antitza | Inria, Sophia Antipolis, France |
| Das, Abhijit | BITS Pilani |
Keywords: Deep Learning for Visual Perception
Abstract: In this work, we propose a novel Mamba block, DenVisCoM, as well as a novel hybrid architecture tailored for accurate, real-time estimation of optical flow and disparity. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block that simultaneously addresses real-time inference, memory footprint, and accuracy for joint estimation of motion and 3D dense perception tasks. We extensively analyze the trade-off between accuracy and real-time processing on a large number of benchmark datasets. Our experimental results and related analysis suggest that the proposed model can accurately estimate optical flow and disparity in real time. All models and associated code are available at https://github.com/vimstereo/DenVisCoM.
|
| |
| 09:00-10:30, Paper TuI1I.179 | Add to My Program |
| RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI |
|
| Tai, Cong | ZTE Corporation |
| Zheng, Zhaoyu | ZTE Corporation |
| Long, Haixu | ZTE Corporation |
| Wu, Hansheng | ZTE Corporation |
| Xiang, Haodong | Tsinghua University |
| Long, Zhengbin | ZTE Corporation |
| Xiong, Jun | ZTE Corporation |
| Shi, Rong | ZTE Corporation |
| Zhang, Shizhuang | ZTE Corporation |
| Qiu, Gang | ZTE Corporation |
| Wang, He | ZTE Corporation |
| Li, Ruifeng | ZTE Corporation |
| Huang, Jun | ZTE Corporation |
| Chang, Bin | ZTE Corporation |
| Feng, Shuai | ZTE Corporation |
| Shen, Tao | ZTE Corporation |
Keywords: Software Tools for Benchmarking and Reproducibility, Deep Learning in Grasping and Manipulation, Simulation and Animation
Abstract: The emerging field of Vision-Language-Action (VLA) for humanoid robots faces several fundamental challenges, including the high cost of data acquisition, the lack of a standardized benchmark, and the significant gap between simulation and the real world. To overcome these obstacles, we propose RealMirror, a comprehensive, open-source embodied AI VLA platform. RealMirror builds an efficient, low-cost data collection, model training, and inference system that enables end-to-end VLA research without requiring a real robot. To facilitate model evolution and fair comparison, we also introduce a dedicated VLA benchmark for humanoid robots, featuring multiple scenarios, extensive trajectories, and various VLA models. Furthermore, by integrating generative models and 3D Gaussian Splatting to reconstruct realistic environments and robot models, we successfully demonstrate zero-shot Sim2Real transfer, where models trained exclusively on simulation data can perform tasks on a real robot seamlessly, without any fine-tuning. In conclusion, with the unification of these critical components, RealMirror provides a robust framework that significantly accelerates the development of VLA models for humanoid robots. Project page: https://terminators2025.github.io/RealMirror.github.io
|
| |
| 09:00-10:30, Paper TuI1I.180 | Add to My Program |
| Decentralized Triangulation Formation without Communication: A Vision Transformer Based Learning Approach |
|
| Huang, Xinchi | Stevens Institute of Technology |
| Yang, Guang | Stevens Institute of Technology |
| Guo, Yi | Stevens Institute of Technology |
Keywords: Machine Learning for Robot Control, Multi-Robot Systems, Motion and Path Planning
Abstract: Multi-robot cooperative control has been extensively studied using model-based distributed control methods. However, such control methods rely on sensing and perception modules in a sequential design pipeline, and the separation of perception and control may cause processing latency and compounding errors that affect control performance. End-to-end learning overcomes this limitation by learning directly from onboard sensing data and outputting control commands to the robots. Challenges remain in end-to-end learning for multi-robot cooperative control, and previous results are not scalable. In this paper, we propose a novel decentralized cooperative control method for multi-robot formation using deep neural networks, in which inter-robot communication is modeled by a graph neural network (GNN). Our method takes LiDAR sensor data as input, and the control policy is learned from demonstrations provided by an expert controller in a decentralized way. Although trained with a fixed number of robots, the learned control policy is scalable. Evaluation in a robot simulator demonstrates the triangulation formation behavior of multi-robot teams of varying sizes using the learned control policy.
|
| |
| 09:00-10:30, Paper TuI1I.181 | Add to My Program |
| MFCC Inspired Spectral Feature Extraction for Robust Touch Interaction in Social Robots |
|
| Kim, JiSoo | UNIST (Ulsan National Institute of Science and Technology) |
| Hwang, Sun Jun | UNIST (Ulsan National Institute of Science and Technology) |
| Kim, Hyojin | UNIST (Ulsan National Institute of Science and Technology) |
| Hwang, Dong Joon | UNIST (Ulsan National Institute of Science and Technology) |
| Lee, Hui Sung | UNIST (Ulsan National Institute of Science and Technology) |
Keywords: Force and Tactile Sensing, Embedded Systems for Robotic and Automation, Touch in HRI
Abstract: Touch is a fundamental modality for conveying emotions and intentions in Human–Robot Interaction. However, conventional approaches to touch pattern recognition often lack robustness to inter-user variability, whereas alternative solutions are frequently bulky or costly. This study proposes a novel feature extraction framework for touch pattern recognition, which adapts MFCC from speech processing to capacitive touch signals. The proposed method preserves the strengths of MFCC—dimensionality reduction and noise robustness—while addressing the physical differences between audio and touch signals by introducing a new frequency reference axis in place of the conventional Mel scale. To evaluate its effectiveness, a representative set of social touch patterns, including gestures traditionally difficult to classify, was defined and analyzed. The proposed framework ensures stable recognition across diverse users while reducing feature dimensionality for efficient operation in lightweight models. This efficiency highlights its suitability for real-time robotic interfaces.
|
| |
| 09:00-10:30, Paper TuI1I.182 | Add to My Program |
| Multi-Task Reinforcement Learning of Drone Aerobatics by Exploiting Geometric Symmetries |
|
| Guo, Zhanyu | Zhejiang University |
| Yin, Zikang | Westlake University |
| Zhu, Guobin | School of Automation Science and Electrical Engineering, Beihang University |
| Zhao, Shiyu | Westlake University |
|
|
| |
| 09:00-10:30, Paper TuI1I.183 | Add to My Program |
| VFP: Variational Flow-Matching Policy for Multi-Modal Robot Manipulation |
|
| Zhai, Xuanran | National University of Singapore |
| Zhao, Qianyou | Shanghai Jiao Tong University |
| Yu, Qiaojun | Shanghai AI Lab |
| Hao, Ce | National University of Singapore |
Keywords: Deep Learning in Grasping and Manipulation, Dual Arm Manipulation, Dexterous Manipulation
Abstract: Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and sampling efficiency in both simulated and real-world settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation, and further outperforms them on real-robot tasks, while still maintaining fast inference and a compact model size. More details are available on our project page: https://sites.google.com/view/varfp/
|
| |
| 09:00-10:30, Paper TuI1I.184 | Add to My Program |
| Joint Task Assistance Planning Via Nested Branch and Bound |
|
| Daube, Omer | Technion |
| Salzman, Oren | Technion |
Keywords: Motion and Path Planning, Task and Motion Planning
Abstract: We introduce and study the Joint Task Assistance Planning problem which generalizes prior work on optimizing assistance in robotic collaboration. In this setting, two robots operate over predefined roadmaps, each represented as a graph corresponding to its configuration space. One robot, the task robot, must execute a timed mission, while the other, the assistance robot, provides sensor-based support that depends on their spatial relationship. The objective is to compute a path for both robots that maximizes the total duration of assistance given. Solving this problem is challenging due to the combinatorial explosion of possible path combinations together with the temporal nature of the problem (time needs to be accounted for as well). To address this, we propose a nested Branch and Bound framework that efficiently explores the space of robot paths in a hierarchical manner. We empirically evaluate our algorithm and demonstrate a speedup of up to two orders of magnitude when compared to a baseline approach.
|
| |
| 09:00-10:30, Paper TuI1I.185 | Add to My Program |
| Seeing Space and Motion: Enhancing Latent Actions with Geometric and Dynamic Awareness for Vision-Language-Action Models |
|
| Cai, Zhejia | Tsinghua University |
| Yang, Yandan | Alibaba |
| Chang, Xinyuan | Alibaba Group |
| Liang, Shiyi | Xi'an Jiaotong University |
| Chen, Ronghan | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Xiong, Feng | AMAP |
| Xu, Mu | Alibaba |
| Huang, Ruqi | Tsinghua Shenzhen International Graduate School |
Keywords: AI-Based Methods, Deep Learning in Grasping and Manipulation, Deep Learning Methods
Abstract: Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are temporally distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.
|
| |
| 09:00-10:30, Paper TuI1I.186 | Add to My Program |
| Code Generation and Conic Constraints for Model-Predictive Control on Microcontrollers with Conic-TinyMPC |
|
| Mahajan, Ishaan | Columbia University |
| Nguyen, Khai | Massachusetts Institute of Technology |
| Schoedel, Samuel | Carnegie Mellon University |
| Nedumaran, Elakhya | Carnegie Mellon University |
| Mata, Moises | Columbia University |
| Plancher, Brian | Dartmouth College |
| Manchester, Zachary | Massachusetts Institute of Technology |
Keywords: Optimization and Optimal Control, Software Architecture for Robotic and Automation, Embedded Systems for Robotic and Automation
Abstract: Model-predictive control (MPC) is a state-of-the-art control method for constrained robotic systems, yet deployment on resource-limited hardware remains difficult. This challenge is magnified by expressive conic constraints, which offer greater modeling power but require significantly more computation than linear alternatives. To address this challenge, we extend recent work developing fast, structure-exploiting, cached solvers for embedded applications based on the Alternating Direction Method of Multipliers (ADMM) to provide support for second-order cones, as well as C++ code generation from Python, MATLAB, and Julia. Microcontroller benchmarks show that our solver provides up to a two-order-of-magnitude speedup, ranging from 10.6x to 142.7x, over state-of-the-art embedded solvers on QP and SOCP problems, and enables us to fit order-of-magnitude larger problems in memory. We validate our solver's deployed performance through simulation and hardware experiments, including trajectory tracking with conic constraints on a 27g Crazyflie quadrotor. Our open-source code is available at https://tinympc.org.
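The second-order-cone support described above hinges on a standard building block: inside ADMM, each slack variable is projected onto its cone. The closed-form Euclidean projection onto the second-order cone is well known and is sketched below as an illustration (this is not TinyMPC's code; the function name is invented here).

```python
import numpy as np

def project_soc(x):
    """Euclidean projection of x = (t, v) onto the second-order cone
    {(t, v) : ||v||_2 <= t}. Standard closed form used inside ADMM-based
    conic solvers; illustrative sketch only.
    """
    t, v = x[0], x[1:]
    n = np.linalg.norm(v)
    if n <= t:            # already inside the cone
        return x.copy()
    if n <= -t:           # inside the polar cone: projects to the origin
        return np.zeros_like(x)
    # Boundary case: scale (t, v) halfway onto the cone surface.
    alpha = (t + n) / 2.0
    return np.concatenate(([alpha], (alpha / n) * v))
```

For example, a thrust-cone constraint on a quadrotor can be enforced by applying this projection to the stacked (cone half-angle scaled) thrust vector at each ADMM iteration.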
|
| |
| 09:00-10:30, Paper TuI1I.187 | Add to My Program |
| Learning-Based Robust Control: Unifying Exploration and Distributional Robustness for Reliable Robotics Via Free Energy |
|
| Jesawada, Hozefa | New York University Abu Dhabi |
| Russo, Giovanni | University of Salerno |
| Swikir, Abdalla | Mohamed Bin Zayed University of Artificial Intelligence |
| Abu-Dakka, Fares | New York University Abu Dhabi |
Keywords: Reinforcement Learning, Robust/Adaptive Control, Probabilistic Inference
Abstract: A key challenge towards reliable robotic control is devising computational models that can both learn policies and guarantee robustness when deployed in the field. Inspired by the free energy principle in computational neuroscience, to address these challenges, we propose a model for policy computation that jointly learns environment dynamics and rewards, while ensuring robustness to epistemic uncertainties. Building on a distributionally robust free energy principle, we propose a modification to the maximum diffusion learning framework. After explicitly characterizing the robustness of our policies to epistemic uncertainties in both environment and reward, we validate their effectiveness on continuous-control benchmarks, via both simulations and real-world experiments involving manipulation with a Franka Research 3 arm. Across simulation and zero-shot deployment, our approach narrows the sim-to-real gap, and enables repeatable tabletop manipulation without task-specific fine-tuning.
|
| |
| 09:00-10:30, Paper TuI1I.188 | Add to My Program |
| SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction Via Plane Sweeping |
|
| Chen, Lingpeng | Chinese University of Hong Kong, Shenzhen |
| Tang, Jiakun | Chinese University of Hong Kong (Shenzhen) |
| Chui, Pui Yi | The Chinese University of Hong Kong |
| Wu, Junfeng | The Chinese University of Hong Kong, Shenzhen |
| Hong, Ziyang | Heriot-Watt University |
Keywords: Marine Robotics, Sensor Fusion, Deep Learning for Visual Perception
Abstract: Accurate 3D reconstruction in visually-degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods under challenging conditions, particularly in high turbidity. To foster further research, we publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data—the first of its kind—at https://github.com/LIAS-CUHKSZ/SonarSweep.
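The classical plane sweep step that SonarSweep adapts can be summarized in a minimal sketch (illustrative only; SonarSweep's contribution is fusing sonar into this step with learned costs, which is omitted here, and all names below are invented): a cost volume is built by comparing the reference view against the second view pre-warped onto a set of depth-hypothesis planes, and each pixel takes the lowest-cost hypothesis.

```python
import numpy as np

def plane_sweep_depth(ref, warped_stack, depths):
    """Minimal plane-sweep sketch (hypothetical, not SonarSweep's code).

    `warped_stack[d]` is the second view pre-warped onto the fronto-parallel
    plane at depths[d]; each pixel takes the depth hypothesis with the
    lowest absolute photometric cost.
    """
    # Cost volume: one matching cost per pixel per depth hypothesis.
    costs = np.stack([np.abs(ref - w) for w in warped_stack])  # (D, H, W)
    best = np.argmin(costs, axis=0)                            # (H, W)
    return depths[best]
```

In a learned variant, the per-hypothesis cost maps would instead be produced and regularized by a network before the argmin.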
|
| |
| 09:00-10:30, Paper TuI1I.189 | Add to My Program |
| Spatially-Aware Adaptive Trajectory Optimization with Controller-Guided Feedback for Autonomous Racing |
|
| Wachter, Alexander | TU Wien |
| Willert, Alexander | TU Wien |
| Ecker, Marc-Philip | TU Wien, Austrian Institute of Technology |
| Hartl-Nesic, Christian | TU Wien |
Keywords: Motion and Path Planning, Integrated Planning and Control, Autonomous Agents
Abstract: We present a closed-loop framework for autonomous raceline optimization that combines NURBS-based trajectory representation, CMA-ES global trajectory optimization, and controller-guided spatial feedback. Instead of treating tracking errors as transient disturbances, our method exploits them as informative signals of local track characteristics via a Kalman-inspired spatial update. This enables the construction of an adaptive, acceleration-based constraint map that iteratively refines trajectories toward near-optimal performance under spatially varying track and vehicle behavior. In simulation, our approach achieves a 17.38% lap time reduction compared to a controller parametrized with maximum static acceleration. On real hardware, tested with different tire compounds ranging from low to high friction, we obtain a 7.60% lap time improvement without explicitly parametrizing friction. This demonstrates robustness to changing grip conditions in real-world scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.190 | Add to My Program |
| Unified Meta-Representation and Feedback Calibration for General Disturbance Estimation |
|
| Yang, Zihan | Beihang University |
| Jia, Jindou | Beihang University |
| Wang, Meng | Beihang University |
| Liu, Yuhang | Beihang University |
| Guo, Kexin | Beihang University |
| Yu, Xiang | Beihang University |
Keywords: Machine Learning for Robot Control, Representation Learning
Abstract: Precise control in modern robotic applications remains an open problem due to unknown time-varying disturbances. Existing meta-learning-based approaches require a shared representation of environmental structures, which lacks flexibility for realistic non-structural disturbances. Moreover, representation error and the loss of model generalizability can severely degrade prediction accuracy. This work presents a generalizable disturbance estimation framework that builds on meta-learning and feedback-calibrated online adaptation. By extracting features from a finite time window of past observations, a unified representation that effectively captures general non-structural disturbances can be learned without predefined structural assumptions. The online adaptation process is subsequently calibrated by a state-feedback mechanism to attenuate the learning residual. Theoretical analysis shows that simultaneous convergence of both the online learning error and the disturbance estimation error can be achieved. Through the unified meta-representation, our framework effectively estimates multiple rapidly changing disturbances, as demonstrated by quadrotor flight experiments.
|
| |
| 09:00-10:30, Paper TuI1I.191 | Add to My Program |
| From Demonstrations to Safe Deployment: Path-Consistent Safety Filtering for Diffusion Policies |
|
| Römer, Ralf | Technical University of Munich |
| Balletshofer, Julian | Technical University of Munich |
| Thumm, Jakob | Technical University of Munich |
| Pavone, Marco | Stanford University |
| Schoellig, Angela P. | TU Munich |
| Althoff, Matthias | Technische Universität München |
Keywords: Imitation Learning, Human-Robot Collaboration, Safety in HRI
Abstract: Diffusion policies (DPs) achieve state-of-the-art performance on complex manipulation tasks by learning from large-scale demonstration datasets, often spanning multiple embodiments and environments. However, they cannot guarantee safe behavior, requiring external safety mechanisms. These, however, alter actions in ways unseen during training, causing unpredictable behavior and performance degradation. To address these problems, we propose path-consistent safety filtering (PACS) for DPs. Our approach performs path-consistent braking on a trajectory computed from the sequence of generated actions. In this way, we keep the execution consistent with the training distribution of the policy, maintaining the learned, task-completing behavior. To enable real-time deployment and handle uncertainties, we verify safety using set-based reachability analysis. Our experimental evaluation in simulation and on three challenging real-world human-robot interaction tasks shows that PACS (a) provides formal safety guarantees in dynamic environments, (b) preserves task success rates, and (c) outperforms reactive safety approaches, such as control barrier functions, by up to 68% in terms of task success. Videos are available at our project website: tum-lsy.github.io/pacs.
|
| |
| 09:00-10:30, Paper TuI1I.192 | Add to My Program |
| VISTA: Generative Visual Imagination for Vision-And-Language Navigation |
|
| Huang, Yanjia | Texas A&M University |
| Wu, Mingyang | Texas A&M University |
| Li, Renjie | Texas A&M University |
| Tu, Zhengzhong | Texas A&M University |
Keywords: Motion and Path Planning, Deep Learning Methods, Vision-Based Navigation
Abstract: Vision-and-Language Navigation (VLN) tasks agents with locating specific objects in unseen environments using natural language instructions and visual cues. Many existing VLN approaches typically follow an `observe-and-reason' schema, that is, agents observe the environment and decide on the next action to take based on the visual observations of their surroundings. They often face challenges in long-horizon scenarios due to limitations in immediate observation and vision-language modality gaps. To overcome this, we present VISTA, a novel framework that employs an `imagine-and-align' navigation strategy. Specifically, we leverage the generative prior of pre-trained diffusion models for dynamic visual imagination conditioned on both local observations and high-level language instructions. A Perceptual Alignment Filter module then grounds these goal imaginations against current observations, guiding an interpretable and structured reasoning process for action selection. Experiments show that VISTA sets new state-of-the-art results on Room-to-Room (R2R) and RoboTHOR benchmarks, e.g., +3.6% increase in Success Rate on R2R. Extensive ablation analysis underscores the value of integrating forward-looking imagination, perceptual alignment, and structured reasoning for robust navigation in long-horizon environments.
|
| |
| 09:00-10:30, Paper TuI1I.193 | Add to My Program |
| Selecting Spots by Explicitly Predicting Intention from Motion History Improves Performance in Autonomous Parking |
|
| Chung, Long Kiu | Georgia Institute of Technology |
| Isele, David | University of Pennsylvania, Honda Research Institute USA |
| Tariq, Faizan M. | Honda Research Institute USA, Inc |
| Bae, Sangjae | Honda Research Institute, USA |
| Kousik, Shreyas | Georgia Institute of Technology |
| D'sa, Jovin | Honda Research Institute, USA |
Keywords: Intelligent Transportation Systems, Human-Aware Motion Planning, Motion and Path Planning
Abstract: In many applications of social navigation, existing works have shown that predicting and reasoning about human intentions can help robotic agents make safer and more socially acceptable decisions. In this work, we study this problem for autonomous valet parking (AVP), where an autonomous vehicle ego agent must drop off its passengers, explore the parking lot, find a parking spot, negotiate for the spot with other vehicles, and park in the spot without human supervision. Specifically, we propose an AVP pipeline that selects parking spots by explicitly predicting where other agents are going to park from their motion history using learned models and probabilistic belief maps. To test this pipeline, we build a simulation environment with reactive agents and realistic modeling assumptions on the ego agent, such as occlusion-aware observations, and imperfect trajectory prediction. Simulation experiments show that our proposed method outperforms existing works that infer intentions from future predicted motion or embed them implicitly in end-to-end models, yielding better results in prediction accuracy, social acceptance, and task completion. Our key insight is that, in parking, where driving regulations are more lax, explicit intention prediction is crucial for reasoning about diverse and ambiguous long-term goals, which cannot be reliably inferred from short-term motion prediction alone, but can be effectively learned from motion history.
|
| |
| 09:00-10:30, Paper TuI1I.194 | Add to My Program |
| E^2DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation |
|
| Zhao, Kaiyan | Wuhan University |
| Zhang, Borong | University of Macau |
| Wang, Yiming | University of Macau |
| Liu, Xingyu | Wuhan University |
| Li, Xuetao | Wuhan University |
| Chen, Yuyang | Northwestern University |
| Niu, Xiaoguang | Wuhan University |
Keywords: Reinforcement Learning
Abstract: In reinforcement learning (RL) for robotic manipulation, the Decision Transformer (DT) has emerged as an effective framework for addressing long-horizon tasks. However, DT’s performance depends heavily on the coverage of collected experiences. Without an active exploration mechanism, standard DT relies on uniform replay, which leads to poor sample efficiency, limited exploration, and reduced overall effectiveness. At the same time, while excessive exploration can help avoid local optima, it often delays policy convergence and leads to degraded efficiency. To address these limitations, we propose E^2DT, a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. Our framework is experience-aware, allowing E^2DT to be both efficient, by prioritizing sampling quality (e.g., high-return, high-uncertainty, and underrepresented trajectories), and effective, by ensuring diversity across trajectory windows to preserve policy optimality. Specifically, DT’s internal latent embeddings measure diversity across trajectory windows, while quality is quantified through a composite metric that integrates return-to-go (RTG) quantiles, predictive uncertainty, and stage coverage (inverse frequency). These two dimensions are integrated into a novel quality–diversity joint kernel that prioritizes the most informative experiences, thereby enabling learning that is both efficient and effective. We evaluate E^2DT on challenging robotic manipulation benchmarks in both simulation and real-robot settings. Results show that it consistently outperforms prior methods. These findings demonstrate that coupling policy learning with experience-aware sampling provides a principled path toward robust long-horizon robotic learning.
|
| |
| 09:00-10:30, Paper TuI1I.195 | Add to My Program |
| From VO to NAO: Reactive Robot Navigation Using Velocity and Acceleration Obstacles |
|
| Stern, Asher | Ariel University |
| Ben-Yosef, Oz | Ariel University |
| Shiller, Zvi | Ariel University |
Keywords: Motion and Path Planning, Collision Avoidance
Abstract: This paper addresses the problem of robot navigation in challenging dynamic environments by extending the Velocity Obstacle (VO) framework to the Nonlinear Acceleration Obstacle (NAO). The NAO represents the set of robot accelerations that would lead to collisions with an obstacle moving along an arbitrary trajectory. By formulating the problem in the acceleration domain, the method allows direct selection of accelerations, the natural control input of second-order systems, to generate safe avoidance maneuvers in complex dynamic environments. Simulation results show that NAO enables real-time collision avoidance while explicitly accounting for both robot kinematics (velocity) and dynamics (acceleration). The proposed framework thus provides a reactive and efficient basis for autonomous navigation in complex dynamic environments.
|
| |
| 09:00-10:30, Paper TuI1I.196 | Add to My Program |
| Learning Push-Grasp Synergy for Occluded Objects in Cluttered Environments |
|
| Li, Ziang | Tsinghua University |
| Wu, Haorui | Tsinghua University |
| Chen, Zhiqi | Tsinghua University |
| Zhang, Haozhe | Tsinghua University |
| Huang, Yuzhe | Beijing University of Aeronautics and Astronautics |
| Zhang, Changshui | Tsinghua University |
Keywords: Deep Learning in Grasping and Manipulation
Abstract: Successfully executing grasping tasks within highly cluttered spaces is still a significant hurdle in robotics, especially in scenarios involving severe target occlusion. To tackle this, we present a novel self-supervised framework driven by deep reinforcement learning that enables robots to acquire push–grasp synergy for reliable manipulation under occlusions. The core contribution of this research is the target switching mechanism that dynamically selects alternative targets when the goal object is severely occluded. Furthermore, we utilize a strategy for selecting actions based on object masks to reduce the action space, thereby improving efficiency and minimizing ineffective operations. Comprehensive evaluations across both simulated and physical environments confirm that our method achieves robust grasping performance under severe or complete occlusions. Notably, the learned policy is readily transferable to physical environments and generalizes effectively to previously unseen objects.
|
| |
| 09:00-10:30, Paper TuI1I.197 | Add to My Program |
| Acoustic Peg-In-Hole Assembly with Phased Transducer Array and Microscope |
|
| Yang, Saida | University of Chinese Academy of Sciences |
| Zhong, Chengxi | ShanghaiTech University |
| Jiang, Yujie | ShanghaiTech University |
| Su, Hu | Institute of Automation, Chinese Academy of Science |
| Liu, Song | ShanghaiTech University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Nanomanufacturing
Abstract: Microassembly is becoming increasingly critical in modern smart manufacturing, placing higher demands on system performance, particularly for in situ and in vivo applications in biomedicine, photonics, sensors, and microrobotics. Non-contact mechanical microassembly has emerged as a promising solution, addressing challenges such as part contamination, limited environmental compatibility, and undesired microscopic forces. This paper presents an ultrasonic-driven non-contact microassembly system capable of performing a representative peg-in-hole assembly. The system primarily consists of an ultrasonic phased transducer array, which serves as a holographic acoustic end-effector, and a microscope that provides visual feedback. Elliptical and semi-elliptical holographic acoustic end-effectors are designed and generated by generative adversarial networks. A homogeneous transformation strategy is employed to generate a pre-planned phase-only hologram (POH) sequence, while a closed-loop control strategy dynamically adjusts the end-effector’s pose by integrating real-time visual feedback. Experimental results demonstrate that, with disturbances compensated by the closed-loop strategy, the system can stably adjust the peg’s position and orientation to achieve acceptable alignment accuracy. It successfully manipulates high-aspect-ratio objects to complete the peg-in-hole assembly in fluidic and strong magnetic environments. Moreover, the system requires no preset object position or orientation and does not alter the object’s form or structure during operation, demonstrating strong potential for broader in situ applications.
|
| |
| 09:00-10:30, Paper TuI1I.198 | Add to My Program |
| Dexterous Planar Pushing under Uncertain Object Properties: A Contact-Aware Goal-Oriented Approach |
|
| Lee, Yongseok | Pohang University of Science and Technology |
| Kim, Keehoon | POSTECH, Pohang University of Science and Technology |
Keywords: Manipulation Planning, Dexterous Manipulation, Contact Modeling
Abstract: Robotic pushing is a versatile non-prehensile manipulation skill that enables robots to handle ungraspable objects without specialized tools. This paper introduces a contact-aware, goal-oriented pushing framework that achieves dexterous and robust manipulation by explicitly allowing free motion of the end-effector. Central to our approach is the contact-aware generalized velocity–motion model (C-GVMM), which captures the relationship between pusher velocity and slider motion across all contact modes, including separation. Unlike prior methods that rely on predefined trajectories or fixed contact-mode sequences, our framework enables seamless transitions among sticking, sliding, and separating modes. Building upon C-GVMM, we employ Model Predictive Path Integral (MPPI) control to generate goal-directed actions, and UKF-based online estimation to handle uncertain object properties in real-world settings. We validate our approach through both numerical simulations and real-robot experiments, demonstrating that the framework accomplishes diverse pushing tasks with more optimal pusher and slider motions and high success rates. These results demonstrate the practical viability of the proposed approach for real-world robotic pushing tasks.
|
| |
| 09:00-10:30, Paper TuI1I.199 | Add to My Program |
| DexKnot: Generalizable Visuomotor Policy Learning for Dexterous Bag-Knotting Manipulation |
|
| Zhang, Jiayuan | Peking University |
| Wu, Ruihai | Peking University |
| Chen, Haojun | Peking University |
| Wang, Yuran | Peking University |
| Zhong, Yifan | Peking University |
| Zhang, Ceyao | Peking University |
| Yang, Yaodong | Peking University |
| Chen, Yuanpei | South China University of Technology |
Keywords: Representation Learning, Imitation Learning, Dexterous Manipulation
Abstract: Knotting plastic bags is a common task in daily life, yet it is challenging for robots due to the bags' infinite degrees of freedom and complex physical dynamics. Existing methods often struggle to generalize to unseen bag instances or deformations. To address this, we present DexKnot, a framework that combines keypoint affordance with diffusion policy to learn a generalizable bag-knotting policy. Our approach learns a shape-agnostic representation of bags from keypoint correspondence data collected through real-world manual deformation. For an unseen bag configuration, the keypoints can be identified by matching the representation to a reference. These keypoints are then provided to a diffusion transformer, which generates robot actions based on a small number of human demonstrations. DexKnot enables effective policy generalization by reducing the dimensionality of the observation space to a sparse set of keypoints. Experiments show that DexKnot achieves reliable and consistent knotting performance across a variety of previously unseen instances and deformations.
|
| |
| 09:00-10:30, Paper TuI1I.200 | Add to My Program |
| Control Barrier Corridors: From Safety Functions to Safe Goal Sets |
|
| Arslan, Omur | Eindhoven University of Technology |
| Atanasov, Nikolay | University of California, San Diego |
Keywords: Robot Safety, Integrated Planning and Control, Sensor-based Control
Abstract: Safe autonomy is a critical requirement and a key enabler for robots to operate in complex environments. Control barrier functions and safe motion corridors are two widely used but distinct safety methods, functional and geometric, respectively, for planning and control. Control barrier functions filter control inputs to limit the decay rate of safety, whereas safe motion corridors are geometrically constructed to define a local safe zone around the system state. This paper introduces a new notion of control barrier corridors, unifying these two approaches by converting control barrier functions into local safe goal regions for reference goal selection in feedback control systems. We show, with examples on fully actuated systems, kinematic unicycles, and linear output regulation systems, that individual state safety can be extended locally over control barrier corridors for convex barrier functions, provided the control convergence rate matches the barrier decay rate. Such safe control barrier corridors enable safely reachable persistent goal selection over continuously changing barrier corridors during motion, which we demonstrate for verifiably safe path following in autonomous exploration of unknown environments.
|
| |
| 09:00-10:30, Paper TuI1I.201 | Add to My Program |
| Causal Transformer-Based Online Action Recognition for High-Level Control of a Unitree Go1 Robot |
|
| Bandi, Chaitanya | Chemnitz University of Technology |
| Kitz, Kristof | Chemnitz University of Technology |
| Thomas, Ulrike | Chemnitz University of Technology |
Keywords: Human Detection and Tracking, Gesture, Posture and Facial Expressions, Deep Learning Methods
Abstract: We present a new causal transformer system combining Spatial-Attention Tokenization (SAT) with Multi-Resolution Causal Temporal Mixing (MRCTM) to perform online skeleton-based action recognition during human–robot interaction. The architecture uses SAT to generate soft tokens from human joint groups, while MRCTM performs causal convolutions and self-attention operations to detect both detailed motion patterns and extended temporal relationships. We introduce the GoHAR-12 dataset as an evaluation tool; it contains 12 gesture and posture classes recorded in human–robot interaction (HRI) settings that translate directly to high-level commands for the Unitree Go1 quadruped. The proposed model reaches 98.4% accuracy on the GoHAR-12 dataset, shows superior performance in distinguishing between actions that are quite similar in motion, and maintains strong results on public benchmarks such as NTU-RGB+D and NW-UCLA. We demonstrate how the causal transformer enables reliable real-time skeleton-based control of the Unitree Go1 robot.
|
| |
| 09:00-10:30, Paper TuI1I.202 | Add to My Program |
| Relationship-Aware Hierarchical 3D Scene Graph for Task Reasoning |
|
| Gassol Puigjaner, Albert | NTNU - Norwegian University of Science and Technology |
| Zacharia, Angelos | NTNU - Norwegian University of Science and Technology |
| Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Semantic Scene Understanding
Abstract: Representing and understanding 3D environments in a structured manner is crucial for autonomous agents to navigate and reason about their surroundings. While traditional Simultaneous Localization and Mapping (SLAM) methods generate metric reconstructions and can be extended to metric-semantic mapping, they lack a higher level of abstraction and relational reasoning. To address this gap, 3D scene graphs have emerged as a powerful representation for capturing hierarchical structures and object relationships. In this work, we propose an enhanced hierarchical 3D scene graph that integrates open-vocabulary features across multiple abstraction levels and supports object-relational reasoning. Our approach leverages a Vision Language Model (VLM) to infer semantic relationships. Notably, we introduce a task reasoning module that combines Large Language Models (LLMs) and a VLM to interpret the scene graph’s semantic and relational information, enabling agents to reason about tasks and interact with their environment more intelligently. We validate our method by deploying it on a quadruped robot in multiple environments and tasks, highlighting its ability to reason about them.
|
| |
| 09:00-10:30, Paper TuI1I.203 | Add to My Program |
| Learning-Guided Force-Feedback Model Predictive Control with Obstacle Avoidance for Robotic Deburring |
|
| Wojciechowski, Krzysztof | LAAS-CNRS |
| Gursoy, Ege | LIRMM, University of Montpellier CNRS |
| Haffemayer, Arthur | CTU, CIIRC |
| Kleff, Sebastien | Inria Center at the University of Bordeaux |
| Bonnet, Vincent | University Paul Sabatier |
| Lamiraux, Florent | CNRS |
| Mansard, Nicolas | CNRS |
Keywords: Industrial Robots, Control Architectures and Programming, Force and Tactile Sensing
Abstract: Model Predictive Control (MPC) is widely used for torque-controlled robots, but classical formulations often neglect real-time force feedback and struggle with contact-rich industrial tasks under collision constraints. Deburring in particular requires precise tool insertion, stable force regulation, and collision-free circular motions in challenging configurations, which exceeds the capability of standard MPC pipelines. We propose a framework that integrates force-feedback MPC with diffusion-based motion priors to address these challenges. The diffusion model serves as a memory of motion strategies, providing robust initialization and adaptation across multiple task instances, while MPC ensures safe execution with explicit force tracking, torque feasibility, and collision avoidance. We validate our approach on a torque-controlled manipulator performing industrial deburring tasks. Experiments demonstrate reliable tool insertion, accurate normal force tracking, and circular deburring motions even in hard-to-reach configurations and under obstacle constraints. To our knowledge, this is the first integration of diffusion motion priors with force-feedback MPC for collision-aware, contact-rich industrial tasks.
|
| |
| 09:00-10:30, Paper TuI1I.204 | Add to My Program |
| AROSpect: A ROS 2 Timing Introspection Framework |
|
| Dust, Lukas | SCAILAB AB, Mälardalen University |
| Timperley, Christopher Steven | Carnegie Mellon University |
| Gu, Rong | Mälardalen University |
Keywords: Software, Middleware and Programming Environments, Methods and Tools for Robot System Design, Control Architectures and Programming
Abstract: This paper introduces AROSpect, a framework for timing introspection and controlled experimentation for ROS 2-based applications. AROSpect enables developers to model system components using standardized templates, inject synthetic delays, and measure end-to-end latencies across message paths. Through instrumentation, the framework supports iterative refinement of timing parameters and identification of misconfigurations. A case study using a multi-agent turtlesim system demonstrates how AROSpect can guide developers to understand the effects of adapting timing parameters, contributing toward more predictable robotic systems.
|
| |
| 09:00-10:30, Paper TuI1I.205 | Add to My Program |
| HIPPO-MAT: Decentralized Task Allocation Using GraphSAGE and Multi-Agent Deep Reinforcement Learning |
|
| Ratnabala, Lavanya | Skolkovo Institute of Science and Technology |
| Peter Vimalathas, Robinroy | Intelligent Space Robotics Laboratory, Skolkovo Institute of Science and Technology |
| Fedoseev, Aleksey | Skolkovo Institute of Science and Technology |
| Tsetserukou, Dzmitry | Skolkovo Institute of Science and Technology |
|
|
| |
| 09:00-10:30, Paper TuI1I.206 | Add to My Program |
| SoftHand Model-W: A 3D-Printed, Anthropomorphic, Underactuated Robot Hand with Integrated Wrist and Carpal Tunnel |
|
| Merritt, Dhillon | University of Bristol |
| Ford, Christopher | University of Bristol |
| Li, Haoran | University of Bristol |
| Smith, Malia | Massachusetts Institute of Technology |
| Chen, Zhixing | Massachusetts Institute of Technology |
| Psomopoulou, Efi | University of Bristol |
| Lepora, Nathan | University of Bristol |
Keywords: Multifingered Hands, Grippers and Other End-Effectors
Abstract: This paper presents the SoftHand Model-W: a 3D-printed, underactuated, anthropomorphic robot hand based on the Pisa/IIT SoftHand, with an integrated antagonistic tendon mechanism and a 2 degree-of-freedom tendon-driven wrist. These four degrees of actuation provide active flexion and extension of the five fingers, and active flexion/extension and radial/ulnar deviation of the palm through the wrist, while preserving the synergistic and self-adaptive features of such SoftHands. A carpal tunnel-inspired tendon routing allows remote motor placement in the forearm, reducing distal inertia and maintaining a compact form factor. The SoftHand Model-W is mounted on a 6-axis robot arm and tested with two reorientation tasks requiring coordination between the hand and arm's pose: cube stacking and in-plane disc rotation. Results comparing task time, arm joint travel, and configuration changes with and without wrist actuation show that adding the wrist reduces compensatory and reconfiguration movements of the arm, yielding quicker task-completion times. Moreover, the wrist enables pick-and-place operations that would be impossible otherwise. Overall, the SoftHand Model-W demonstrates how proximal degrees of freedom are key to achieving versatile, human-like manipulation in real-world robotic applications, with a compact design enabling deployment in research and assistive settings.
|
| |
| 09:00-10:30, Paper TuI1I.207 | Add to My Program |
| Scan, Materialize, Simulate: A Generalizable Framework for Physically Grounded Robot Planning |
|
| Elhafsi, Amine | Stanford University |
| Morton, Daniel | Stanford University |
| Pavone, Marco | Stanford University |
Keywords: Integrated Planning and Learning, AI-Enabled Robotics, Simulation and Animation
Abstract: Autonomous robots must reason about the physical consequences of their actions to operate effectively in unstructured, real-world environments. We present Scan, Materialize, Simulate (SMS), a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. By integrating these components, SMS enables generalizable physical reasoning and object-centric planning without the need to relearn foundational physical dynamics. We empirically validate SMS in a billiards-inspired manipulation task and a challenging quadrotor landing scenario, demonstrating robust performance on both simulated domain transfer and real-world experiments. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings. Our project page with additional materials can be found at https://sites.google.com/view/scan-materialize-simulate.
|
| |
| 09:00-10:30, Paper TuI1I.208 | Add to My Program |
| Efficient Real-World Benchmarking for Practical Fine-Grained Product Identification in Retail Robotics for Picking and Stock Taking |
|
| Lindermayr, Jochen | Fraunhofer IPA |
| Jordan, Florian | Fraunhofer IPA |
| Odabasi, Cagatay | Fraunhofer IPA |
| Kraus, Werner | Fraunhofer IPA |
| Bormann, Richard | Fraunhofer IPA |
| Huber, Marco F. | University of Stuttgart |
Keywords: Inventory Management, Data Sets for Robotic Vision, Recognition
Abstract: The rapid evolution of retail robotics is set to transform in-store operations through advanced automation, spanning vision-based inventory tracking, order picking, packing, and restocking. Yet fine-grained product identification remains a bottleneck: assortments change, packaging evolves, and shelves host thousands of near-duplicates, requiring perception systems that can adapt quickly with minimal setup. This paper targets that gap with two contributions. First, we present a semi-automated, robot-assisted acquisition pipeline that records 3D scene ground truth via iterative placement, projecting it into each image, yielding dense, low-cost annotations at scale. Second, we extend IPA-3D1K with challenging real shelf scenes containing 130 near-duplicate SKUs. While scenes are not paired one-to-one, the same product set appears across synthetic and real images, enabling controlled, object-level sim/real analyses under occlusion, rearrangement, and lighting variation. Using frozen DINOv3 features, our baseline recognition pipeline allows index updates in minutes. We evaluate training-free or fast approaches (kNN and a lightweight classifier head) to assess the capabilities and limitations of this representation in fine-grained retail identification. Experiments show that on the FineGrainedOCR dataset the lightweight head improves over kNN by ~11 percentage points, narrowing the gap to fully trained models to 1.9–5.3 pp. On IPA-3D1K (1,000 SKUs), synthetic-scene retrieval is strong (Top-1 90%, Top-2 95%), while exact disambiguation among near-duplicates remains challenging. We find that confidence thresholds enable targeted triage during inference, and a neighborhood-based risk signal predicts confusion during training, indicating where specialized modules are most beneficial.
|
| |
| 09:00-10:30, Paper TuI1I.209 | Add to My Program |
| Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose Forecasting |
|
| Zhu, Guangxun | University of Glasgow |
| Liu, Xuan | University of Glasgow |
| Pugeault, Nicolas | University of Glasgow |
| Wei, Chongfeng | University of Glasgow |
| Ho, Edmond S. L. | University of Glasgow |
Keywords: Human-Aware Motion Planning, Autonomous Vehicle Navigation, Human Detection and Tracking
Abstract: Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian–vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian–vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian–vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving.
|
| |
| 09:00-10:30, Paper TuI1I.210 | Add to My Program |
| SeaViper: An Efficient Thin 2D Surface-Swimming Soft Robot |
|
| Veilleux, Elias | Princeton University |
| Cheng, Hsin | Princeton University |
| Wagner, Sigurd | Princeton University |
| Verma, Naveen | Princeton University |
| Sturm, James | Princeton University |
| Chen, Minjie | Princeton University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Materials and Design
Abstract: This paper introduces SeaViper, a soft extendable aquatic vibrating intelligent piezoelectric robot that extends previously developed land-based systems into the aquatic domain. The aquatic domain introduces new fundamental mechanisms of motion as well as new robot-platform requirements. To study these, we present the mechanical and electrical design of SeaViper and investigate the drive–frequency response of three prototype configurations, with energy efficiency as a key design consideration. The prototypes achieve a peak velocity of up to 33.2 cm/s (1.38 body lengths per second) with an estimated power of 2 W and a minimum cost of transport (CoT) of 3.9, significantly improving upon the performance of the prior land prototype. Measured thrust data combined with current-sense analysis enable estimation of useful mechanical output and end-to-end electromechanical efficiency. Velocity and CoT are benchmarked against both other robotic swimmers and aquatic animals, highlighting the general gap to biological performance. To further advance the sheet-like, untethered design, the aquatic prototype integrates a microcontroller, wireless communication, sensing, and on-board battery charging circuitry, paving the way for future bio-inspired morphologies at the air–water interface with advanced driving patterns.
|
| |
| 09:00-10:30, Paper TuI1I.211 | Add to My Program |
| Pack It In: Packing into Partially Filled Containers through Contact |
|
| Russell, David Mackenzie Charles | University of Leeds |
| Xu, Zisong | University of Nanjing |
| Roa, Maximo A. | German Aerospace Center (DLR) |
| Dogar, Mehmet R | University of Leeds |
Keywords: Manipulation Planning, Integrated Planning and Control, Industrial Robots
Abstract: The automation of warehouse operations is crucial for improving productivity and reducing human exposure to hazardous environments. One operation frequently performed in warehouses is bin-packing where items need to be placed into containers, either for delivery to a customer, or for temporary storage in the warehouse. Whilst prior bin-packing works have largely been focused on packing items into empty containers and have adopted collision-free strategies, it is often the case that containers will already be partially filled with items, often in suboptimal arrangements due to transportation about a warehouse. This paper presents a contact-aware packing approach that exploits purposeful interactions with previously placed objects to create free space and enable successful placement of new items. This is achieved by using a contact-based multi-object trajectory optimizer within a model predictive controller, integrated with a physics-aware perception system that estimates object poses even during inevitable occlusions, and a method that suggests physically-feasible locations to place the object inside the container.
|
| |
| 09:00-10:30, Paper TuI1I.212 | Add to My Program |
| Soft Vortex Gripper for Dexterous Manipulation Using Hand-Like Robots |
|
| Kojouharov, Martin | The University of Texas at Austin |
| Kang, Dong Ho | The University of Texas at Austin |
| Rowland, Drake | University of Texas at Austin |
| Mykhailyshyn, Roman | National Institute of Advanced Industrial Science and Technology (AIST) |
| Sentis, Luis | The University of Texas at Austin |
| Majewicz Fey, Ann | The University of Texas at Austin |
|
|
| |
| 09:00-10:30, Paper TuI1I.213 | Add to My Program |
| Wind-Aware Optimal Trajectory Planning for Efficient Gliding of Fixed-Wing Aerial Systems |
|
| Morando, Luca | New York University |
| Bobbili, Nishanth | University of California, Berkeley |
| Masci, Luca | New York University |
| Loianno, Giuseppe | UC Berkeley |
Keywords: Aerial Systems: Applications
Abstract: Gliding offers small fixed-wing UAVs extended endurance and silent operation but requires accurate energy management, especially under wind disturbances and obstacle constraints. Traditional Total Energy Control System (TECS) based controllers regulate the trade-off between potential and kinetic energy reactively, often requiring fine-tuning and knowledge of trim conditions. In this work, we shift the regulation to the planning level and present a nonlinear, multi-cost trajectory planner for small UAV gliders. The method generates C^3-continuous trajectories based on Bernstein polynomials, mapped into control commands through differential flatness, and re-planned online to match experimentally derived sink polar curves. A simulated netto variometer is integrated into the optimization to estimate air mass motion, constraining the glide to energy-balanced states. Consecutive gliding trajectories are linked by cruising segments computed from trajectories initialized on Dubins path-based waypoints, enabling hybrid missions that combine powered and unpowered flight. The approach is validated in CFD simulations and real-world experiments with a fixed-wing platform, showing reliable stabilization of sink rate, airspeed, and glide ratio under wind gusts and in the presence of obstacles.
|
| |
| 09:00-10:30, Paper TuI1I.214 | Add to My Program |
| Two Degree-Of-Freedom Vibratory Transport in a Grasp |
|
| Yako, Connor | Stanford University |
| Yuan, Shenli | Robotics and AI Institute |
| Salisbury, Kenneth | Stanford University |
Keywords: In-Hand Manipulation, Grippers and Other End-Effectors
Abstract: In this paper, we use asymmetric vibrations to demonstrate two degree-of-freedom (DoF) in-hand manipulation of grasped parts. The asymmetric vibrations are achieved through closed-loop position control of a moving surface, which applies a periodic stick-slip waveform to the part to be manipulated. We show analytically how two vibratory waveform parameters, the sticking acceleration and the slipping acceleration, affect average part velocity when moving against gravity. The theoretical trends are then validated using an experimental setup where the squeeze force is controlled and part motion is recorded by a high-resolution encoder. We also develop a 2-DoF vibratory surface capable of translation in one direction and rotation about the surface normal. Using two of these 2-DoF surfaces in a parallel jaw gripper configuration, we bidirectionally translate and rotate a variety of grasped parts, as well as demonstrate that the same waveform trends for translation also persist for in-plane rotation.
|
| |
| 09:00-10:30, Paper TuI1I.215 | Add to My Program |
| FUNCanon: Learning Pose-Aware Action Primitives Via Functional Object Canonicalization for Generalizable Robotic Manipulation |
|
| Xu, Hongli | TU Munich |
| Zhang, Lei | University of Hamburg |
| Hu, Xiaoyue | Technical University of Munich |
| Zhong, Boyang | Technical University of Munich |
| Bai, Kaixin | University of Hamburg |
| Marton, Zoltan-Csaba | Agile Robots SE |
| Bing, Zhenshan | Technical University of Munich |
| Chen, Zhaopeng | University of Hamburg |
| Knoll, Alois | Tech. Univ. Muenchen TUM |
| Zhang, Jianwei | University of Hamburg |
Keywords: Imitation Learning, Visual Learning, Transfer Learning
Abstract: Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision–language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim-to-real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon-anonymous.
|
| |
| 09:00-10:30, Paper TuI1I.216 | Add to My Program |
| A Convex Formulation of Compliant Contact between Filaments and Rigid Bodies |
|
| Li, Wei-Chen | Georgia Institute of Technology |
| Chou, Glen | Georgia Institute of Technology |
Keywords: Simulation and Animation, Contact Modeling, Modeling, Control, and Learning for Soft Robots
Abstract: We present a computational framework for simulating filaments interacting with rigid bodies through contact. Filaments are challenging to simulate due to their codimensionality, i.e., they are one-dimensional structures embedded in three-dimensional space. Existing methods often assume that filaments remain permanently attached to rigid bodies. Our framework unifies discrete elastic rod (DER) modeling, a pressure field patch contact model, and a convex contact formulation to accurately simulate frictional interactions between slender filaments and rigid bodies, capabilities not previously achievable. Owing to the convex formulation of contact, each time step can be solved to global optimality, guaranteeing complementarity between contact velocity and impulse. We validate the framework by assessing the accuracy of frictional forces and comparing its physical fidelity against baseline methods. Finally, we demonstrate its applicability in both soft robotics, such as a stochastic filament-based gripper, and deformable object manipulation, such as shoelace tying, providing a versatile simulator for systems involving complex filament–filament and filament–rigid-body interactions.
|
| |
| 09:00-10:30, Paper TuI1I.217 | Add to My Program |
| Motion Planning with Precedence Specifications Via Augmented Graphs of Convex Sets |
|
| You, Shilin | University of Texas at Dallas |
| Luna, Gael | University of Texas at Dallas |
| Shaikh, Juned | University of Texas at Dallas |
| Gostin, David | University of Texas at Dallas |
| Xiang, Yu | University of Texas at Dallas |
| Koeln, Justin | University of Texas at Dallas |
| Summers, Tyler | University of Texas at Dallas |
Keywords: Task and Motion Planning, Motion and Path Planning
Abstract: We present an algorithm for planning trajectories that avoid obstacles and satisfy key-door precedence specifications expressed with a fragment of signal temporal logic. Our method includes a novel exact convex partitioning of the obstacle-free space that encodes connectivity among convex free space sets, key sets, and door sets. We then construct an augmented graph of convex sets that exactly encodes the key-door precedence specifications. By solving a shortest path problem in this augmented graph of convex sets, our pipeline provides an exact solution up to a finite parameterization of the trajectory. To illustrate the effectiveness of our approach, we present a method to generate key-door mazes that provide challenging problem instances, and we perform numerical experiments to evaluate the proposed pipeline. Our pipeline is faster by several orders of magnitude than recent state-of-the-art methods that use general-purpose temporal logic tools.
|
| |
| 09:00-10:30, Paper TuI1I.218 | Add to My Program |
| Robot Crash Course: Learning Soft and Stylized Falling |
|
| Strauch, Pascal | Disney Research |
| Müller, David | Disney Research |
| Christen, Sammy | Disney Research |
| Serifi, Agon | Disney Research |
| Grandia, Ruben | Disney Research |
| Knoop, Espen | Disney Research |
| Bächer, Moritz | Disney Research |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Despite recent advances in robust locomotion, bipedal robots operating in the real world remain at risk of falling. While most research focuses on preventing such events, we instead concentrate on the phenomenon of falling itself. Specifically, we aim to reduce physical damage to the robot while providing users with control over the robot's end pose. To this end, we propose a robot-agnostic reward function that balances the achievement of a desired end pose with impact minimization and the protection of critical robot parts during reinforcement learning. To make the policy robust to a broad range of initial falling conditions, and to enable the specification of an arbitrary and unseen end pose at inference time, we introduce a simulation-based sampling strategy of initial and end poses. Through simulated and real-world experiments, our work demonstrates that even bipedal robots can perform controlled, soft falls.
|
| |
| 09:00-10:30, Paper TuI1I.219 | Add to My Program |
| Touch2Insert: Zero-Shot Peg Insertion by Touching Intersections of Peg and Hole |
|
| Yajima, Masaru | Institute of Science Tokyo |
| Shin, Yuma | Institute of Science Tokyo |
| Kawakami, Rei | Tokyo Institute of Technology |
| Kanezaki, Asako | Tokyo Institute of Technology |
| Ota, Kei | AI Robot Association |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Sensor-based Control
Abstract: Reliable insertion of industrial connectors remains a central challenge in robotics, requiring sub-millimeter precision under uncertainty and often without full visual access. Vision-based approaches struggle with occlusion and limited generalization, while learning-based policies frequently fail to transfer to unseen geometries. To address these limitations, we leverage tactile sensing, which captures local surface geometry at the point of contact and thus provides reliable information even under occlusion and across novel connector shapes. Building on this capability, we present Touch2Insert, a tactile-based framework for arbitrary peg insertion. Our method reconstructs cross-sectional geometry from high-resolution tactile images and estimates the relative pose of the hole with respect to the peg in a zero-shot manner. By aligning reconstructed shapes through registration, the framework enables insertion from a single contact without task-specific training. To evaluate its performance, we conducted experiments with three diverse connectors in both simulation and real-robot settings. The results indicate that Touch2Insert achieved sub-millimeter pose estimation accuracy for all connectors in simulation, and attained an average success rate of 86.8% on the real robot, thereby confirming the robustness and generalizability of tactile sensing for real-world robotic connector insertion.
|
| |
| 09:00-10:30, Paper TuI1I.220 | Add to My Program |
| Robust Multimodal Dynamic Object Segmentation |
|
| Xin, Zhe | Meituan |
| Chang, Hanzhi | Meituan Inc |
| Huang, Penghui | Meituan |
| Mao, Yinian | Meituan-Dianping Group |
| Huang, Guoquan (Paul) | University of Delaware |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Mapping
Abstract: Dynamic object segmentation plays a critical role in many visual applications such as static scene reconstruction from dynamic videos. However, existing optical flow-based methods fail to ensure consistent static/dynamic segmentation along object boundaries, while 3D reconstruction-based approaches are highly sensitive to reconstruction errors. To address these limitations, we present a dynamic object segmentation framework that can generate both precise and complete dynamic masks by integrating multimodal cues including 2D point tracks, 3D reconstruction, and semantic information. We design a network combining Transformer architectures with feature clustering aggregation modules to perform static/dynamic classification of multimodal feature trajectories. It enables the model to adaptively determine which type of feature should dominate based on the characteristics of each scene, while also mitigating the impact of feature degradation. Additionally, we introduce a novel point-query-based SAM post-processing method capable of handling multiple objects within a single mask. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in both dynamic object segmentation and static scene reconstruction tasks.
|
| |
| 09:00-10:30, Paper TuI1I.221 | Add to My Program |
| Multifingered Force-Aware Control for Humanoid Robots |
|
| Marra, Pasquale | Italian Institute of Technology |
| Caddeo, Gabriele Mario | Istituto Italiano Di Tecnologia |
| Pattacini, Ugo | Istituto Italiano Di Tecnologia |
| Natale, Lorenzo | Istituto Italiano Di Tecnologia |
Keywords: Sensor-based Control, Multifingered Hands, Force and Tactile Sensing
Abstract: In this paper, we address force-aware control and force distribution in robotic platforms with multi-fingered hands. Given a target goal and force estimates from tactile sensors, we design a controller that adapts the motion of the torso, arm, wrist, and fingers, redistributing forces to maintain stable contact with objects of varying mass distribution or unstable contacts. To estimate forces, we collect a dataset of tactile signals and ground-truth measurements using five Xela magnetic sensors interacting with indenters, and train force estimators. We then introduce a model-based control scheme that minimizes the distance between the Center of Pressure (CoP) and the centroid of the fingertip contact polygon. Since our method relies on estimated forces rather than raw tactile signals, it has the potential to be applied to any sensor capable of force estimation. We validate our framework on a balancing task with five objects, achieving an 82.7% success rate, and further evaluate it in multi-object scenarios, achieving 80% accuracy.
|
| |
| 09:00-10:30, Paper TuI1I.222 | Add to My Program |
| SHaRe-RL: Structured, Interactive Reinforcement Learning for Contact-Rich Industrial Assembly Tasks |
|
| Stranghöner, Jannick | Universität Bielefeld |
| Hartmann, Philipp | Bielefeld University |
| Weigelt, Lisa-Marie | Bielefeld University |
| Braun, Marco | Bielefeld University |
| Wrede, Sebastian | Bielefeld University |
| Neumann, Klaus | Bielefeld University / Fraunhofer IOSB-INA |
Keywords: Human-Centered Automation, Reinforcement Learning, Compliant Assembly
Abstract: High-mix low-volume (HMLV) industrial assembly, common in small and medium-sized enterprises (SMEs), requires the same precision, safety, and reliability as high-volume automation while remaining flexible to product variation and environmental uncertainty. Current robotic systems struggle to meet these demands. Manual programming is brittle and costly to adapt, while learning-based methods suffer from poor sample efficiency and unsafe exploration in contact-rich tasks. To address this, we present SHaRe-RL, a reinforcement learning framework that leverages multiple sources of prior knowledge. By (i) structuring skills into manipulation primitives, (ii) incorporating human demonstrations and online corrections, and (iii) bounding interaction forces with per-axis compliance, SHaRe-RL enables efficient and safe online learning for long-horizon, contact-rich industrial assembly tasks. Experiments on the insertion of industrial Harting connector modules with 0.2–0.4 mm clearance show reliable learning within a practical wall-clock budget and improved performance over an unstructured human-in-the-loop RL baseline. We further show that the learned policy generalizes to previously unseen connector variants. Overall, our results show that process expertise alone can effectively guide real-world RL, making deployment safer, more robust, and economically viable for industrial assembly.
|
| |
| 09:00-10:30, Paper TuI1I.223 | Add to My Program |
| Periodic Steady-State Control of a Handkerchief-Spinning Task Using a Parallel Anti-Parallelogram Tendon-Driven Wrist |
|
| Liu, Lei | Tsinghua University |
| Zhang, Haonan | Beihang University |
| Xu, HuaHang | Tsinghua University |
| Zhang, Zefan | Tsinghua University |
| Chang, Lulu | Nanjing University of Science and Technology |
| Lv, Lei | Tongji University |
| McIntosh, Andrew Ross | Tsinghua University |
| Sun, Kai | Tsinghua university |
| Bing, Zhenshan | Technical University of Munich |
| Dong, Jiahong | Tsinghua University Affiliated Beijing Tsinghua Changgung Hospital: Beijing Tsinghua Changgung Hospital |
| Sun, Fuchun | Tsinghua University |
|
|
| |
| 09:00-10:30, Paper TuI1I.224 | Add to My Program |
| SplatCtrl: Perception–Action Coupling Via Gaussian Scene Representations and Reactive Robot Control |
|
| Jain, Siddarth | Mitsubishi Electric Research Laboratories (MERL) |
| Choi, Ho Jin | University of Pennsylvania |
Keywords: Reactive and Sensor-Based Planning, RGB-D Perception, Safety in HRI
Abstract: Robotic manipulators excel in structured environments but face substantial challenges in unstructured and dynamic settings. This paper presents SplatCtrl, a unified framework for real-time scene reconstruction and reactive robot motion generation to enable collision-free robotic arm control in previously unseen and continuously changing environments. Building on 3D Gaussian Splatting (3D-GS), we introduce a hybrid voxel-based filtering and dynamic Gaussian relocation strategy that supports efficient scene reconstruction from RGBD streams while accommodating environmental changes. For safe and reactive control, we further propose a method for deriving continuous signed distance functions from isotropic Gaussians, providing stable and differentiable collision probability estimates that bridge classical distance fields with the modern implicit representation. These continuous distance metrics are incorporated into control barrier functions, resulting in a unified perception–action coupling framework that supports smooth and reliable real-time motion generation in response to scene changes. Experimental validation in simulation, on a physical robot, and within a shared human–robot workspace demonstrates the framework's effectiveness, achieving integrated scene reconstruction and reactive control in uncertain and dynamic environments.
|
| |
| 09:00-10:30, Paper TuI1I.225 | Add to My Program |
| Coupled Particle Filters for Robust Affordance Estimation |
|
| Lowin, Patrick | Technische Universität Berlin |
| Mengers, Vito | Technische Universität Berlin |
| Brock, Oliver | Technische Universität Berlin |
Keywords: Sensor Fusion, Perception for Grasping and Manipulation, RGB-D Perception
Abstract: Robotic affordance estimation is challenging due to visual, geometric, and semantic ambiguities in sensory input. We propose a method that disambiguates these signals using two coupled recursive estimators for sub-aspects of affordances: graspable and movable regions. Each estimator encodes property-specific regularities to reduce uncertainty, while their coupling enables bidirectional information exchange that focuses attention on regions where both agree, i.e., affordances. Evaluated on a real-world dataset, our method outperforms three recent affordance estimators (Where2Act, Hands-as-Probes, and HRP) by 308%, 245%, and 257% in precision, and remains robust under challenging conditions such as low light or cluttered environments. Furthermore, our method achieves a 70% success rate in our real-world evaluation. These results demonstrate that coupling complementary estimators yields precise, robust, and embodiment-appropriate affordance predictions.
|
| |
| 09:00-10:30, Paper TuI1I.226 | Add to My Program |
| Stein Variational Ergodic Surface Coverage with SE(3) Constraints |
|
| Li, Jiayun | TU Darmstadt |
| Jin, Yufeng | TU Darmstadt |
| Teng, Sangli | University of California, Berkeley |
| Gong, Dejian | TU Darmstadt |
| Chalvatzaki, Georgia | TU Darmstadt |
Keywords: Constrained Motion Planning, Computational Geometry, Motion and Path Planning
Abstract: Surface manipulation tasks require robots to generate trajectories that comprehensively cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods demonstrate success in coverage tasks, while struggling with point-cloud targets due to the nonconvex optimization landscapes and the inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein Variational Gradient Descent (SVGD) approach for SAO ergodic trajectory generation. Our proposed approach comprises multiple innovations. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates, and, third, we develop a preconditioner to accelerate TO convergence. Our sampling-based framework consistently identifies superior local optima compared to strong optimization-based and SAO baselines while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and robotic surface drawing tasks demonstrate that our method achieves superior coverage quality with tractable computation in our setting relative to existing TO and SAO approaches, and is validated in real-world robot experiments.
|
| |
| 09:00-10:30, Paper TuI1I.227 | Add to My Program |
| Reaction Templates: A Formal Approach to Realize Reactivity in Task and Motion Planning-Based Action Execution |
|
| Köpken, Anne | German Aerospace Center (DLR) |
| Bauer, Adrian Simon | German Aerospace Center (DLR) |
| Batti, Nesrine | German Aerospace Center (DLR) |
| Leidner, Daniel | German Aerospace Center (DLR) |
Keywords: Failure Detection and Recovery, Service Robotics, Reactive and Sensor-Based Planning
Abstract: Recent advancements in artificial intelligence have broadened the spectrum of tasks that robots can effectively tackle. However, the seamless execution of prolonged action sequences continues to pose a considerable challenge, attributed to limitations in the abilities of today's robots to react to unforeseen situations or failures. In response, we introduce Reaction Templates (RTs), a formal approach for integrating reactivity into task and motion planning. Operating concurrently with the primary execution logic, RTs enable a clear differentiation between planned actions and the necessary recovery strategies for handling unexpected events. This design promotes scalability by establishing reusable building blocks and customizable parameters, thereby enhancing flexibility in application. We provide a thorough introduction to the RT concept, elucidating its principles, mechanisms, and the rationale behind its design decisions. The resulting benefits of the approach are demonstrated through experimental validation with the humanoid robot Rollin’ Justin.
|
| |
| 09:00-10:30, Paper TuI1I.228 | Add to My Program |
| ForecastOcc: Vision-Based Semantic Occupancy Forecasting |
|
| Mohan, Riya | University of Freiburg |
| Hurtado, Juana Valeria | University of Freiburg |
| Mohan, Rohit | University of Freiburg |
| Valada, Abhinav | University of Freiburg |
Keywords: Deep Learning for Visual Perception, Semantic Scene Understanding, Computer Vision for Transportation
Abstract: Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.
|
| |
| 09:00-10:30, Paper TuI1I.229 | Add to My Program |
| Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation |
|
| Helmut, Erik | Technische Universität Darmstadt |
| Funk, Niklas Wilhelm | TU Darmstadt |
| Schneider, Tim | Technical University Darmstadt |
| de Farias, Cristiana | TU Darmstadt |
| Peters, Jan | Technische Universität Darmstadt |
Keywords: Imitation Learning, Deep Learning Methods, Force and Tactile Sensing
Abstract: Contact-rich manipulation depends on applying the correct grasp forces throughout the manipulation task, especially when handling fragile or deformable objects. Most existing imitation learning approaches often treat visuotactile feedback only as an additional observation, leaving applied forces as an uncontrolled consequence of gripper commands. In this work, we present Force-Aware Robotic Manipulation (FARM), an imitation learning framework that integrates high-dimensional tactile data to infer tactile-conditioned force signals, which in turn define a matching force-based action space. We collect human demonstrations using a modified version of the hand-held Universal Manipulation Interface (UMI) gripper that integrates a GelSight Mini visual tactile sensor. For deploying the learned policies, we developed an actuated variant of the UMI gripper with geometry matching our hand-held version. During policy rollouts, the proposed FARM diffusion policy jointly predicts robot pose, grip width, and grip force. FARM outperforms several baselines across three tasks with distinct force requirements—high-force, low-force, and dynamic force adaptation—demonstrating the advantages of its two key components: leveraging force-grounded, high-dimensional tactile observations and a force-based control space. The codebase and design files are open-sourced and available at https://tactile-farm.github.io.
|
| |
| 09:00-10:30, Paper TuI1I.230 | Add to My Program |
| Underactuated Multimodal Jumping Robot for Extraterrestrial Exploration |
|
| Wagner, Neil R. | University of Illinois Urbana-Champaign |
| Yim, Justin K. | University of Illinois Urbana-Champaign |
Keywords: Legged Robots, Space Robotics and Automation, Underactuated Robots
Abstract: We present a rolling and jumping underactuated monopedal robot designed to explore multimodal locomotion on low-gravity bodies. It uses only two reaction wheels to control its spatial orientation with two controllers: a balancing controller which can aim the robot’s jump direction on the ground, and an aerial reorientation controller which can aim the robot’s leg for landing after flight. We demonstrate rolling, targeted jumping and landing, and self-righting using only three actuators total, keeping system size to 0.33 m and 1.25 kg. Simple switching between locomotion modes enables the system to deal with differing landscapes and environmental conditions.
|
| |
| 09:00-10:30, Paper TuI1I.231 | Add to My Program |
| IDfRA: Self-Verification for Iterative Design in Robotic Assembly |
|
| Khendry, Nishka | University of Cambridge |
| Margadji, Christos | University of Cambridge |
| Pattinson, Sebastian William | University of Cambridge |
Keywords: Intelligent and Flexible Manufacturing, Assembly, Planning, Scheduling and Coordination
Abstract: Design for Robotic Assembly (DfRA) remains largely dependent on manual planning and heuristic simulation, limiting scalability and robustness in complex industrial settings. Although large language models (LLMs) show promise for semantic reasoning and task planning, most approaches remain tightly coupled to pre-built simulators that assume an accurate world model. We introduce Iterative Design for Robotic Assembly (IDfRA), a closed-loop framework that combines an LLM for plan generation with a vision–language model (VLM) for execution assessment. Given a target structure and a partial environmental signature, the LLM proposes an assembly plan, the robot executes it once at test time, and the VLM evaluates the resulting state to provide feedback for replanning. Through this iterative planning–execution–verification loop, the system progressively improves semantic fidelity and physical feasibility. Crucially, IDfRA does not require an accurate a priori world model before deployment. Instead, physical constraints are discovered online through interaction, enabling adaptation to under-specified environments. Empirical evaluation demonstrates that IDfRA attains 73.3% top-1 accuracy in semantic recognisability, surpassing the baseline on this metric. Moreover, the resulting assembly plans exhibit robust physical feasibility, achieving an overall 86.9% construction success rate, with design quality improving across iterations, albeit not always monotonically. Pairwise human evaluation further corroborates the advantages of IDfRA relative to alternative approaches. By integrating self-verification with context-aware adaptation, the framework evidences strong potential for deployment in unstructured manufacturing scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.232 | Add to My Program |
| Learning Unified Probabilistic Spatial Relation Representation from Visual Demonstrations |
|
| Hannuschka, Paul Emil | Karlsruhe Institute of Technology |
| Gao, Jianfeng | Karlsruhe Institute of Technology (KIT) |
| Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Keywords: Semantic Scene Understanding, Learning Categories and Concepts, Representation Learning
Abstract: The ability to interpret and reason about spatial relations is fundamental for robotic manipulation tasks. For instance, a robot must understand that "inside" requires different geometric constraints than "touching", and "closer" involves dynamic changes in distance relationships. Despite progress in modeling spatial relations, existing approaches face two critical limitations: they either oversimplify object geometry to points or bounding boxes, or they lack generative capabilities for synthesizing new spatial configurations. This paper introduces a novel generative and probabilistic model that jointly encodes object sizes, distances, and orientations within a unified representation, which captures distance-based, directional, and topological spatial relations while providing explicit uncertainty quantification. The model learns both static and dynamic semantic spatial relations from one or a few visual demonstrations and generalizes to novel contexts and configurations. We evaluate our approach across a set of spatial reasoning and robot manipulation tasks, demonstrating the model's robust performance with varied object shapes, sizes, and spatial arrangements. Videos and source code are available at https://sites.google.com/view/spatial-relations.
|
| |
| 09:00-10:30, Paper TuI1I.233 | Add to My Program |
| Adaptor: Advancing Assistive Teleoperation with Few-Shot Learning and Cross-Operator Generalization |
|
| Liu, Yu | Jilin University |
| Yin, Yihang | IO Intelligence |
| Huang, Tianlv | Jilin University |
| Yan, Fei | Jilin University |
| Xu, Yuan | Jilin University |
| Hong, Weinan | Jilin University |
| Han, Wei | Jilin University |
| Cao, Yue | Columbia University |
| Chen, Xiangyu | IO-AI.tech |
| Fan, Zipei | Jilin University |
| Song, Xuan | Jilin University |
Keywords: Telerobotics and Teleoperation, Intention Recognition, Human-Robot Collaboration
Abstract: Assistive teleoperation enhances efficiency via shared control, yet inter-operator variability, stemming from diverse habits and expertise, induces highly heterogeneous trajectory distributions that undermine intent recognition stability. We present Adaptor, a few-shot framework for robust cross-operator intent recognition. The Adaptor bridges the domain gap through two stages: (i) preprocessing, which models intent uncertainty by synthesizing trajectory perturbations via noise injection and performs geometry-aware keyframe extraction; and (ii) policy learning, which encodes the processed trajectories with an Intention Expert and fuses them with the pre-trained vision–language model context to condition an Action Expert for action generation. Experiments on real-world and simulated benchmarks demonstrate that Adaptor achieves state-of-the-art performance, improving success rates and efficiency over baselines. Moreover, the method exhibits low variance across operators with varying expertise, demonstrating robust cross-operator generalization.
|
| |
| 09:00-10:30, Paper TuI1I.234 | Add to My Program |
| LISN: Language-Instructed Social Navigation with VLM-Based Controller Modulating |
|
| Chen, Junting | National University of Singapore |
| Li, Yunchuan | National University of Singapore |
| Jiang, Panfeng | ShanghaiTech University |
| Du, Jiacheng | National University of Singapore |
| Chen, Zixuan | Nanjing University |
| Tie, Chenrui | National University of Singapore |
| Deng, Jiajun | University of Adelaide |
| Shao, Lin | National University of Singapore |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, AI-Enabled Robotics
Abstract: Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies on this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast–slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which surpasses the most competitive baseline by 63%, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
|
| |
| 09:00-10:30, Paper TuI1I.235 | Add to My Program |
| MetricNet: Recovering Metric Scale in Generative Navigation Policies |
|
| Nayak, Abhijeet Kishore | University of Technology Nuremberg |
| Oliveira Makowski, Débora | University of Technology Nuremberg |
| Gode, Samiran | University of Technology Nuremberg |
| Schmid, Cordelia | Inria |
| Burgard, Wolfram | University of Technology Nuremberg |
Keywords: Machine Learning for Robot Control, Visual Learning, Vision-Based Navigation
Abstract: Generative navigation policies have made rapid progress in improving end-to-end learned navigation. Despite their promising results, this paradigm has two structural problems. First, the sampled trajectories exist in an abstract, unscaled space without metric grounding. Second, the control strategy discards the full path, instead moving directly towards a single waypoint. This leads to short-sighted and unsafe actions, moving the robot towards obstacles that a complete and correctly scaled path would circumvent. To address these issues, we propose MetricNet, an effective add-on for generative navigation that predicts the metric distance between waypoints, grounding policy outputs in metric coordinates. We evaluate our method in simulation with a new benchmarking framework and show that executing MetricNet-scaled waypoints significantly improves both navigation and exploration performance. Beyond simulation, we further validate our approach in real-world experiments. Finally, we propose MetricNav, which integrates MetricNet into a navigation policy to guide the robot away from obstacles while still moving towards the goal.
|
| |
| 09:00-10:30, Paper TuI1I.236 | Add to My Program |
| Learning Agile Striker Skills for Humanoid Soccer Robots from Noisy Sensory Input |
|
| Xu, Zifan | The University of Texas at Austin |
| Seo, Myoungkyu | The University of Texas at Austin |
| Lee, Dongmyeong | The University of Texas at Austin |
| Fu, Hao | The University of Texas at Austin |
| Hu, Jiaheng | The University of Texas at Austin |
| Cui, Jiaxun | The University of Texas at Austin |
| Jiang, Yuqian | The University of Texas at Austin |
| Wang, Zhihan | The University of Texas at Austin |
| Brund, Anastasiia | The University of Texas at Austin |
| Biswas, Joydeep | The University of Texas at Austin |
| Stone, Peter | The University of Texas at Austin |
Keywords: Humanoid and Bipedal Locomotion, Sensorimotor Learning, Reinforcement Learning
Abstract: Learning fast and robust ball-kicking skills is a critical capability for humanoid soccer robots, yet it remains a challenging problem due to the need for rapid leg swings, postural stability on a single support foot, and robustness under noisy sensory input and external perturbations (e.g., opponents). This paper presents a reinforcement learning (RL)–based training pipeline that enables humanoid robots to execute robust continual ball-kicking with adaptability to different ball-goal configurations. The pipeline extends a typical teacher-student training framework--in which a teacher policy is trained with ground truth state information and the student learns to mimic it with noisy, imperfect sensing--by including four training stages: (1) long-distance ball chasing (teacher); (2) directional kicking (teacher); (3) teacher policy distillation (student), and (4) student adaptation and refinement (student). Key design elements--including tailored reward functions, realistic noise modeling, and online constrained RL for adaptation and refinement--are critical for closing the sim-to-real gap and sustaining performance under perceptual uncertainty. Extensive evaluations in both simulation and on a real robot demonstrate strong kicking accuracy and goal-scoring success across diverse ball–goal configurations. Ablation studies further highlight the necessity of the constrained RL, noise modeling, and the adaptation stage. This work presents a training pipeline for robust continual humanoid ball-kicking under imperfect perception, establishing a benchmark task for visuomotor skill learning in humanoid whole-body control.
|
| |
| 09:00-10:30, Paper TuI1I.237 | Add to My Program |
| Programmable Assembly and Cooperative Manipulation of Heterogeneous Microspheres Via Optoelectronic Tweezers |
|
| Niu, Wenyan | Beihang University |
| Wang, Ao | Beihang University |
| Huang, Shunxiao | Beihang University |
| Ye, Jingwen | Beihang University |
| Chen, Zaiyang | The Chinese University of Hong Kong |
| Li, Chan | Beihang University |
| Sun, Hongyan | Beihang University |
| Zeng, Zijin | Beihang University |
| Guo, Yingjian | Beihang University |
| Feng, Lin | Beihang University |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales
Abstract: The programmable assembly and actuation of micro- and nanostructures remain key challenges in the development of micro-robotics. This work presents a programmable assembly and cooperative actuation strategy for heterogeneous microspheres based on optoelectronic tweezers (OET). By employing Ag-PS microspheres as actuators and PS microspheres as payloads, we constructed stable “actuator–payload” units and investigated their frequency response and dynamic characteristics. The proposed method enables controlled assembly into core–satellite and satellite–core configurations with tunable coordination angles. Furthermore, the cooperative effect of the dual actuating units was revealed, enabling the composite system to maintain a continuous and precise circular trajectory following a ring-shaped light pattern. In addition, the modular assembly strategy was used to construct chain-like structures exceeding 172 μm in length, thereby confirming the approach's scalability. This work expands the application of OET from particle transport to modular microstructure construction and multi-actuator cooperative control, offering new opportunities for designing microrobotic systems and their biomedical applications.
|
| |
| 09:00-10:30, Paper TuI1I.238 | Add to My Program |
| Predictive Local Planning with Multi-Step Reward and Q-Value Forecasting |
|
| Du, Yuhan | Zhejiang University |
| Cui, Yuxiang | Zhejiang University |
| Peng, Yulin | Zhejiang University |
| Pan, Yiyuan | Zhejiang University |
| Cai, Tianhao | Zhejiang University |
| Wang, Yue | Zhejiang University |
| Xiong, Rong | Zhejiang University |
Keywords: Motion and Path Planning, Planning under Uncertainty, Reinforcement Learning
Abstract: Planning in dynamic environments often relies on explicit future observation prediction or value-based estimation, both of which can be brittle or hard to generalize in uncertain settings. We propose a novel model-based reinforcement learning framework that performs trajectory rollout and optimization entirely in a learned latent space. Instead of predicting future observations explicitly, our method evaluates candidate trajectories through multi-step reward prediction and terminal Q-value estimation in the latent domain, enabling robust and generalizable planning in dynamic environments. A policy model generates an initial trajectory in latent space, which is then refined via a smoothness-regularized optimization using Model Predictive Path Integral (MPPI), guided by the predicted cumulative reward and Q-values. This avoids the complexity of future state reconstruction while ensuring dynamically feasible execution. To enhance the model's deployment performance in crowded or interactive scenarios, we further introduce a lightweight social reward that penalizes unsafe overtaking and encourages yielding behavior. Experiments in both simulation and real-world environments show improved success rate, efficiency, and social acceptability compared to strong baselines.
|
| |
| 09:00-10:30, Paper TuI1I.239 | Add to My Program |
| AVR: Active Vision-Driven Precise Robot Manipulation with Viewpoint and Focal Length Optimization |
|
| Liu, Yushan | Tsinghua University |
| Mu, Shilong | Xspark AI |
| Chao, Xintao | Tsinghua University |
| Li, Zizhen | National University of Singapore |
| Mu, Yao | Shanghai Jiao Tong University |
| Chen, Tianxing | The University of Hong Kong |
| Li, Shoujie | Nanyang Technological University |
| Lyu, Chuqiao | Tsinghua University |
| Zhang, Xiao-Ping | Tsinghua University |
| Ding, Wenbo | Tsinghua University |
Keywords: Imitation Learning, Telerobotics and Teleoperation, Learning from Demonstration
Abstract: Robotic manipulation in complex scenes demands precise perception of task-relevant details, yet fixed or suboptimal viewpoints often impair fine-grained perception and induce occlusions, constraining imitation-learned policies. We present AVR (Active Vision-driven Robotics), a bimanual teleoperation and learning framework that unifies head-tracked viewpoint control (HMD-to-2-DoF gimbal) with motorized optical zoom to keep targets centered at an appropriate scale during data collection and deployment. In simulation, an AVR plugin augments RoboTwin demonstrations by emulating active vision (ROI-conditioned viewpoint change, aspect-ratio-preserving crops with explicit zoom ratios, and super-resolution), yielding 5–17% gains in task success across diverse manipulations. On our real-world platform, AVR improves success on most tasks, with over 25% gains compared to the static-view baseline, and extended studies further demonstrate robustness under occlusion, clutter, and lighting disturbances, as well as generalization to unseen environments and objects. These results pave the way for future robotic precision manipulation methods in the pursuit of human-level dexterity and precision.
|
| |
| 09:00-10:30, Paper TuI1I.240 | Add to My Program |
| Human Motion Intent Inferencing in Teleoperation through a SINDy Paradigm |
|
| Bowman, Michael | University of Pennsylvania |
| Zhang, Xiaoli | Colorado School of Mines |
Keywords: Telerobotics and Teleoperation, Intention Recognition, Human-Robot Collaboration
Abstract: Intent inferencing in teleoperation has been instrumental in aligning operator goals and coordinating actions with robotic partners. However, current intent inference methods often ignore subtle motion that can be a strong indicator of a sudden change in intent. Specifically, we aim to tackle 1) whether we can detect sudden jumps in operator trajectories, 2) how to appropriately use these sudden jump motions to infer an operator’s goal state, and 3) how to incorporate these discontinuous and continuous dynamics to infer operator motion. Our framework, called Psychic, models these small indicative motions through a jump-drift-diffusion stochastic differential equation to cover discontinuous and continuous dynamics. Kramers-Moyal (KM) coefficients allow us to detect jumps within a trajectory, which we pair with a statistical outlier detection algorithm to nominate goal transitions. Through identifying jumps, we can perform early detection of existing goals and discover undefined goals in unstructured scenarios. Our framework then applies a Sparse Identification of Nonlinear Dynamics (SINDy) model using KM coefficients with the goal transitions as a control input to infer an operator’s motion behavior in unstructured scenarios. We demonstrate that Psychic can produce probabilistic reachability sets and compare our strategy to a negative log-likelihood model fit. We perform a retrospective study on 600 operator trajectories in a hands-free teleoperation task to evaluate the efficacy of our open-source package, Psychic, in both offline and online learning.
|
| |
| 09:00-10:30, Paper TuI1I.241 | Add to My Program |
| Stowable Tape Spring Truss Leg for Robotic Mobility |
|
| Peña, Angelica | University of California, Berkeley |
| Galassi, Andrew | University of California, Berkeley |
| Stuart, Hannah S. | University of California, Berkeley |
Keywords: Compliant Joints and Mechanisms, Legged Robots, Wheeled Robots
Abstract: Obstructed terrains, such as boulders, crevices, and rubble, limit the locomotion of wheeled mobile robots. Transformable wheel-to-leg designs enable better traversal; small wheels provide driving efficiency and deployable legs enable stepping over obstructions. We propose an approach to such leg deployment utilizing a novel tape spring truss structure. It achieves large shape changes -- demonstrating transformation ratios from 3.77 to 5.38 between wheel radius and leg length -- in a light-weight (8 g) and compact way. Prior tape spring mechanisms have not yet used a string-tensioned truss formation. By tensioning the tape spring via a string mechanism, the wheel’s emerging deployable legs are strong enough to traverse obstacles greater than one wheel diameter. Yet, it can also be stowed using just the weight of the rover to coil the tape spring. Adjusting the string pretension allows for optimization of the leg’s transverse buckling load, resulting in a strong truss despite low mass and stowed volume. We validate the truss’s capability by incorporating it into a two-wheeled mobile rover platform, demonstrating utility in mobility across obstructed terrain.
|
| |
| 09:00-10:30, Paper TuI1I.242 | Add to My Program |
| PokeNet: Learning Kinematic Models of Articulated Objects from Human Observations |
|
| Gupta, Anmol | Arizona State University |
| Gu, Weiwei | Arizona State University |
| Patil, Omkar Deepak | Arizona State University |
| Lee, Jun Ki | Seoul National University |
| Gopalan, Nakul | Arizona State University |
Keywords: Perception for Grasping and Manipulation, Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Articulation modeling enables robots to learn joint parameters of articulated objects for effective manipulation, which can then be used downstream for skill learning or planning. Existing approaches often rely on prior knowledge about the objects, such as the number or type of joints. Some of these approaches also fail to recover occluded joints that are only revealed during interaction. Others require large numbers of multi-view images for every object, which is impractical in real-world settings. Furthermore, prior works neglect the order of manipulations, which is essential for many multi-DoF objects where one joint must be operated before another, such as a dishwasher. We introduce PokeNet, an end-to-end framework that estimates articulation models from a single human demonstration without prior object knowledge. Given a sequence of point cloud observations of a human manipulating an unknown object, PokeNet predicts joint parameters, infers manipulation order, and tracks joint states over time. PokeNet outperforms existing state-of-the-art methods, improving joint axis and state estimation accuracy by an average of over 27% across diverse objects, including novel and unseen categories. We demonstrate these gains in both simulation and real-world environments.
|
| |
| 09:00-10:30, Paper TuI1I.243 | Add to My Program |
| An HMDP-MPC Decision-Making Framework with Adaptive Safety Margins and Hysteresis for Autonomous Driving |
|
| Li, Siyuan | Loughborough University |
| Liu, Chengyuan | Loughborough University |
| Chen, Wen-Hua | The Hong Kong Polytechnic University |
Keywords: Planning under Uncertainty, Task and Motion Planning
Abstract: This paper presents a unified decision-making framework that integrates Hybrid Markov Decision Processes (HMDPs) with Model Predictive Control (MPC), augmented by velocity-dependent safety margins and a prediction-aware hysteresis mechanism. Both the ego and surrounding vehicles are modeled as HMDPs, allowing discrete maneuver transition and kinematic evolution to be jointly considered within the MPC optimization. Safety margins derived from the Intelligent Driver Model (IDM) adapt to traffic context but vary with speed, which can cause oscillatory decisions and velocity fluctuations. To mitigate this, we propose a frozen-release hysteresis mechanism with distinct trigger and release thresholds, effectively enlarging the reaction buffer and suppressing oscillations. Decision continuity is further safeguarded by a two-layer recovery scheme: a global bounded relaxation tied to IDM margins and a deterministic fallback policy. The framework is evaluated through a case study, an ablation against a no-hysteresis baseline, and large-scale randomized experiments across 18 traffic settings. Across 8,050 trials, it achieves a collision rate of only 0.05%, with 98.77% of decisions resolved by nominal MPC and minimal reliance on relaxation or fallback. These results demonstrate the robustness and adaptability of the proposed decision-making framework in heterogeneous traffic conditions.
|
| |
| 09:00-10:30, Paper TuI1I.244 | Add to My Program |
| Safety-Critical Dynamic Motion Generation for Manipulators Using Differentiable Distance Fields in Configuration Space |
|
| Chi, Xuemin | Zhejiang University, Idiap Research Institute |
| Huang, Jihao | Zhejiang University |
| Li, Yiming | Idiap Research Institute, EPFL |
| Dai, Bolun | New York University |
| Liu, Zhitao | Zhejiang University |
| Calinon, Sylvain | Idiap Research Institute, EPFL |
Keywords: Collision Avoidance, Whole-Body Motion Planning and Control, Manipulation Planning
Abstract: Generating collision-free motions in dynamic environments is a challenging problem for high-dimensional robotics, particularly under real-time constraints. Control Barrier Functions (CBFs), widely utilized in safety-critical control, have shown significant potential for motion generation. However, for high-dimensional robot manipulators, existing QP formulations and CBF-based methods rely on positional information, overlooking higher-order derivatives such as velocities. This limitation may lead to reduced success rates, decreased performance, and inadequate safety constraints. To address this, we construct time-varying CBFs (TVCBFs) that consider dynamic obstacles. Our approach leverages recent developments on distance fields for articulated manipulators, a differentiable representation that enables the mapping of objects' position and velocity into the robot's joint space, offering a comprehensive understanding of the system's interactions. This allows the manipulator to be treated as a point-mass system, thus simplifying motion generation tasks. Additionally, we introduce a time-varying control Lyapunov function (TVCLF) to enable whole-body contact motions. Our approach integrates the TVCBFs, TVCLF, and manipulator physical constraints within a unified QP framework. We validate our method through simulations and comparisons with state-of-the-art approaches, demonstrating its effectiveness on a 7-axis Franka arm in real-world experiments. Source codes, experimental data and videos are available on the project webpage: https://sites.google.com/view/sdfcdf-tvcbfs-qp
|
| |
| 09:00-10:30, Paper TuI1I.245 | Add to My Program |
| Dual Prompt-Driven Feature Encoding for Nighttime UAV Tracking |
|
| Wang, Yiheng | Tongji University |
| Fu, Changhong | Tongji University |
| Yao, Liangliang | Tongji University |
| Zuo, Haobo | University of Hong Kong |
| Zhang, Zijie | Tongji University |
Keywords: Aerial Systems: Applications, Computer Vision for Automation, Deep Learning for Visual Perception
Abstract: Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt-driven feature encoding method that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi-scale frequency-aware illumination prompts. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view-invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt-driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real-world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at https://github.com/yiheng-wang-duke/DPTracker.
|
| |
| 09:00-10:30, Paper TuI1I.246 | Add to My Program |
| Semantic-Level Conflict Traffic Scenario Generation Via Spatiotemporal Polygon Anchors |
|
| Li, Yunwei | Tsinghua University |
| Wang, Anran | East China University of Science and Technology |
| Wu, Siyu | Tsinghua University |
| Fu, Shengjie | Tsinghua University |
| Feng, Shuo | Tsinghua University |
| Wang, Hong | Tsinghua University |
| Li, Jun | Tsinghua University |
Keywords: Intelligent Transportation Systems, Automation Technologies for Smart Cities, AI-Based Methods
Abstract: Autonomous Driving Systems (ADS) require rigorous and complex testing under diverse conditions to fulfill various demands and purposes of testing tasks, such as occlusion-triggered events, necessitating semantic-level control in scenario generation. Existing methods, reliant on low-level state controls, struggle to represent high-level semantic intents for task-oriented testing. We propose SPATSG, a novel framework for event-driven, semantically aligned traffic scenario generation, leveraging Spatiotemporal Polygon Anchors (SPA) to bridge high-level test requirements and low-level diffusion guidance. SPAs encapsulate critical geometric and temporal patterns of traffic agents, derived from a set of targeted scenarios. During diffusion denoising, SPATSG integrates SPAs via an auxiliary loss to steer sampling toward desired semantics. A dynamic resampling strategy further intensifies guidance and prioritizes promising trajectory candidates progressively to balance exploration and refinement. We evaluate SPATSG on SinD, a Chinese intersection benchmark featuring complex interactions and diverse conflicts. Experiments on occlusion-triggered scenario generation show that SPATSG demonstrates superior semantic controllability, effectively reveals risk events across ADS, and maintains diversity and realism compared to baselines. This work offers a principled, interpretable approach for semantically controllable ADS testing and evaluation.
|
| |
| 09:00-10:30, Paper TuI1I.247 | Add to My Program |
| Trajectory Optimization through Mixed-Integer Optimization of Contact Dynamics for Switching End Effector Locomotion |
|
| Morgan, Jared | Worcester Polytechnic Institute |
| Agheli, Mahdi | Worcester Polytechnic Institute |
Keywords: Legged Robots, Optimization and Optimal Control, Dynamics
Abstract: Trajectory optimizers for legged robots typically assume a single end effector on each leg, often a foot or wheel, without switching to another. Robots employing point-modeled end effectors, compared to those with wheeled end effectors, often benefit in adaptability and maneuverability but at the cost of higher energy expenditure and lower speed. While current hardware supports switching between these two end-effector types, existing research has largely focused on maintaining stability during switching, with little attention to determining when each type is most effective. To our knowledge, this paper introduces the first framework that simultaneously optimizes both trajectories and end-effector contact dynamics through mixed-integer optimization. We validate our approach by solving and executing trajectories with a whole-body controller in Gazebo across a variety of terrains, including ramps and stepping stones. The results show that our framework not only handles diverse terrains but also exploits contact dynamics to reduce cost of transport and increase speed compared to foot-only locomotion.
|
| |
| 09:00-10:30, Paper TuI1I.248 | Add to My Program |
| Multi-Robot Collision Avoidance with Probabilistic Mahalanobis Distance Constraints |
|
| Chen, Zhaodong | Sun Yat-Sen University |
| Liu, Dingfu | Sun Yat-Sen University |
| Feng, Chuqing | Sun Yat-Sen University |
| Shan, Yunxiao | Sun Yat-Sen University |
Keywords: Multi-Robot Systems, Collision Avoidance, Planning under Uncertainty
Abstract: In multi-robot systems operating under uncertainty, maintaining safe inter-robot distances while avoiding collisions with obstacles is crucial. Although chance-constrained methods have been widely adopted to handle such uncertainties, existing approaches often exhibit conservatism due to their reliance on linearized integration regions. To address this limitation, this paper introduces a novel probabilistic Mahalanobis distance constraint that enables tighter reformulations of collision avoidance constraints both between robots and between robots and obstacles. These constraints are integrated into a Model Predictive Path Integral (MPPI) control framework for efficient trajectory optimization. The effectiveness of the proposed method is validated through comprehensive simulations comparing it against state-of-the-art approaches, as well as through real-world experiments conducted across various scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.249 | Add to My Program |
| Suction Leap-Hand: Suction Cups on a Multi-Fingered Hand Enable Embodied Dexterity and In-Hand Teleoperation |
|
| Zhaole, Sun | The University of Edinburgh |
| Mao, Xiaofeng | Edinburgh University |
| Zhu, Jihong | University of York |
| Zhang, Yuanlong | Tsinghua University |
| Fisher, Robert | University of Edinburgh |
Keywords: Dexterous Manipulation, In-Hand Manipulation, Multifingered Hands
Abstract: Dexterous in-hand manipulation remains a foundational challenge in robotics, with progress often constrained by the prevailing paradigm of imitating the human hand. This anthropomorphic approach creates two critical barriers: 1) it limits robotic capabilities to tasks humans can already perform, and 2) it makes data collection for learning-based methods exceedingly difficult. Both challenges are caused by traditional force-closure which requires coordinating complex, multi-point contacts based on friction, normal force, and gravity to grasp an object. In this work, we propose a paradigm shift: moving away from replicating human mechanics toward the design of novel robotic embodiments. We introduce the Suction Leap Hand (SLeap Hand), a multi-fingered hand featuring integrated fingertip suction cups that realize a new form of suction-enabled dexterity. By replacing complex force-closure grasps with stable, single-point adhesion, our design fundamentally simplifies in-hand teleoperation and facilitates the collection of high-quality demonstration data. More importantly, this suction-based embodiment unlocks a new class of dexterous skills that are difficult or even impossible for the human hand, such as one-handed paper cutting and in-hand writing. Our work demonstrates that by moving beyond anthropomorphic constraints, novel embodiments can not only lower the barrier for collecting robust manipulation data but also enable the stable, single-handed completion of tasks that would typically require two human hands. Our webpage is https://sites.google.com/view/sleaphand.
|
| |
| 09:00-10:30, Paper TuI1I.250 | Add to My Program |
| Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale |
|
| Jülg, Tobias Thomas | University of Technology Nuremberg |
| Krack, Pierre | University of Technology Nuremberg |
| Bien, Seongjin | University of Technology Nuremberg |
| Blei, Yannik | University of Technology Nuremberg |
| Gamal, Khaled | UTN |
| Nakahara, Ken | TU Dresden |
| Hechtl, Johannes | Siemens, Technische Universität Nürnberg (UTN) |
| Calandra, Roberto | TU Dresden |
| Burgard, Wolfram | University of Technology Nuremberg |
| Walter, Florian | Technical University Munich |
Keywords: Software Architecture for Robotic and Automation, Imitation Learning, Transfer Learning
Abstract: Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi-Zero on multiple robots and shed light on how simulation data can improve real-world policy performance. Our code, datasets and videos are available at https://robotcontrolstack.github.io/
|
| |
| 09:00-10:30, Paper TuI1I.251 | Add to My Program |
| SEEC: Stable End-Effector Control with Model-Enhanced Residual Learning for Humanoid Loco-Manipulation |
|
| Jang, Jaehwi | Georgia Institute of Technology |
| Wang, Zhuoheng | Tsinghua University |
| Zhou, Ziyi | Georgia Institute of Technology |
| Wu, Feiyang | Georgia Institute of Technology |
| Zhao, Ye | Georgia Institute of Technology |
Keywords: Legged Robots, Reinforcement Learning, Mobile Manipulation
Abstract: Arm end-effector stabilization is essential for humanoid loco-manipulation tasks, yet it remains challenging due to the high degrees of freedom and inherent dynamic instability of bipedal robot structures. Previous model-based controllers achieve precise end-effector control but rely on precise dynamics modeling and estimation, which often struggle to capture real-world factors (e.g., friction and backlash) and thus degrade in practice. On the other hand, learning-based methods can better mitigate these factors via exploration and domain randomization, and have shown potential in real-world use. However, they often overfit to training conditions, requiring retraining with the entire body, and still struggle to adapt to unseen scenarios. To address these challenges, we propose a novel stable end-effector control (SEEC) framework with model-enhanced residual learning that learns to achieve precise and robust end-effector compensation for lower-body induced disturbances through model-guided reinforcement learning (RL) with a perturbation generator. This design allows the upper-body policy to achieve accurate end-effector stabilization as well as adapt to unseen locomotion controllers with no additional training. We validate our framework in different simulators and transfer trained policies to the Booster T1 humanoid robot. Experiments demonstrate that our method consistently outperforms baselines and robustly handles diverse and demanding loco-manipulation tasks.
|
| |
| 09:00-10:30, Paper TuI1I.252 | Add to My Program |
| Proprioceptive Image: An Image Representation of Proprioceptive Data from Quadruped Robots for Contact Estimation Learning |
|
| Fischer Abati, Gabriel | Istituto Italiano Di Tecnologia |
| Soares, João Carlos Virgolino | Istituto Italiano Di Tecnologia |
| Turrisi, Giulio | Istituto Italiano Di Tecnologia |
| Barasuol, Victor | Istituto Italiano Di Tecnologia |
| Semini, Claudio | Istituto Italiano Di Tecnologia |
Keywords: Legged Robots, AI-Based Methods
Abstract: This paper presents a novel approach for representing proprioceptive time-series data from quadruped robots as structured two-dimensional images, enabling the use of convolutional neural networks for learning locomotion-related tasks. The proposed method encodes temporal dynamics from multiple proprioceptive signals, such as joint positions, IMU readings, and foot velocities, while preserving the robot’s morphological structure in the spatial arrangement of the image. This transformation captures inter-signal correlations and gait-dependent patterns, providing a richer feature space than direct time-series processing. We apply this concept to the problem of contact estimation, a key capability for stable and adaptive locomotion on diverse terrains. Experimental evaluations on both real-world datasets and simulated environments show that our image-based representation consistently enhances prediction accuracy and generalization over conventional sequence-based models, underscoring the potential of cross-modal encoding strategies for robotic state learning. Our method achieves superior performance on the contact dataset, improving contact state accuracy from 87.7% to 94.5% over the recently proposed MI-HGNN method, while using a 15-times-shorter window size.
|
| |
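The entry above encodes multi-channel proprioceptive windows as 2D images for CNN input. A hedged minimal sketch of that general idea (the `proprio_image` helper and the row-per-channel layout are illustrative assumptions, not the paper's exact morphological arrangement):

```python
import numpy as np

def proprio_image(signals):
    """Stack per-channel time series (list of 1D arrays, one per signal)
    into a 2D 'image' with one row per channel.

    Illustrative sketch only: the paper arranges rows to preserve the
    robot's morphology; here we simply stack and normalize each row."""
    rows = [np.asarray(s, dtype=np.float32) for s in signals]
    img = np.stack(rows, axis=0)
    # per-row min-max normalization to [0, 1]; constant rows map to 0
    mn = img.min(axis=1, keepdims=True)
    rng = np.ptp(img, axis=1, keepdims=True)
    return (img - mn) / np.where(rng == 0, 1, rng)
```

The resulting array can be fed directly to a 2D convolutional network, letting filters pick up inter-signal correlations across rows and temporal patterns across columns.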
| 09:00-10:30, Paper TuI1I.253 | Add to My Program |
| Proactive Grasp Assistance in a Robotic Hand Exoskeleton Improves Performance and Preference in Challenging Tasks |
|
| Davis, Benjamin | University of California, Berkeley |
| Huynh, Emily | University of California, Berkeley |
| Stuart, Hannah S. | University of California, Berkeley |
Keywords: Wearable Robotics, Physical Human-Robot Interaction, Physically Assistive Devices
Abstract: Advancements in perception, planning, and control enable the development of wearable robots capable of proactively assisting users in avoiding potentially negative outcomes. However, the introduction of robotic assistance is often associated with a loss in the sense of agency, a factor traditionally associated with overall device acceptance. Recent work provides a different perspective, showing that contextual proactive assistance is well received for teleoperation or shared-workspace tasks. Still, no prior work has investigated the impact of proactive assistance for wearable grasping devices, where physical interactions have increased potential for disrupting the user's experience. In this study, we analyze the impact of proactive assistance in a hand exoskeleton with an abstracted grasping task of varying difficulty. We show that, in general, the presence of assistance does not significantly degrade the user experience or the sense of agency. In fact, in a difficult task, subjects strongly prefer proactive assistance, likely as a result of its provided utility. When the task is easily completed without assistance, subjects indicate no strong preference for assisted conditions. Our results challenge the notion of a direct trade-off between robotic assistance and agency, suggesting that well-designed assistance can improve performance and user preference without compromising the user's sense of control.
|
| |
| 09:00-10:30, Paper TuI1I.254 | Add to My Program |
| CrazyMARL: Decentralized Direct Motor Control Policies for Cooperative Aerial Transport of Cable-Suspended Payloads |
|
| Lorentz, Viktor | TU Berlin |
| Wahba, Khaled | TU Berlin |
| Auddy, Sayantan | TU Berlin |
| Toussaint, Marc | TU Berlin |
| Hoenig, Wolfgang | TU Berlin |
Keywords: Aerial Systems: Mechanics and Control, Cooperating Robots, Reinforcement Learning
Abstract: Collaborative transportation of cable-suspended payloads by teams of Unmanned Aerial Vehicles (UAVs) has the potential to enhance payload capacity, adapt to different payload shapes, and provide built-in compliance, making it attractive for applications ranging from disaster relief to precision logistics. However, multi-UAV coordination under disturbances, nonlinear payload dynamics, and slack–taut cable modes remains a challenging control problem. To our knowledge, no prior work has addressed these cable mode transitions in the multi-UAV context, instead relying on simplifying rigid-link assumptions. We propose CrazyMARL, a decentralized Reinforcement Learning (RL) framework for multi-UAV cable-suspended payload transport. Simulation results demonstrate that the learned policies can outperform classical decentralized controllers in terms of disturbance rejection and tracking precision, achieving an 80% recovery rate from harsh conditions compared to 44% for the baseline method. We also achieve successful zero-shot sim-to-real transfer and demonstrate that our policies are highly robust under harsh conditions, including wind, random external disturbances, and transitions between slack and taut cable dynamics. This work paves the way for autonomous, resilient UAV teams capable of executing complex payload missions in unstructured environments. Code and videos can be found on the website: https://imrclab.github.io/CrazyMARL.
|
| |
| 09:00-10:30, Paper TuI1I.255 | Add to My Program |
| Force-Guided Collaborative Control and Digital Twin of a Snake Robot for Cranial Bone Surgery |
|
| Law, Jones | University of Toronto |
| Fan, Liheng | University of Toronto |
| Sabahi-Pourkashani, Camron | University of Waterloo |
| Roshanfar, Majid | Hospital for Sick Children (SickKids) |
| Mahmood, Kashfia | University of Toronto |
| Munawar, Adnan | Johns Hopkins University |
| Looi, Thomas | Hospital for Sick Children |
| Diller, Eric D. | University of Toronto |
| Podolsky, Dale | University of Toronto |
|
|
| |
| 09:00-10:30, Paper TuI1I.256 | Add to My Program |
| DiffDef: A Diffusion Model for Generating Multimodal Goal Shapes from Demonstrations for Deformable Object Manipulation |
|
| Thach, Bao | University of Utah |
| Watts, Tanner | Vanderbilt University |
| Kim, Siyeon | University of Utah |
| Jordan, Britton | University of Utah |
| Devendran Shanthi, Mohanraj | University of Utah |
| Ho, Shing-Hei | University of Utah |
| Ferguson, James | University of Utah |
| Hermans, Tucker | University of Utah |
| Kuntz, Alan | Vanderbilt University |
Keywords: Surgical Robotics: Planning, Learning from Demonstration, Bimanual Manipulation
Abstract: Deformable object manipulation is pivotal to numerous real-world robotic applications. A promising paradigm in this field is the shape servoing task, focusing on controlling deformable objects into desired goal shapes. However, prior works typically rely on impractical goal shape acquisition methods, such as laborious domain-knowledge engineering or manual manipulation. Crucially, existing methods fail in multi-modal goal settings, where multiple distinct goal shapes can all lead to successful task completion, a common scenario in many robotic applications. In this paper, we address this problem by developing DiffDef, a novel neural network that leverages a denoising diffusion model to learn a distribution over multiple valid goal shapes, rather than predicting a single deterministic outcome. DiffDef enables the generation of diverse goal shapes, thereby avoiding the mode-averaging artifacts inherent in deterministic models used by previous approaches. We demonstrate our method’s effectiveness on several robotic tasks inspired by both manufacturing and surgical applications, both in simulation and on two physical robotic platforms: the da Vinci Research Kit (dVRK) robot and a bimanual KUKA-based robotic system.
|
| |
| 09:00-10:30, Paper TuI1I.257 | Add to My Program |
| Safe Robotics Control with Directional Projection Control Barrier Functions Via Differentiable Optimization |
|
| Wei, Yan | Zhejiang University of Technology |
| Yao, Jiajie | Zhejiang University of Technology |
| Yu, Xinyi | Zhejiang University of Technology |
| Ou, Linlin | Zhejiang University of Technology |
Keywords: Robot Safety, Collision Avoidance, Optimization and Optimal Control
Abstract: Collision avoidance is essential for robotic systems. This paper presents a method for designing directional projection control barrier functions (CBFs) based on differentiable optimization for second-order robotic systems. The approach reduces high-order CBFs to first-order ones and estimates collision risk by examining the intersection of projections along the relative velocity direction. Under the assumption that both the target and obstacles are convex polyhedra whose projections yield convex polygons, a tunable uniform scaling function, centered at the centroid, is introduced to pad the convex polygon; the strict convexity of this padded region is rigorously proven. Using the minimum scaling factor that leads to intersection between the two projected convex polygons, a CBF is constructed and incorporated into a tracking controller to ensure collision avoidance. The effectiveness of the proposed method is validated through simulations with a 2D mobile robot and a 7-DOF Franka manipulator.
|
| |
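The directional-projection construction in the entry above is specific to that paper, but the underlying CBF safety-filter step it relies on can be sketched generically. For a scalar control-affine system the filtering QP has a closed form; the `cbf_filter` helper below is a hypothetical one-dimensional illustration, not the authors' method:

```python
def cbf_filter(u_des, h, dhdx, f, g, alpha=1.0):
    """Minimal scalar CBF safety filter for x' = f + g*u.

    Enforces the barrier condition  dh/dx*(f + g*u) + alpha*h >= 0
    by minimally modifying the desired control u_des (closed-form
    solution of the 1D quadratic program). Illustrative sketch only."""
    a = dhdx * g                 # control-dependent part of the constraint
    b = dhdx * f + alpha * h     # control-independent part
    if a * u_des + b >= 0:
        return u_des             # desired control already satisfies the CBF
    if a == 0:
        return u_des             # control cannot influence the constraint
    # Project u_des onto the constraint boundary a*u + b = 0
    return -b / a
```

For example, with h = 2 - x (safe set x < 2), driving toward the boundary gets clipped while motion away from it passes through unchanged.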
| 09:00-10:30, Paper TuI1I.258 | Add to My Program |
| Real-Time Model Predictive Control of Nonlinear Coupled Joints Using MPPI: Application to Humanoid Ankle Joints |
|
| Park, Gunoo | CNRS |
| Bak, Jaewan | Korea Institute of Science and Technology |
| Seo, Yunsoo | University of Texas at Austin |
| Im, Euncheol | Korea Institute of Science and Technology |
| Lee, Hoseok | Korea Institute of Science and Technology |
| Lee, Jongbok | KIST |
| Mansard, Nicolas | CNRS |
| Lee, Jongwon | Korea Institute of Science and Technology |
| Lee, Yisoo | Korea Institute of Science and Technology |
Keywords: Actuation and Joint Mechanisms, Optimization and Optimal Control, Humanoid Robot Systems
Abstract: Modern robotic systems increasingly employ nonlinear coupled joints, which present significant challenges in control. Unlike traditional serial-chain configurations, where simplicity was the primary concern, parallel mechanisms such as those found in humanoid ankle joints add another layer of complexity. In this work, we propose an actuation controller for nonlinear coupled joints based on Model Predictive Path Integral (MPPI) control, a sampling-based model predictive control framework that handles nonlinearity and coupling effects simultaneously. The highly nonlinear actuator-joint mapping, expressed through a lightweight neural network, enables intuitive controller design by exposing actuator-space control to joint-space commands. Our method also allows imposing joint limit constraints, enabling safe operation on a real robot platform. To experimentally validate our method, joint position control of a 2-DOF humanoid ankle joint was conducted, demonstrating accurate, real-time control and constraint-respecting behavior.
|
| |
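MPPI, as named in the entry above, is a general sampling-based MPC scheme: roll out noisy control sequences through a model, weight them by exponentiated negative cost, and apply the weighted first control. The `mppi_step` helper below is a hypothetical sketch for a 1D double integrator, not the paper's neural-network ankle controller:

```python
import numpy as np

def mppi_step(x0, v0, target, horizon=20, samples=256, lam=1.0,
              sigma=1.0, dt=0.05, rng=None):
    """One MPPI update for a 1D double integrator (position x, velocity v,
    control = acceleration). Illustrative sketch of the rollout-weighting
    idea only; a fixed seed keeps the example deterministic."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.normal(0.0, sigma, size=(samples, horizon))  # sampled controls
    costs = np.zeros(samples)
    for k in range(samples):
        x, v = x0, v0
        for t in range(horizon):
            v += noise[k, t] * dt
            x += v * dt
            costs[k] += (x - target) ** 2 + 1e-3 * noise[k, t] ** 2
    w = np.exp(-(costs - costs.min()) / lam)  # path-integral weights
    w /= w.sum()
    return w @ noise[:, 0]                    # weighted first control
```

In a real controller this step runs in receding horizon, and the dynamics rollout would be replaced by the (possibly learned) actuator-joint model.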
| 09:00-10:30, Paper TuI1I.259 | Add to My Program |
| Physics-Informed Diffusion Mamba Transformer for Real-World Driving |
|
| Zhou, Hang | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhang, Qiang | The Hong Kong University of Science and Technology (Guangzhou) |
| Liu, Peiran | Hong Kong University of Science and Technology (GuangZhou) |
| Qin, Yihao | The Hong Kong University of Science and Technology (Guangzhou) |
| Yan, Zhaoxu | MoSense Technologies (Hong Kong) Limited |
| Ji, Yiding | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Imitation Learning, Autonomous Vehicle Navigation, Learning from Demonstration
Abstract: Autonomous driving systems demand trajectory planners that not only model the inherent uncertainty of future motions but also respect complex temporal dependencies and underlying physical laws. While diffusion-based generative models excel at capturing multi-modal distributions, they often fail to incorporate long-term sequential contexts and domain-specific physical priors. In this work, we bridge these gaps with two key innovations. First, we introduce a Diffusion Mamba Transformer architecture that embeds Mamba and attention into the diffusion process, enabling more effective aggregation of sequential input contexts from sensor streams and past motion histories. Second, we design a Port-Hamiltonian Neural Network module that seamlessly integrates energy-based physical constraints into the diffusion model, thereby enhancing trajectory predictions with both consistency and interpretability. Extensive evaluations on standard autonomous driving benchmarks demonstrate that our unified framework significantly outperforms state-of-the-art baselines in predictive accuracy, physical plausibility, and robustness, thereby advancing safe and reliable motion planning.
|
| |
| 09:00-10:30, Paper TuI1I.260 | Add to My Program |
| Robotic Relay of Free-Space Optical Beams for Medical Applications |
|
| Ma, Guangshen | Duke University |
| Codd, Patrick | Duke University |
| Draelos, Mark | University of Michigan |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics, Sensor-based Control
Abstract: Medical robotic laser systems require precise positioning and movements to control laser beam paths associated with sensors and other optical systems in many applications (e.g., laser surgery, laser-based tissue diagnosis). While existing robotic laser beam control systems were developed for microscale control to achieve highly precise steering and focusing, they assume a single robot, which is limiting in applications where the beam path must cover large areas and angles (e.g., 360-degree full-view object scanning). To expand imaging flexibility, we propose a novel robot-mirror framework that uses robot-attached mirrors to control a 3D free-space laser beam, referred to as an “N-mirror-N-robot system,” where N is the number of mirrors and robots. This framework allows for general laser beam planning to trace targets based on geometric constraints of 3D obstacles and fixed orientations and positions, with an unlimited number of robot-and-mirror combinations. We develop a prototype for the special case of a single mirror attached to the robot (N = 1). This prototype integrates an RGB-D depth camera for object tracking, a 6-DOF robot-attached mirror, and a laser diode source. We propose a computational framework for system kinematics and calibration. Simulation and real experiments are conducted to track specified paths, markers, phantoms, and real tissue to verify the system's feasibility. The results show an average object tracking error of approximately 2.0 mm, which is close to the depth accuracy of the camera. This N = 1 prototype shows promise for the N > 1 case and the potential for general 3D laser planning under arbitrary geometric constraints.
|
| |
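The beam relay in the entry above is ultimately built from mirror reflections. The basic ray-optics step — reflecting an incident direction d about a mirror with unit normal n via r = d - 2(d·n)n — can be sketched as follows (a generic building block, not the authors' full system kinematics):

```python
import numpy as np

def reflect(d, n):
    """Reflect incident beam direction d off a planar mirror with
    normal n:  r = d - 2 (d . n) n.  The normal is re-normalized so
    callers may pass any nonzero normal vector."""
    n = n / np.linalg.norm(n)
    return d - 2.0 * np.dot(d, n) * n
```

Chaining this step once per robot-attached mirror gives the N-mirror beam path on which the planning and calibration framework operates.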
| 09:00-10:30, Paper TuI1I.261 | Add to My Program |
| GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-Training in Autonomous Driving |
|
| Xu, Shaoqing | University of Macau, Xiaomi EV |
| Li, Fang | XiaoMI EV |
| Jiang, Shengyin | Xiaomi |
| Song, Ziying | Beijing Jiaotong University |
| Yang, Zhi-Xin | University of Macau |
Keywords: Computer Vision for Transportation, Object Detection, Segmentation and Categorization, Automation Technologies for Smart Cities
Abstract: Self-supervised learning has made substantial strides in image processing, while visual pre-training for autonomous driving is still in its infancy. Existing methods often focus on learning geometric scene information while neglecting texture, or treat both aspects separately, hindering comprehensive scene understanding. In this context, we introduce GaussianPretrain, a novel pre-training paradigm that achieves a holistic understanding of the scene by uniformly integrating geometric and texture representations. Conceptualizing 3D Gaussian anchors as volumetric LiDAR points, our method learns a deepened understanding of scenes to enhance pre-training performance with detailed spatial structure and texture, running 40.6% faster than the NeRF-based method UniPAD while using only 70% of the GPU memory. We demonstrate the effectiveness of GaussianPretrain across multiple 3D perception tasks, showing significant performance improvements: a 7.05% increase in NDS for 3D object detection, a 1.9% boost in mAP for HD map construction, and a 0.8% improvement in occupancy prediction. These significant gains highlight GaussianPretrain’s theoretical innovation and strong practical potential, promoting visual pre-training development for autonomous driving. Source code is available at https://github.com/Public-BOTs/GaussianPretrain.
|
| |
| 09:00-10:30, Paper TuI1I.262 | Add to My Program |
| SpikeATac: A Multimodal Tactile Finger with Taxelized Dynamic Sensing for Dexterous Manipulation |
|
| Chang, Eric T. | Columbia University |
| Ballentine, Peter | Columbia University |
| He, Zhanpeng | Columbia University |
| Kim, DoGon | Columbia University |
| Jiang, Kai | Columbia University |
| Liang, Hua Hsuan | Columbia University |
| Palacios, Joaquin | Columbia University |
| Wang, William | Columbia University |
| Piacenza, Pedro | Samsung Research America |
| Kymissis, Ioannis | Columbia University |
| Ciocarlie, Matei | Columbia University |
Keywords: Force and Tactile Sensing, In-Hand Manipulation, Multifingered Hands
Abstract: In this work, we introduce SpikeATac, a multimodal tactile finger combining a taxelized and highly sensitive dynamic response (PVDF) with a static transduction method (capacitive) for multimodal touch sensing. Named for its 'spiky' response, SpikeATac's 16-taxel PVDF film sampled at 4 kHz provides fast, sensitive dynamic signals at the very onset and breaking of contact. We characterize the sensitivity of the different modalities, and show that SpikeATac provides the ability to stop quickly and delicately when grasping fragile, deformable objects. Beyond parallel grasping, we show that SpikeATac can be used in a learning-based framework to achieve new capabilities on a dexterous multifingered robot hand. We use a learning recipe that combines reinforcement learning from human feedback with tactile-based rewards to fine-tune the behavior of a policy to modulate force. Our hardware platform and learning pipeline together enable a difficult dexterous and contact-rich task that has not previously been achieved: in-hand manipulation of fragile objects. Videos are available at roamlab.github.io/spikeatac.
|
| |
| 09:00-10:30, Paper TuI1I.263 | Add to My Program |
| From Evolutionary Design to Additive Manufacturing: Closing the Loop for Magnetic Soft Robots |
|
| Abu-Shaera, Rawaan | McMaster University |
| Palanichamy, Veerash | McMaster University |
| Clancy, Kaitlyn | McMaster University |
| Norouziani, Fatemeh | McMaster University |
| Kelly, Stephen | McMaster University |
| Onaizah, Onaizah | McMaster University |
| |
| 09:00-10:30, Paper TuI1I.264 | Add to My Program |
| VG3T: Visual Geometry Grounded Gaussian Transformer |
|
| Kim, Junho | Kookmin University |
| Lee, Seongwon | Kookmin University |
Keywords: Deep Learning for Visual Perception, Recognition, Visual Learning
Abstract: Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This novel approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian initialization methods. Our VG3T shows a notable 1.7-percentage-point improvement in mIoU while using 46% fewer primitives than the previous state-of-the-art on the nuScenes benchmark, highlighting its superior efficiency and performance.
|
| |
| 09:00-10:30, Paper TuI1I.265 | Add to My Program |
| Personalized Autonomous Driving Via Optimal Control with Clearance Constraints from Questionnaires |
|
| Lim, YongJae | Seoul National University |
| Kim, Dabin | Seoul National University |
| Kim, H. Jin | Seoul National University |
Keywords: Autonomous Vehicle Navigation, Constrained Motion Planning, Motion and Path Planning
Abstract: Driving without considering the preferred separation distance from surrounding vehicles may cause discomfort for users. To address this limitation, we propose a planning framework that explicitly incorporates user preferences regarding the desired level of safe clearance from surrounding vehicles. We design a questionnaire purposefully tailored to capture user preferences relevant to our framework, while minimizing unnecessary questions. Specifically, the questionnaire considers various interaction-relevant factors, including the size, speed, position, and maneuvers of surrounding vehicles, as well as the maneuvers of the ego vehicle. The response indicates the user-preferred clearance for the scenario defined by the question and is incorporated as constraints in the optimal control problem. However, it is impractical to account for all possible scenarios that may arise in a driving environment within a single optimal control problem, as the resulting computational complexity renders real-time implementation infeasible. To overcome this limitation, we approximate the original problem by decomposing it into multiple subproblems, each dealing with one fixed scenario. We then solve these subproblems in parallel and select one using the cost function from the original problem. To validate our work, we conduct simulations using different user responses to the questionnaire. We assess how effectively our planner reflects user preferences compared to preference-agnostic baseline planners by measuring preference alignment.
|
| |
| 09:00-10:30, Paper TuI1I.266 | Add to My Program |
| A Whole-Body Control Framework for Human-Like Walking with Knee Stretch on Flat-Foot Humanoids |
|
| Kim, Taehyun | Korea University, Korea Institute of Science and Technology (KIST) |
| Yoo, Sookyoung | Korea Institute of Science and Technology |
| Lim, Myo-Taeg | Korea University |
| Oh, Yonghwan | Korea Institute of Science & Technology (KIST) |
|
|
| |
| 09:00-10:30, Paper TuI1I.267 | Add to My Program |
| Safe Explicable Policy Search |
|
| Hanni, Akkamahadevi | Arizona State University |
| Montano, Jonathan | Arizona State University |
| Zhang, Yu (Tony) | Arizona State University |
Keywords: Human-Robot Teaming, Safety in HRI, Human-Centered Robotics
Abstract: When users work with AI agents, they form conscious or subconscious expectations of them. Meeting user expectations is crucial for such agents to engage in successful interactions and teaming. However, users may form expectations of an agent that differ from the agent’s planned behaviors. These differences lead to the consideration of two separate decision models in the planning process to generate explicable behaviors. However, little has been done to incorporate safety considerations, especially in a learning setting. We present Safe Explicable Policy Search (SEPS), which aims to provide a learning approach to explicable behavior generation while minimizing the safety risk, both during and after learning. We formulate SEPS as a constrained optimization problem where the agent aims to maximize an explicability score subject to constraints on safety and a suboptimality criterion based on the agent’s model. SEPS innovatively combines the capabilities of Constrained Policy Optimization and Explicable Policy Search to introduce the capability of generating safe explicable behaviors to domains with continuous state and action spaces, which is critical for robotic applications. We evaluate SEPS in safety-gym environments and with a physical robot experiment to show its efficacy and relevance in human-AI teaming.
|
| |
| 09:00-10:30, Paper TuI1I.268 | Add to My Program |
| Uncertainty-Aware Adaptive Dynamics for Underwater Vehicle–Manipulator Robots |
|
| Morgan, Edward | Louisiana State University |
| Dadson, Nenyi Kweku | Louisiana State University |
| Barbalata, Corina | Louisiana State University |
Keywords: Marine Robotics, Mobile Manipulation, Calibration and Identification
Abstract: Accurate and adaptive dynamic models are critical for underwater vehicle–manipulator systems where hydrodynamic effects induce time‐varying parameters. This paper introduces a novel uncertainty‐aware adaptive dynamics model framework that remains linear in lumped vehicle and manipulator parameters, and embeds convex physical consistency constraints during online estimation. Moving horizon estimation is used to stack horizon regressors, enforce realizable inertia, damping, friction, and hydrostatics, and quantify uncertainty from parameter evolution. Experiments on a BlueROV2 Heavy with a 4‐DOF manipulator demonstrate rapid convergence and calibrated predictions. Manipulator fits achieve R^2=0.88 to 0.98 with slopes near unity, while vehicle surge, heave, and roll are reproduced with good fidelity under stronger coupling and noise. Median solver time is approximately 0.023s per update, confirming online feasibility. A comparison against a fixed parameter model shows consistent reductions in MAE and RMSE across degrees of freedom. Results indicate physically plausible parameters and confidence intervals with near 100% coverage, enabling reliable feedforward control and simulation in underwater environments.
|
| |
| 09:00-10:30, Paper TuI1I.269 | Add to My Program |
| GM3: A General Physical Model for Micro-Mobility Vehicles |
|
| Cai, Grace | University of Maryland, College Park |
| Parepally, Nithin | University of Maryland, College Park |
| Zheng, Laura | University of Maryland, College Park |
| Lin, Ming C. | University of Maryland at College Park |
Keywords: Simulation and Animation, Dynamics, Kinematics
Abstract: Modeling the dynamics of micro-mobility vehicles (MMV) is becoming increasingly important for training autonomous vehicle systems and building urban traffic simulations. However, mainstream tools rely on variants of the Kinematic Bicycle Model (KBM) or mode-specific physics that miss tire slip, load transfer, and rider/vehicle lean. To our knowledge, no unified, physics-based model captures these dynamics across the full range of common MMVs and wheel layouts. We propose the "Generalized Micro-mobility Model" (GM3), a tire-level formulation based on the tire brush representation that supports arbitrary wheel configurations, including single/double track and multi-wheel platforms. We introduce an interactive model-agnostic evaluation and visualization framework that decouples vehicle/layout specification from dynamics to compare the GM3 with the KBM and other models, consisting of fixed step RK4 integration, human-in-the-loop and scripted control, real-time trajectory traces, and logging for analysis. We also empirically validate the GM3 on the Stanford Drone Dataset's deathCircle (roundabout) scene.
|
| |
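The entry above contrasts GM3 with the Kinematic Bicycle Model (KBM) baseline, which ignores tire slip, load transfer, and lean. For reference, one forward-Euler KBM step looks like the sketch below (the standard textbook model, not GM3's tire-brush formulation):

```python
import math

def kbm_step(x, y, yaw, v, steer, wheelbase, dt):
    """One forward-Euler step of the Kinematic Bicycle Model.

    State: rear-axle position (x, y) and heading yaw; inputs are
    speed v and front steering angle steer. Slip-free by construction,
    which is exactly the limitation GM3 addresses."""
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / wheelbase * math.tan(steer) * dt
    return x, y, yaw
```

Tire-level models such as GM3 replace the last line with per-wheel slip-dependent forces, which is what lets them capture skidding and lean across different wheel layouts.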
| 09:00-10:30, Paper TuI1I.270 | Add to My Program |
| RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation |
|
| Jiang, Yuming | Alibaba DAMO Academy |
| Huang, Siteng | Alibaba |
| Xue, Shengke | Alibaba Group |
| Zhao, Yaxi | Alibaba |
| Cen, Jun | The Hong Kong University of Science and Technology |
| Leng, Sicong | Nanyang Technological University |
| Li, Kehan | Alibaba Group |
| Guo, Jiayan | Alibaba DAMO Academy |
| Wang, Kexiang | Alibaba Group |
| Chen, Mingxiu | Alibaba |
| Wang, Fan | Alibaba Group |
| Zhao, Deli | Alibaba Group |
| Li, Xin | Alibaba DAMO Academy |
Keywords: Deep Learning Methods
Abstract: This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model to predict future frames based on an image and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
|
| |
| 09:00-10:30, Paper TuI1I.271 | Add to My Program |
| Tightly-Coupled Dynamic Object Tracking and RGB-D Inertial Odometry Estimation with Dual Quadrics |
|
| Shimada, Toyozo | Toyohashi University of Technology |
| Koide, Kenji | National Institute of Advanced Industrial Science and Technology |
| Takanose, Aoki | National Institute of Advanced Industrial Science and Technology |
| Oishi, Shuji | National Institute of Advanced Industrial Science and Technology (AIST) |
| Yokozuka, Masashi | Nat. Inst. of Advanced Industrial Science and Technology |
| Miura, Jun | Toyohashi University of Technology |
|
|
| |
| 09:00-10:30, Paper TuI1I.272 | Add to My Program |
| Preventing Robotic Jailbreaking Via Multimodal Domain Adaptation |
|
| Marchiori, Francesco | University of Padova |
| Sinha, Rohan | Stanford University |
| Agia, Christopher George | Stanford University |
| Robey, Alexander | University of Pennsylvania |
| Pappas, George J. | University of Pennsylvania |
| Conti, Mauro | University of Padova |
| Pavone, Marco | Stanford University |
Keywords: AI-Enabled Robotics, Transfer Learning, Robot Safety
Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT boosts detection accuracy to very high levels (up to 100% in certain scenarios) under our evaluation protocol. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications. Additional materials are made available at: https://j-dapt.github.io.
|
| |
| 09:00-10:30, Paper TuI1I.273 | Add to My Program |
| A Safety-Aware Shared Autonomy Framework with BarrierIK Using Control Barrier Functions |
|
| Guler, Berk | Technical University of Darmstadt |
| Pompetzki, Kay | Technical University of Darmstadt |
| Sun, Yuanzheng | Technical University of Darmstadt |
| Manschitz, Simon | Honda Research Institute Europe |
| Peters, Jan | Technical University of Darmstadt |
Keywords: Telerobotics and Teleoperation
Abstract: Shared autonomy blends operator intent with autonomous assistance. In cluttered environments, linear blending can produce unsafe commands even when each source is individually collision-free. Many existing approaches model obstacle avoidance through potentials or cost terms, which only enforce safety as a soft constraint. In contrast, safety-critical control requires hard guarantees. We investigate the use of control barrier functions (CBFs) at the inverse kinematics (IK) layer of shared autonomy, targeting post-blend safety while preserving task performance. Our approach is evaluated in simulation on representative cluttered environments and in a VR teleoperation study comparing pure teleoperation with shared autonomy. Across conditions, employing CBFs at the IK layer reduces violation time and increases minimum clearance while maintaining task performance. In the user study, participants reported higher perceived safety and trust, lower interference, and an overall preference for shared autonomy with our safety filter. Additional materials available at https://berkguler.github.io/barrierik.
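As a rough illustration only (a generic single-constraint control-barrier-function filter in velocity space, not the paper's BarrierIK formulation), the post-blend safety idea can be sketched as:

```python
import numpy as np

def cbf_filter(u_des, h, grad_h, alpha=1.0):
    """Minimally modify a desired velocity command so the barrier
    condition grad_h . u + alpha * h >= 0 holds.  With a single
    constraint, the underlying QP has this closed-form solution."""
    a = grad_h @ u_des + alpha * h
    if a >= 0.0:
        return u_des                                   # command already safe
    return u_des - (a / (grad_h @ grad_h)) * grad_h    # project onto boundary

# Point robot commanded straight at an obstacle at the origin;
# h(x) = ||x||^2 - r^2 keeps it outside radius r.
x = np.array([1.0, 0.0])
r = 0.8
h, grad_h = x @ x - r**2, 2.0 * x
u_safe = cbf_filter(np.array([-1.0, 0.0]), h, grad_h)  # slowed to [-0.18, 0]
```

Only the unsafe component along the barrier gradient is removed; the rest of the blended command passes through unchanged, which is how such a filter can preserve task performance.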
|
| |
| 09:00-10:30, Paper TuI1I.274 | Add to My Program |
| Cross-View Exocentric and Egocentric Fusion for Robust Microsurgical Anastomosis Understanding |
|
| Liu, Yuxuan | Shanghai Jiao Tong University |
| Zhuge, Yuyang | Shanghai Key Laboratory of Flexible Medical Robotics, Tongren Hospital, Institute of Medical Robotics, Shanghai Jiao Tong University |
| Zhou, Xinyao | Shanghai Jiao Tong University |
| Luo, Yating | Shanghai Jiao Tong University |
| Luan, Yunfei | Shanghai Jiao Tong University |
| Guo, Yao | Shanghai Jiao Tong University |
| Yang, Guang-Zhong | Shanghai Jiao Tong University |
|
|
| |
| 09:00-10:30, Paper TuI1I.275 | Add to My Program |
| TOPO-Bench: An Open-Source Topological Mapping Evaluation Framework with Quantifiable Perceptual Aliasing |
|
| Wang, Jiaming | National University of Singapore |
| Chen, Jizhuo | National University of Singapore |
| Liu, Diwen | National University of Singapore |
| Da, Jiaxuan | National University of Singapore |
| Hu, Jiamo | National University of Singapore |
| Xue, Zhiwei | National University of Singapore |
| Kästner, Linh | National University of Singapore |
| Soh, Harold | National University of Singapore |
Keywords: Performance Evaluation and Benchmarking, Mapping, Localization
Abstract: Topological mapping offers a compact and robust representation for navigation, but progress in the field is hindered by the lack of standardized evaluation metrics, datasets, and protocols. Existing systems are evaluated in different environments under different criteria, preventing fair and reproducible comparison. Moreover, a key challenge, perceptual aliasing, remains under-quantified despite its strong influence on system performance. We address these gaps by (i) formalizing topological consistency as the fundamental property of topological maps and showing that, under mild assumptions, localization accuracy provides an efficient and interpretable surrogate metric, and (ii) introducing, to our knowledge, the first quantitative measure of dataset ambiguity for fair comparison across environments. To support this protocol, we curate a diverse benchmark dataset with calibrated ambiguity levels, implement and release deep learning-based baseline systems, and evaluate them alongside classical methods. Our experiments provide new insights into the limitations of current approaches under perceptual aliasing. All datasets, baselines, and evaluation tools are publicly released to foster consistent and reproducible research in topological mapping.
|
| |
| 09:00-10:30, Paper TuI1I.276 | Add to My Program |
| Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation |
|
| Carlesso, Hugo | CNRS, IRIT |
| Mothe, Josiane | Univ of Toulouse, CLLE |
| Ionescu, Radu | University of Bucharest |
Keywords: Object Detection, Segmentation and Categorization, AI-Based Methods, Aerial Systems: Applications
Abstract: Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, making it indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data difficulty during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, making it particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000× lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at: https://github.com/hugocarlesso/CMTSSL.
|
| |
| 09:00-10:30, Paper TuI1I.277 | Add to My Program |
| Efficient Event Camera Volume System |
|
| Soto, Juan | Purdue University |
| Noronha, Ian | Purdue University |
| Bharti, Saru | Purdue University |
| Kaur, Upinder | Purdue University |
Keywords: Visual Learning, Representation Learning, Software Tools for Benchmarking and Reproducibility
Abstract: Event cameras promise low latency and high dynamic range, yet their sparse output challenges integration into standard robotic pipelines. We introduce EECVS (Efficient Event Camera Volume System), a novel framework that models event streams as continuous-time Dirac impulse trains, enabling artifact-free compression through direct transform evaluation at event timestamps. Our key innovation combines density-driven adaptive selection among DCT, DTFT, and DWT transforms with transform-specific coefficient pruning strategies tailored to each domain's sparsity characteristics. The framework eliminates temporal binning artifacts while automatically adapting compression strategies based on real-time event density analysis. On EHPT-XC and MVSEC datasets, our framework achieves superior reconstruction fidelity, with DTFT delivering the lowest earth mover's distance. In downstream segmentation tasks, EECVS demonstrates robust cross-dataset generalization: when evaluated with EventSAM segmentation, EECVS achieves a mean IoU of 0.87 on MVSEC versus 0.44 for voxel grids at 24 channels, while remaining competitive on EHPT-XC. Our ROS2 implementation provides real-time deployment, with DCT processing achieving 1.5 ms latency and 2.7× higher throughput than alternative transforms, establishing the first adaptive event compression framework that maintains both computational efficiency and superior generalization across diverse robotic scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.278 | Add to My Program |
| A Model-Based Framework for Assessing Operator Performance in Navigational Bronchoscopy |
|
| Deng, Zhaoxing | Informatics, University of Edinburgh |
| Hanley, David | DePaul University |
| Zhang, Francis Xiatian | University of Edinburgh |
| Dhaliwal, Kev | University of Edinburgh, Center for Inflammation Research |
| Khadem, Mohsen | University of Edinburgh |
Keywords: Medical Robots and Systems, Integrated Planning and Control
Abstract: Bronchoscopy is a critical procedure for diagnosing and treating pulmonary diseases, but its safe and effective execution demands substantial operator training. Insufficient experience is associated with higher complication rates, including bleeding, pneumothorax, and bronchospasm. Existing assessment tools provide structured evaluations, yet they remain heavily reliant on subjective expert judgment and limited sensory feedback. To address this limitation, we propose a model-based framework for objective performance evaluation in navigational bronchoscopy. Our approach leverages pose data from electromagnetic (EM) trackers, routinely used in clinical navigation, and embeds nonholonomic kinematic constraints that characterize expert-like trajectories. Using the model and Model Predictive Path Integral (MPPI) control, we generate optimal reference trajectories and define error metrics that quantify deviations between operator-executed and model-predicted motions. We hypothesize that these deviations provide robust discriminative features for distinguishing between expert and novice performance. Experiments on a phantom lung dataset comprising 11 operators and 98 procedures demonstrate that the proposed metrics significantly separate skill levels, enabling the construction of an effective classifier for operator proficiency. This framework offers an interpretable, data-driven alternative to supervisor-dependent assessments and represents a step toward scalable, objective skill evaluation and transfer in bronchoscopy training and robotic platforms.
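For readers unfamiliar with MPPI, the sampling-based update can be sketched on a toy 1-D problem (a generic illustration with a made-up dynamics and cost, not the paper's bronchoscope model):

```python
import numpy as np

def mppi_step(x0, dynamics, cost, u_nom, rng, n_samples=256, sigma=0.5, lam=0.5):
    """One MPPI update: sample noisy control sequences around the
    nominal one, roll each out, and return the softmax-weighted average."""
    H = len(u_nom)
    eps = rng.normal(0.0, sigma, size=(n_samples, H))
    costs = np.empty(n_samples)
    for k in range(n_samples):
        x, c = x0, 0.0
        for t in range(H):
            x = dynamics(x, u_nom[t] + eps[k, t])
            c += cost(x)
        costs[k] = c
    w = np.exp(-(costs - costs.min()) / lam)   # low-cost rollouts dominate
    return u_nom + (w / w.sum()) @ eps

# Toy problem: steer x toward 1.0 under x_{t+1} = x_t + 0.1 * u_t.
dyn = lambda x, u: x + 0.1 * u
cost = lambda x: (x - 1.0) ** 2
rng = np.random.default_rng(0)
u = np.zeros(10)
for _ in range(50):
    u = mppi_step(0.0, dyn, cost, u, rng)
```

Because the update needs only forward rollouts, MPPI handles non-differentiable costs and constraints, which suits models like the nonholonomic one described above.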
|
| |
| 09:00-10:30, Paper TuI1I.279 | Add to My Program |
| A Case Study on LLM-Guided Reinforcement Learning for Decentralized Autonomous Driving |
|
| Anvar, Timur | University of Virginia |
| Chen, Jeffrey | University of Virginia |
| Wang, Yuyan | University of Virginia |
| Chandra, Rohan | University of Virginia |
Keywords: Intelligent Transportation Systems
Abstract: Autonomous vehicle navigation in complex environments such as dense and fast-moving highways and merging scenarios remains an active area of research. In the past decade, many planning and control approaches have used reinforcement learning (RL) with notable success. However, a key limitation of RL is its reliance on well-specified reward functions, which often fail to capture the full semantic and social complexity of diverse, out-of-distribution situations. As a result, a rapidly growing line of research explores using Large Language Models (LLMs) to replace or supplement RL for direct planning and control, on account of their ability to reason about rich semantic context. However, LLMs present significant drawbacks: they can be unstable in zero-shot safety-critical settings, produce inconsistent outputs, and often depend on expensive API calls with network latency. This motivates our investigation into whether small, locally deployed LLMs (≤14B parameters) can meaningfully support autonomous highway driving through reward shaping rather than direct control. These models are attractive for practical deployment as they can run on a single GPU and avoid external API dependencies. We present a case study comparing RL-only, LLM-only, and hybrid approaches, where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time. Our findings reveal that RL-only agents achieve moderate success rates (73-89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. Critically, despite explicit efficiency instructions, LLM-influenced approaches exhibit systematic conservative bias with substantial model-dependent variability, highlighting important limitations of current small LLMs for safety-critical control tasks.
|
| |
| 09:00-10:30, Paper TuI1I.280 | Add to My Program |
| MTE-SLAM: Multi-Tier Feature Fusion for Efficient Neural Semantic SLAM |
|
| Lu, Danqi | Shenzhen University |
| Huang, Changxin | Shenzhen University |
| Chen, Zhuangzhuang | Shenzhen University |
| Lin, Zhiliang | Shenzhen University |
| Li, Dachong | Shenzhen University |
| Chang, Yanbin | Shenzhen University |
| Li, Jianqiang | Shenzhen University |
Keywords: SLAM
Abstract: Neural implicit representations have demonstrated excellent performance in Simultaneous Localization and Mapping (SLAM) by virtue of their ability to jointly model geometry, color and camera poses. Recent studies have attempted to integrate scene semantic information into implicit representation frameworks, significantly improving environmental understanding. Nevertheless, most existing methods rely on direct semantic coloring or coarse fusion of other modalities, leaving semantic cues underutilized. This further causes problems such as blurred small objects, loss of fine structures and unclear regional boundaries. Additionally, redundant features introduced in the process reduce system efficiency. To address these challenges, we propose MTE-SLAM, an accurate and efficient end-to-end neural RGB-D semantic SLAM framework that synergizes Multi-Tier Feature Fusion (MTFF) and a Feature Redundancy Suppressor (FRS). MTFF progressively fuses semantic features at global and local scales. The global context enhancement module captures scene-level semantic correlations, while the local continuity enhancement module refines neighborhood consistency, generating detailed and coherent semantic maps. FRS adaptively filters redundant features based on their importance and temporal variation, reducing parameters and computation while preserving representational power to accelerate training and inference. Comprehensive evaluations on Replica and ScanNet demonstrate that MTE-SLAM achieves centimeter-level reconstruction, state-of-the-art tracking and semantic accuracy, and runs up to four times faster than existing semantic SLAM systems.
|
| |
| 09:00-10:30, Paper TuI1I.281 | Add to My Program |
| Surgical Video Understanding with Label Interpolation |
|
| Kim, Garam | Korea Institute of Science and Technology |
| Jeong, Tae Kyeong | Korea Institute of Science and Technology |
| Park, Juyoun | Korea Institute of Science and Technology |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal–spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow–based segmentation label interpolation with multi-task learning. Optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.
|
| |
| 09:00-10:30, Paper TuI1I.282 | Add to My Program |
| Spectral Security Imaging System (SSIS): Optical Authenticity for Hyperspectral Pushbroom Imaging |
|
| Gomez Toloza, Pablo Andres | Universidad Industrial De Santander |
| Quiroga Torres, Javier Andres | Universidad Industrial De Santander (Bucaramanga, Santander) |
| Garcia, Hans | Universidad Industrial De Santander |
| Arce, Gonzalo | University of Delaware |
| Arguello Fuentes, Henry | Universidad Industrial De Santander |
Keywords: Calibration and Identification, Robotics and Automation in Agriculture and Forestry, Environment Monitoring and Management
Abstract: Ensuring authenticity of hyperspectral imagery (HSI) at the moment of acquisition is critical: subtle spectral attacks can mislead downstream analysis before digital defenses take effect. By injecting the optical key before digitization, the effect is created in hardware and cannot be replicated in software, resulting in stronger protection. We present the Spectral Security Imaging System (SSIS), an acquisition-stage approach that injects a data-aware additive spectral key in the optical path of a pushbroom imager, binding integrity to the measurement while preserving the class-informative structure. We describe the complete system forward model, detector–key joint optimization, and a laboratory prototype together with a thorough calibration process. A laboratory dataset (unsigned and optically signed cubes) supports two evaluations. For manipulation detection, SSIS achieves detection accuracy of 92% with visual distortion of PSNR = 41.5 dB and SSIM = 0.981. For downstream classification on clean data, macro-F1 remains close to the unsigned ceiling, about 93% of the monochromator baseline (0.915 vs 0.981) and about 99% of the pushbroom baseline (0.892 vs 0.903), while outperforming multiplicative and watermarking baselines by up to 16.1 points in macro-F1 and 19.4 points in accuracy.
|
| |
| 09:00-10:30, Paper TuI1I.283 | Add to My Program |
| A Novel Soft Gripper Design Integrating a Unilateral Fingernail-Like Mechanism for Grasping Flat Object |
|
| Ma, Wanyu | The Hong Kong Polytechnic University |
| Ren, Xuyang | Scuola Superiore Sant’Anna |
| Li, Zheng | The Chinese University of Hong Kong |
Keywords: Grippers and Other End-Effectors, Grasping, Task Planning
Abstract: The grasping capabilities of robotic arms have been extensively studied by researchers in terms of accuracy, flexibility, and versatility, enabling robots to perform various tasks in domestic, industrial, and medical scenarios. However, grasping flat objects has remained a significant challenge and is often overlooked as a limiting case in robotic manipulation. To address this highly difficult task, this paper proposes a novel gripper design that combines a pneumatic soft gripper with a unilateral fingernail structure. We evaluate two grasping strategies and corresponding target generation methods tailored for this design. The proposed system significantly improves the success rate of stably grasping flat objects that lie flush on tables. Moreover, its soft interaction with the table surface reduces the need for highly precise object-table height detection, thereby saving computational time and cost. Finally, we conduct experimental tests on various common flat objects, validating the effectiveness of both the gripper design and the grasping strategies.
|
| |
| 09:00-10:30, Paper TuI1I.284 | Add to My Program |
| DreamFlow: Local Navigation Beyond Observation Via Conditional Flow Matching in the Latent Space |
|
| Park, Jiwon | KAIST |
| Lee, Dongkyu | KAIST |
| Nahrendra, I Made Aswin | KRAFTON |
| Lim, Jaeyoung | University of California, Berkeley |
| Myung, Hyun | KAIST |
Keywords: Integrated Planning and Learning, Planning under Uncertainty, Collision Avoidance
Abstract: Local navigation in cluttered environments often suffers from dense obstacles and frequent local minima. Conventional local planners rely on heuristics and are prone to failure, while deep reinforcement learning (DRL)-based approaches provide adaptability but are constrained by limited onboard sensing. These limitations lead to navigation failures because the robot cannot perceive structures outside its field of view. In this paper, we propose DreamFlow, a DRL-based local navigation framework that extends the robot’s perceptual horizon through conditional flow matching (CFM). The proposed CFM-based prediction module learns a probabilistic mapping between the local height map latent representation and a broader spatial representation conditioned on the navigation context. This enables the navigation policy to predict unobserved environmental features and proactively avoid potential local minima. Experimental results demonstrated that DreamFlow outperforms existing methods in terms of latent prediction accuracy and navigation performance in simulation. The proposed method was further validated in cluttered real-world environments with a quadrupedal robot. The project page is available at https://dreamflow-icra.github.io.
|
| |
| 09:00-10:30, Paper TuI1I.285 | Add to My Program |
| Coupling Tensor Trains with Graph of Convex Sets: Effective Compression, Exploration, and Planning in the C-Space |
|
| Reinerth, Gerhard | Technical University of Munich |
| Laha, Riddhiman | Technical University of Munich |
| Romano, Marcello | Technical University of Munich |
Keywords: Motion and Path Planning, Optimization and Optimal Control, Computational Geometry
Abstract: We present TANGO (Tensor ANd Graph Optimization), a novel motion planning framework that integrates tensor-based compression with structured graph optimization to enable efficient and scalable trajectory generation. While optimization-based planners such as the Graph of Convex Sets (GCS) offer powerful tools for generating smooth, optimal trajectories, they typically rely on a predefined convex characterization of the high-dimensional configuration space—a requirement that is often intractable for general robotic tasks. TANGO builds on this by using Tensor Train decomposition to approximate the feasible configuration space in a compressed form, enabling rapid discovery and estimation of task-relevant regions. These regions are then embedded into a GCS-like structure, allowing for geometry-aware motion planning that respects both system constraints and environmental complexity. By coupling tensor-based compression with structured graph reasoning, TANGO enables efficient, geometry-aware motion planning and lays the groundwork for more expressive and scalable representations of configuration space in future robotic systems. Rigorous simulation studies on planar and real robots reinforce our claims of effective compression and higher quality trajectories.
|
| |
| 09:00-10:30, Paper TuI1I.286 | Add to My Program |
| SSMG-Nav: Enhancing Lifelong Object Navigation with Semantic Skeleton Memory Graph |
|
| Niu, Haochen | Shanghai Jiao Tong University |
| Zhang, Lantao | Shanghai Jiao Tong University |
| Ji, Xingwu | Shanghai Jiao Tong University |
| Ying, Rendong | Shanghai Jiao Tong University |
| Liu, Peilin | Shanghai Jiao Tong University |
| Wen, Fei | Shanghai Jiao Tong University |
Keywords: Vision-Based Navigation, Task Planning
Abstract: Navigating to out-of-sight targets from human instructions in unfamiliar environments is a core capability for service robots. Despite substantial progress, most approaches underutilize reusable, persistent memory, constraining performance in lifelong settings. Many are additionally limited to single-modality inputs and employ myopic greedy policies, which often induce inefficient back-and-forth maneuvers (BFMs). To address such limitations, we introduce SSMG-Nav, a framework for object navigation built on a Semantic Skeleton Memory Graph (SSMG) that consolidates past observations into a spatially aligned, persistent memory anchored by topological keypoints (e.g., junctions, room centers). SSMG clusters nearby entities into subgraphs, unifying entity- and space-level semantics to yield a compact set of candidate destinations. To support multimodal targets (images, objects, and text), we integrate a vision-language model (VLM). For each subgraph, a multimodal prompt synthesized from memory guides the VLM to infer a target belief over destinations. A long-horizon planner then trades off this belief against traversability costs to produce a visit sequence that minimizes expected path length, thereby reducing backtracking. Extensive experiments on challenging lifelong benchmarks and standard ObjectNav benchmarks demonstrate that, compared to strong baselines, our method achieves higher success rates and greater path efficiency, validating the effectiveness of SSMG-Nav.
|
| |
| 09:00-10:30, Paper TuI1I.287 | Add to My Program |
| Fluoroscopy-Constrained Magnetic Robot Control Via Zernike-Based Field Modeling and Nonlinear MPC |
|
| Chen, Xinhao | Johns Hopkins University |
| Yao, Hongkun | Johns Hopkins University |
| Bhattacharjee, Anuruddha | Johns Hopkins University |
| Raval, Suraj | University of Maryland, College Park |
| Mair, Lamar | Weinberg Medical Physics, Inc |
| Diaz-Mercado, Yancy | University of Maryland |
| Krieger, Axel | Johns Hopkins University |
Keywords: Medical Robots and Systems, Motion Control, Micro/Nano Robots
Abstract: Magnetic actuation enables surgical robots to navigate complex anatomical pathways while reducing tissue trauma and improving surgical precision. However, clinical deployment is limited by the challenges of controlling such systems under fluoroscopic imaging, which provides low framerate and noisy pose feedback. This paper presents a control framework that remains accurate and stable under such conditions by combining a nonlinear model predictive control (NMPC) framework that directly outputs coil currents, an analytically differentiable magnetic field model based on Zernike polynomials, and a Kalman filter to estimate the robot state. Experimental validation is conducted with two magnetic robots in a 3D-printed fluid workspace and a spine phantom replicating drug delivery in the epidural space. Results show the proposed control method remains highly accurate when feedback is downsampled to 3 Hz with added Gaussian noise (σ = 2), mimicking clinical fluoroscopy. In the spine phantom experiments, the proposed method successfully executed a drug delivery trajectory with a root mean square (RMS) position error of 1.18 mm while maintaining safe clearance from critical anatomical boundaries.
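The low-rate, noisy feedback setting described above is exactly where a Kalman filter helps; as a minimal illustration (a generic 1-D constant-velocity model, not the paper's magnetic-robot state estimator), one predict/update cycle looks like:

```python
import numpy as np

def kf_step(x, P, z, dt, q=1e-3, r=4.0):
    """One predict/update cycle of a constant-velocity Kalman filter.
    State x = [position, velocity]; z is a noisy position measurement
    with variance r; q scales the white-noise acceleration model."""
    F = np.array([[1.0, dt], [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update
    S = H @ P @ H.T + r
    K = P @ H.T / S
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Demo: a target moving at 1 unit/s, observed at 3 Hz with sigma = 2
# position noise (roughly the downsampled setting described above).
rng = np.random.default_rng(1)
dt = 1.0 / 3.0
x_est, P = np.zeros(2), 10.0 * np.eye(2)
for k in range(1, 61):
    z = k * dt + rng.normal(0.0, 2.0)
    x_est, P = kf_step(x_est, P, z, dt)
```

Between the sparse measurements, the predict step alone can be run at the controller rate, which is what lets an NMPC loop remain stable despite 3 Hz feedback.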
|
| |
| 09:00-10:30, Paper TuI1I.288 | Add to My Program |
| 3DME: Dual-Branch Encoder with Progressive Masking for 3D Medical Foundation Encoding Model |
|
| Yuan, Hengyi | Qingdao University |
| Cheng, Zesheng | Qingdao University |
| Chen, Huiru | Qingdao University |
| Shixuan, Wang | No.1 Middle School of Weifang |
Keywords: Visual Learning, Deep Learning for Visual Perception, Computer Vision for Medical Robotics
Abstract: Three-dimensional (3D) medical image analysis faces challenges such as massive data volume, difficulty in integrating cross-slice information, and limited model generalization. This paper proposes 3DME, a foundational model for 3D medical imaging. Its core innovations feature a dual-branch 3D encoder that integrates a Vision Transformer for modeling global long-range dependencies and a 3D graph convolutional network for capturing local voxel structures, enhanced by multi-level deformable attention for cross-planar correlation; a progressive volumetric masking strategy for self-supervised pretraining, which dynamically adjusts masking ratios and block sizes to force the model to learn cross-slice continuity and global semantics; and a unified foundation model framework supporting lightweight adaptation for downstream tasks. Experiments demonstrate that 3DME achieves state-of-the-art (SOTA) performance on 12 segmentation and classification tasks, exhibiting strong zero-shot transfer capabilities, thereby significantly enhancing model generalization and clinical deployment efficiency.
|
| |
| 09:00-10:30, Paper TuI1I.289 | Add to My Program |
| Moment Latent Reinforcement Learning for Pattern Control in Swarm Robotic Systems |
|
| Zhang, Wei | Washington University in St. Louis |
| Quan, Haoyu | Washington University in St. Louis |
| Li, Jr-Shin | Washington University in St. Louis |
Keywords: Swarm Robotics, Optimization and Optimal Control, Reinforcement Learning
Abstract: Targeted coordination of swarm robotic systems is an emerging robot control task arising from numerous applications across diverse domains, ranging from medicine and agriculture to cyber-physical systems. However, state-of-the-art control techniques for robot swarms often require comprehensive measurement data for each robot and are not scalable with the growth of the swarm size. To address these issues, in this work, we develop a latent space control architecture for robust manipulation of patterns in arbitrarily large, potentially infinite, robot swarms using only partial measurements. In particular, we model such a swarm as a parameterized control system and formulate its patterns in terms of probability distributions. We then develop a moment kernel transform, which generates a reduced latent space representation for the pattern dynamics of the robot swarm over a reproducing kernel Hilbert space. The moment representation of the robot swarm can be learned using partial measurements of the swarm. Building on this, we propose a reinforcement learning (RL)-based pattern control framework operating on the moment latent space. In this framework, the data is organized to flow between the workspace and moment latent space episodically to achieve both robust control performance and high training efficiency. The proposed moment latent RL framework is validated by various pattern control tasks involving wheeled robot swarms, using both numerical simulations and TurtleBot3 swarms in the Gazebo simulator.
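As a simplified illustration of representing a swarm pattern by moments of its empirical distribution (plain raw moments here, not the paper's moment kernel transform over a reproducing kernel Hilbert space):

```python
import numpy as np

def moment_features(positions, order=3):
    """Raw moments E[x^i y^j], i + j <= order, of the empirical
    distribution of 2-D robot positions: a compact descriptor of the
    swarm pattern whose size is independent of the number of robots."""
    x, y = positions[:, 0], positions[:, 1]
    return np.array([np.mean(x**i * y**j)
                     for i in range(order + 1)
                     for j in range(order + 1 - i)])

# A unit ring of 100 robots: zero first moments, second moments of 0.5.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
ring = np.stack([np.cos(theta), np.sin(theta)], axis=1)
feats = moment_features(ring)
```

Because the descriptor depends only on the distribution, it remains the same dimension for ten robots or ten thousand, which is the scalability property the abstract emphasizes.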
|
| |
| 09:00-10:30, Paper TuI1I.290 | Add to My Program |
| SE(3)-LIO: Smooth IMU Propagation with Jointly Distributed Poses on SE(3) Manifold for Accurate and Robust LiDAR-Inertial Odometry |
|
| Shin, Gunhee | KAIST |
| Lee, Seungjae | Korea Advanced Institute of Science and Technology |
| Kong, Jei | KAIST |
| Seo, Young-Woo | Carnegie Mellon University |
| Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: SLAM, Localization, Mapping
Abstract: In estimating odometry accurately, an inertial measurement unit (IMU) is widely used owing to its high-rate measurements, which can be utilized to obtain motion information through IMU propagation. In this paper, we address the limitations of existing IMU propagation methods in terms of motion prediction and motion compensation. In motion prediction, the existing methods typically represent a 6-DoF pose by separating rotation and translation and propagate them on their respective manifolds, such that the rotational variation is not effectively incorporated into translation propagation. During motion compensation, the relative transformation between predicted poses is used to compensate for motion-induced distortion in other measurements, while inherent errors in the predicted poses introduce uncertainty in the relative transformation. To tackle these challenges, we represent and propagate the pose on the SE(3) manifold, where the propagated translation properly accounts for rotational variation. Furthermore, we precisely characterize the relative transformation uncertainty by considering the correlation between predicted poses, and incorporate this uncertainty into the measurement noise during motion compensation. To this end, we propose a LiDAR-inertial odometry (LIO), referred to as SE(3)-LIO, that integrates the proposed IMU propagation and uncertainty-aware motion compensation (UAMC). We validate the effectiveness of SE(3)-LIO on diverse datasets. Our source code and additional material are available at: https://se3-lio.github.io/.
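To make the rotation–translation coupling concrete: propagating with the closed-form SE(3) exponential makes the translation follow the rotational arc, whereas a separated update translates in a straight line. A minimal numpy sketch (the generic exponential map only, not the paper's propagation or uncertainty model):

```python
import numpy as np

def hat(w):
    """so(3) hat operator: maps a 3-vector to a skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(omega, v):
    """Closed-form exponential map from a twist (omega, v) to SE(3)."""
    K, th = hat(omega), np.linalg.norm(omega)
    if th < 1e-9:
        R, V = np.eye(3), np.eye(3)
    else:
        A = np.sin(th) / th
        B = (1.0 - np.cos(th)) / th**2
        C = (th - np.sin(th)) / th**3
        R = np.eye(3) + A * K + B * (K @ K)
        V = np.eye(3) + B * K + C * (K @ K)  # couples translation to rotation
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def propagate(T, omega, v, dt):
    """One IMU-style propagation step on the SE(3) manifold."""
    return T @ se3_exp(omega * dt, v * dt)

# Turning at pi/2 rad/s while moving forward at 1 m/s for 1 s traces a
# circular arc of radius 2/pi, ending at (2/pi, 2/pi); a separated
# rotation/translation update would instead place the robot at (1, 0).
T = propagate(np.eye(4),
              np.array([0.0, 0.0, np.pi / 2]),
              np.array([1.0, 0.0, 0.0]), 1.0)
```

The V matrix is what distinguishes SE(3) propagation from independently integrating rotation on SO(3) and translation in R^3.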
|
| |
| 09:00-10:30, Paper TuI1I.291 | Add to My Program |
| Sym-Servo: Disambiguate Symmetric Object Pose by End-To-End Optimal Visual Servo |
|
| Li, Shuxin | Zhejiang University |
| Chen, Anzhe | Zhejiang University |
| Lu, Haojian | Zhejiang University |
| Xiong, Rong | Zhejiang University |
| Wang, Yue | Zhejiang University |
Keywords: Service Robotics, Domestic Robotics, Assembly
Abstract: Controlling symmetric objects is an indispensable but challenging task in robotic manipulation. Mainstream perception-action frameworks rely on accurate 6D pose estimation to guide the controller. However, the majority of existing 6D pose estimation methods for symmetric objects are designed to output a single pose, which can flicker between multiple equivalent solutions across consecutive frames, leading to instability in the control loop. While some approaches can output multiple hypotheses to represent the ambiguity, these methods generally cannot achieve a model-free design and strong generalization simultaneously. In this paper, we reformulate the problem from a multi-solution task in pose space into an end-to-end visual servo task that admits a unique optimal solution. We propose Sym-Servo, a visual servo framework that uses a joint learning mechanism in which a deterministic policy is trained with a diffusion-based generator to encourage the shared vision encoder to learn a symmetry-aware representation; the policy is then refined via reinforcement and self-imitation learning to produce an efficient and stable final policy. We validate Sym-Servo with simulations and real-world experiments, demonstrating its efficiency and generalization in controlling symmetric objects in a model-free manner.
|
| |
| 09:00-10:30, Paper TuI1I.292 | Add to My Program |
| MiniBEE: A New Form Factor for Compact Bimanual Dexterity |
|
| Islam, Sharfin | Columbia University |
| Chen, Zewen | Columbia University |
| He, Zhanpeng | Columbia University |
| Bhatt, Swapneel | Columbia University |
| Permuy, Andres | Columbia University |
| Taylor, Brock | Columbia University |
| Vickery, James | Columbia University |
| Lu, Zhengbin | Columbia University |
| Zhang, Cheng | Columbia University |
| Piacenza, Pedro | Samsung Research America |
| Ciocarlie, Matei | Columbia University |
Keywords: Bimanual Manipulation, Dexterous Manipulation, Grippers and Other End-Effectors
Abstract: Bimanual robot manipulators can achieve impressive dexterity, but typically rely on two full six- or seven-degree-of-freedom arms so that paired grippers can coordinate effectively. This traditional framework increases system complexity and footprint while only exploiting a fraction of the overall workspace for dexterous interaction. We introduce MiniBEE, a compact system in which two reduced-mobility arms (3+ DOF each) are coupled into a kinematic chain that preserves full relative positioning between grippers and enables the system's entire workspace to be used for dexterity. To guide our design, we formulate a kinematic dexterity metric that enlarges the dexterous workspace while keeping the mechanism lightweight and wearable. The resulting system supports two complementary modes: (i) wearable kinesthetic data collection with self-tracked gripper poses, and (ii) deployment on a standard robot arm, extending dexterity across its entire workspace. We present kinematic analysis and design optimization methods for maximizing dexterous range, and demonstrate an end-to-end pipeline in which wearable demonstrations train imitation learning policies that perform robust, real-world bimanual manipulation.
|
| |
| 09:00-10:30, Paper TuI1I.293 | Add to My Program |
| MA3DSG: Multi-Agent 3D Scene Graph Generation for Large-Scale Indoor Environments |
|
| Kim, Yirum | Gwangju Institute of Science and Technology (GIST) |
| Kim, Jaewoo | Gwangju Institute of Science and Technology (GIST) |
| Kim, Ue-Hwan | Gwangju Institute of Science and Technology (GIST) |
Keywords: Semantic Scene Understanding, Multi-Robot Systems
Abstract: Current 3D scene graph generation (3DSGG) approaches heavily rely on a single-agent assumption and small-scale environments, exhibiting limited scalability to real-world scenarios. In this work, we introduce the Multi-Agent 3D Scene Graph Generation (MA3DSG) model, the first framework designed to tackle this scalability challenge using multiple agents. We develop a training-free graph alignment algorithm that efficiently merges partial query graphs from individual agents into a unified global scene graph. Leveraging extensive analysis and empirical insights, our approach enables conventional single-agent systems to operate collaboratively without requiring any learnable parameters. To rigorously evaluate 3DSGG performance, we propose MA3DSG-Bench—a benchmark that supports diverse agent configurations, domain sizes, and environmental conditions—providing a more general and extensible evaluation framework. This work lays a solid foundation for scalable, multi-agent 3DSGG research.
|
| |
| 09:00-10:30, Paper TuI1I.294 | Add to My Program |
| COMPASS: Confined-Space Manipulation Planning with Active Sensing Strategy |
|
| Li, Qixuan | Tsinghua University |
| Le, Chen | Tsinghua University |
| Huang, Dongyue | Nanyang Technological University |
| Yu, Jincheng | Tsinghua University |
| Chen, Xinlei | Tsinghua University |
Keywords: Perception for Grasping and Manipulation, Motion and Path Planning, Manipulation Planning
Abstract: Manipulation in confined and cluttered environments remains a significant challenge due to partial observability and complex configuration spaces. Effective manipulation in such environments requires an intelligent exploration strategy to safely understand the scene and search for the target. In this paper, we propose COMPASS, a multi-stage exploration and manipulation framework featuring a manipulation-aware sampling-based planner. First, we reduce collision risks with a near-field awareness scan that builds a local collision map. Additionally, we employ a multi-objective utility function to find viewpoints that are both informative and conducive to subsequent manipulation. Moreover, we apply a constrained manipulation optimization strategy to generate manipulation poses that respect obstacle constraints. To systematically evaluate the method's performance under these difficulties, we propose a benchmark for confined-space exploration and manipulation containing four levels of challenging scenarios. Compared to exploration methods designed for other robots that consider only information gain, our framework increases the manipulation success rate by 24.25% in simulation. Real-world experiments demonstrate our method's capability for active sensing and manipulation in confined environments.
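A multi-objective viewpoint utility of the kind the abstract describes can be sketched as a weighted score over candidate views. The field names, terms, and weights below are hypothetical illustrations, not the paper's actual utility function.

```python
def viewpoint_utility(view, w_info=1.0, w_manip=0.5, w_travel=0.2):
    """Score a candidate viewpoint: informative AND conducive to manipulation.

    Hypothetical terms: expected newly observed volume (info_gain),
    reachability of likely grasp poses from this view (manipulability),
    and motion cost to reach the viewpoint (travel_cost).
    """
    return (w_info * view["info_gain"]
            + w_manip * view["manipulability"]
            - w_travel * view["travel_cost"])

# Toy candidates: view 0 sees a lot but leaves the arm poorly placed;
# view 1 balances information with manipulation readiness.
candidates = [
    {"id": 0, "info_gain": 0.9, "manipulability": 0.1, "travel_cost": 0.3},
    {"id": 1, "info_gain": 0.6, "manipulability": 0.8, "travel_cost": 0.2},
    {"id": 2, "info_gain": 0.2, "manipulability": 0.9, "travel_cost": 0.1},
]
best = max(candidates, key=viewpoint_utility)
```

The design point is that a pure information-gain criterion would pick view 0, whereas weighting manipulability steers exploration toward views from which the subsequent grasp is feasible.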
|
| |
| 09:00-10:30, Paper TuI1I.295 | Add to My Program |
| FASIONAD: Adaptive Uncertainty-Gated Fast–Slow Fusion Framework for Safe Autonomous Driving |
|
| Luo, Ziang | Tsinghua University |
| Jiang, Sicong | McGill University |
| Qian, Kangan | Tsinghua University |
| Huang, Zilin | University of Wisconsin-Madison |
| Zhu, Tianze | Tsinghua University |
| Jiao, Siwen | National University of Singapore |
| Miao, Jinyu | Tsinghua University |
| Fu, Zheng | Tsinghua University |
| Zhong, Yang | Xiaomi |
| Wang, Yunlong | NIO Inc |
| Ye, Hao | JiangSu University |
| Yang, Mengmeng | Tsinghua University |
| Jiang, Kun | Tsinghua University |
| Yang, Diange | Tsinghua University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Integrated Planning and Learning
Abstract: Previous fast–slow system architectures demonstrated that pairing a reactive end-to-end (E2E) planner with a deliberative vision-language model (VLM) can address long-tail scenarios. However, dual-system models that query the slow module at fixed intervals are computationally inefficient and introduce unnecessary latency during normal operation. To bridge this gap, we introduce FASIONAD, an adaptive fast–slow framework for autonomous driving that selectively integrates E2E planning and VLM reasoning. A lightweight fast planner manages general control, while a slow reasoner is activated only when a Laplace-based uncertainty gate detects elevated uncertainty. Rather than overriding control, the VLM provides concise planning states and high-level plans. These inform the planner through an information bottleneck and high-level action guidance, enhancing interpretability and safety. Evaluated on the nuScenes, Bench2Drive, and CARLA Town05 closed-loop benchmarks, FASIONAD lowers the average trajectory error by 6.7% and the collision rate by 28.1% compared with strong E2E baselines, while also markedly reducing computational overhead relative to always-on fast–slow dual systems. These results demonstrate that adaptive fast–slow fusion is a practical route to safer, more reliable, and more efficient autonomous driving.
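A minimal sketch of such an uncertainty gate, assuming the fast planner predicts a Laplace scale parameter b per future waypoint; the mean aggregation and the 0.5 threshold are hypothetical choices, not the paper's gating rule.

```python
def should_invoke_slow_module(pred_scales, threshold=0.5):
    """Uncertainty gate sketch.

    Assumption: the fast planner's trajectory head outputs a Laplace scale b
    for each predicted waypoint. A large mean scale signals low confidence in
    the fast plan, so the slow VLM reasoner is queried; otherwise the fast
    path runs alone, avoiding fixed-interval VLM latency.
    """
    mean_scale = sum(pred_scales) / len(pred_scales)
    return mean_scale > threshold

# Nominal driving: tight Laplace scales -> fast path only.
gate_nominal = should_invoke_slow_module([0.10, 0.12, 0.09])
# Long-tail scene: inflated scales -> activate slow VLM reasoning.
gate_longtail = should_invoke_slow_module([0.80, 1.10, 0.90])
```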
|
| |
| 09:00-10:30, Paper TuI1I.296 | Add to My Program |
| Best-View Pedicel Localization with YOLO-DSC for Calyx-Preserving Robotic Harvesting of Cherry Tomatoes |
|
| Liana, Verianti | National Taiwan University |
| Zuo, Hao Cheng | National Taiwan University |
| Hsieh, Yun-Chi | Department of Biomechatronics Engineering National Taiwan University |
| Yen, Ping-Lang | National Taiwan University |
Keywords: Robotics and Automation in Agriculture and Forestry, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Robotic harvesting of cherry tomatoes remains challenging due to dense foliage, asynchronous ripening, and the strict market requirement for calyx-preserving cuts. The calyx frequently occludes the pedicel, making precise localization indispensable. In 640 × 480 images, pedicels span only 7–32 pixels, where even minor errors can lead to miscutting the calyx. To address this challenge, we apply YOLO-DSC to localize pedicels across dynamic frames as the arm-mounted camera moves during the best-view search. This strategy maximizes the visible pedicel length, exposing it perpendicularly to the camera and ensuring clear separation from the calyx, while null-data training suppresses false positives from distractors such as leaves, stems, and calyces. In 15 autonomous trials along a 28 m greenhouse row, YOLO-DSC achieved the lowest pedicel localization errors, outperforming the YOLO baseline model (significant at p < 0.05). This improvement directly translated into higher harvesting success, increasing from 47% with YOLO (including null-data training) to 73.3% with YOLO-DSC. These results demonstrate that integrating YOLO-DSC with best-view searching enhances recall and stability under dynamic viewpoints, enabling more reliable calyx-preserving harvesting in real greenhouse conditions.
|
| |
| 09:00-10:30, Paper TuI1I.297 | Add to My Program |
| DreamSea: Photorealistic 3D Underwater Terrain Generation by Latent Fractal Diffusion Models |
|
| Zhang, Tianyi | Aurora Innovation |
| Zhi, Weiming | The University of Sydney; Vanderbilt University |
| Mangelson, Joshua | Brigham Young University |
| Johnson-Roberson, Matthew | Vanderbilt University |
Keywords: Marine Robotics, Deep Learning for Visual Perception
Abstract: This paper tackles the problem of generating representations of underwater 3D terrain. Off-the-shelf generative models, trained on Internet-scale data but not on specialized underwater images, exhibit degraded realism, as images of the seafloor are relatively uncommon. To this end, we introduce DreamSea, a generative model for hyper-realistic underwater scenes. DreamSea is trained on real-world image databases collected from underwater robot surveys. Images from these surveys contain massive numbers of real seafloor observations covering large areas. We extract 3D geometry and latent embeddings from the data with visual foundation models, and train a diffusion model that generates realistic seafloor images in RGBD channels, conditioned on novel fractal-distribution-based latent embeddings. We then fuse the generated images into a 3D map, building a 3DGS model supervised by 2D diffusion priors that allows photorealistic novel view rendering. DreamSea is rigorously evaluated, demonstrating the ability to robustly generate large-scale underwater scenes that are consistent, diverse, and photorealistic. Our work drives impact in underwater robotics, and in particular, underwater robot simulation.
|
| |
| 09:00-10:30, Paper TuI1I.298 | Add to My Program |
| ConsistencyPlanner: Real-Time Planning with Fast-Sampling Consistency Models |
|
| Zhang, Qichao | Institute of Automation, Chinese Academy of Sciences |
| Fang, Xing | Institute of Automation, Chinese Academy of Sciences |
| Fang, Jiaqi | Guangzhou Zaofu Intelligent Technology Co., Ltd |
| Cai, Zhenwen | Hellobike |
| Ling, Jie | Hellobike |
| Yu, Qiankun | Hellobike |
| Zhao, Dongbin | Chinese Academy of Sciences |
Keywords: Imitation Learning, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability for dynamic traffic environments. Learning-based approaches, despite their promise, struggle to balance modeling diverse, multimodal driving behaviors with real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose ConsistencyPlanner, a real-time planning framework with fast-sampling consistency models. Our approach is built upon two key technical contributions. Efficient Multimodal Sampling: we employ fast-sampling consistency models to generate a diverse set of plausible future trajectories, enabling efficient, real-time exploration of multimodal actions and overcoming the computational bottlenecks of previous iterative generative methods. Heterogeneous Feature Fusion: we introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features—including scene features and action tokens—into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.
|
| |
| 09:00-10:30, Paper TuI1I.299 | Add to My Program |
| Ventura: Adapting Image Diffusion Models for Unified Task Conditioned Navigation |
|
| Zhang, Arthur | University of Texas at Austin |
| Meng, Xiangyun | University of Washington |
| Callari, Luca | University of Washington |
| Kim, Dong Ki | Massachusetts Institute of Technology |
| Omidshafiei, Shayegan | Field AI, Inc |
| Biswas, Joydeep | The University of Texas at Austin |
| Agha-mohammadi, Ali-akbar | NASA-JPL, Caltech |
| Shaban, Amirreza | Field AI |
Keywords: Vision-Based Navigation, Learning from Demonstration, Transfer Learning
Abstract: Robots must adapt to diverse human instructions and operate safely in unstructured, open-world environments. Recent Vision–Language models (VLMs) offer strong priors for grounding language and perception, but remain difficult to steer for navigation due to differences in action spaces and pretraining objectives that hamper transferability to robotics tasks. Towards addressing this, we introduce Ventura, a vision–language navigation system that finetunes internet-pretrained image diffusion models for path planning. Instead of directly predicting low-level actions, Ventura generates a path mask (i.e. a visual plan) in image space that captures fine-grained, context-aware navigation behaviors. A lightweight behavior-cloning policy grounds these visual plans into executable trajectories, yielding an interface that follows natural language instructions to generate diverse robot behaviors. To scale training, we supervise on path masks derived from self-supervised tracking models paired with VLM-augmented captions, avoiding manual pixel-level annotation or highly engineered data collection setups. In extensive real-world evaluations, Ventura outperforms state-of-the-art foundation model baselines on object reaching, obstacle avoidance, and terrain preference tasks, improving success rates by 33% and reducing collisions by 54% across both seen and unseen scenarios. Notably, we find that Ventura generalizes to unseen combinations of distinct tasks, revealing emergent compositional capabilities. Videos, code, and additional materials: https://venturapath.github.io.
|
| |
| 09:00-10:30, Paper TuI1I.300 | Add to My Program |
| STONE Dataset: A Scalable Multi-Modal Surround-View 3D Traversability Dataset for Off-Road Robot Navigation |
|
| Park, Konyul | Seoul National University |
| Kim, Daehun | Seoul National University |
| Oh, Jiyong | Seoul National University |
| Yu, Seunghoon | Seoul National University |
| Park, Junseo | Seoul National University |
| Park, Jaehyun | Seoul National University |
| Shin, Hongjae | Seoul National University |
| Cho, Hyungchan | Seoul National University |
| Kim, Jungho | Seoul National University |
| Choi, Jun Won | Seoul National University |
Keywords: Data Sets for Robotic Vision, Robotics and Automation in Agriculture and Forestry, Performance Evaluation and Benchmarking
Abstract: Reliable off-road navigation requires accurate estimation of traversable regions and robust perception under diverse terrain and sensing conditions. However, existing datasets lack both scalability and multi-modality, which limits progress in 3D traversability prediction. In this work, we introduce STONE, a large-scale multi-modal dataset for off-road navigation. STONE provides (1) trajectory-guided 3D traversability maps generated by a fully automated, annotation-free pipeline, and (2) comprehensive surround-view sensing with synchronized 128-channel LiDAR, six RGB cameras, and three 4D imaging radars. The dataset covers a wide range of environments and conditions, including day and night, grasslands, farmlands, construction sites, and lakes. Our auto-labeling pipeline reconstructs dense terrain surfaces from LiDAR scans, extracts geometric attributes such as slope, elevation, and roughness, and assigns traversability labels beyond the robot’s trajectory using a Mahalanobis-distance-based criterion. This design enables scalable, geometry-aware ground-truth construction without manual annotation. Finally, we establish a benchmark for voxel-level 3D traversability prediction and provide strong baselines under both single-modal and multi-modal settings.
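The Mahalanobis-distance-based labeling criterion from the auto-labeling pipeline can be sketched as follows; the feature choice (two geometric attributes), synthetic data, and chi-square-style threshold are illustrative assumptions, not the dataset's actual pipeline.

```python
import numpy as np

def fit_trajectory_stats(traj_features):
    """Mean and inverse covariance of geometric terrain features (e.g., slope,
    roughness) computed over cells the robot actually drove across."""
    mu = traj_features.mean(axis=0)
    cov = np.cov(traj_features, rowvar=False) + 1e-6 * np.eye(traj_features.shape[1])
    return mu, np.linalg.inv(cov)

def label_traversable(cell_features, mu, cov_inv, max_d2=9.0):
    """Label cells beyond the trajectory: traversable when the squared
    Mahalanobis distance to the driven-terrain statistics is small.
    max_d2 is a hypothetical ~3-sigma-per-dimension threshold."""
    diff = cell_features - mu
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return d2 < max_d2

# Synthetic driven terrain: gentle slope and low roughness.
rng = np.random.default_rng(0)
traj = rng.normal([0.1, 0.05], [0.02, 0.01], size=(200, 2))
mu, cov_inv = fit_trajectory_stats(traj)

labels = label_traversable(
    np.array([[0.1, 0.05],   # similar to driven terrain -> traversable
              [0.9, 0.50]]), # steep and rough -> not traversable
    mu, cov_inv)
```

The appeal of this criterion is that it needs no manual annotation: the driven trajectory itself defines what "traversable" statistically looks like.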
|
| |
| 09:00-10:30, Paper TuI1I.301 | Add to My Program |
| Phase-Aware Policy Learning for Skateboard Riding of Quadruped Robots Via Feature-Wise Linear Modulation |
|
| Yoon, Minsung | Korea Advanced Institute of Science and Technology (KAIST) |
| Jeong, Jeil | Korea Advanced Institute of Science and Technology |
| Yoon, Sung-eui | KAIST |
Keywords: Reinforcement Learning, Legged Robots
Abstract: Skateboards offer a compact and efficient means of transportation as a type of personal mobility device. However, controlling them with legged robots poses several challenges for policy learning due to perception-driven interactions and multi-modal control objectives across distinct skateboarding phases. To address these challenges, we introduce Phase-Aware Policy Learning (PAPL), a reinforcement-learning framework tailored for skateboarding with quadruped robots. PAPL leverages the cyclic nature of skateboarding by integrating phase-conditioned Feature-wise Linear Modulation layers into actor–critic networks, enabling a unified policy that captures phase-dependent behaviors while sharing robot-specific knowledge across phases. We validate command-tracking accuracy in simulation and conduct ablation studies quantifying each component's contribution. We also compare locomotion efficiency against leg and wheel–leg baselines and demonstrate real-world transferability.
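Phase-conditioned Feature-wise Linear Modulation (FiLM) can be sketched generically: a cyclic phase encoding produces a per-feature scale and shift applied to shared hidden features. The dimensions, random initialization, and sin/cos phase encoding below are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2-D cyclic phase encoding -> per-feature (gamma, beta).
HIDDEN, PHASE_DIM = 8, 2
W_gamma = rng.normal(0, 0.1, (PHASE_DIM, HIDDEN))
W_beta = rng.normal(0, 0.1, (PHASE_DIM, HIDDEN))

def phase_encoding(phase):
    """Encode a cyclic phase in [0, 1) so that phase 0 and phase 1 coincide."""
    angle = 2 * np.pi * phase
    return np.array([np.sin(angle), np.cos(angle)])

def film(hidden, phase):
    """Feature-wise Linear Modulation: gamma * h + beta, with gamma and beta
    produced from the phase (the 1 + ... keeps the layer near identity at init)."""
    p = phase_encoding(phase)
    gamma = 1.0 + p @ W_gamma
    beta = p @ W_beta
    return gamma * hidden + beta

h = rng.normal(size=HIDDEN)
out_start = film(h, 0.0)
out_wrap = film(h, 1.0)   # same point on the cycle as phase 0
out_mid = film(h, 0.5)    # a different phase modulates the features differently
```

The cyclic encoding is what lets one shared network express distinct per-phase behaviors while the modulation wraps smoothly around the skateboarding cycle.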
|
| |
| 09:00-10:30, Paper TuI1I.302 | Add to My Program |
| RadarSFD: Single-Frame Diffusion with Pretrained Priors for Radar Point Clouds |
|
| Zhao, Bin | Rice University |
| Garg, Nakul | Rice University |
Keywords: Range Sensing, Mapping
Abstract: Millimeter-wave radar provides perception robust to fog, smoke, dust, and low light, making it attractive for size, weight, and power constrained robotic platforms. Current radar imaging methods, however, rely on synthetic aperture or multi-frame aggregation to improve resolution, which is impractical for small aerial, inspection, or wearable systems. We present RadarSFD, a conditional latent diffusion framework that reconstructs dense LiDAR-like point clouds from a single radar frame without motion or SAR. Our approach transfers geometric priors from a pretrained monocular depth estimator into the diffusion backbone, anchors them to radar inputs via channel-wise latent concatenation, and regularizes outputs with a dual-space objective combining latent and pixel-space losses. On the RadarHD benchmark, RadarSFD achieves state-of-the-art performance against baseline models. Qualitative results show recovery of fine walls and narrow gaps, and experiments across new environments confirm strong generalization. Ablation studies highlight the importance of pretrained initialization, radar BEV conditioning, and the dual-space loss. Together, these results establish the first practical single-frame, no-SAR mmWave radar pipeline for dense point cloud perception in compact robotic systems. The project page is available at https://phi-lab-rice.github.io/RadarSFD/
|
| |
| 09:00-10:30, Paper TuI1I.303 | Add to My Program |
| Proposal of an Embodied Airbag-Pressure-Based Control Interface for Inflatable Personal Mobility Devices |
|
| Hayami, Reon | Waseda University |
| Falk, Bill | Chalmers University of Technology |
| Sasatani, Takuya | The University of Tokyo |
| Sugano, Shigeki | Waseda University |
| Kamezaki, Mitsuhiro | The University of Tokyo |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Physical Human-Robot Interaction
Abstract: The demand for personal mobility devices (PMDs) has increased, prompting studies on various control interfaces such as joysticks and handlebars. However, these interfaces remain external to the PMD, and no system has been developed in which the PMD itself functions as the interface. This study proposes an interface that measures and controls the internal air pressure of an inflatable PMD, enabling the system to recognize operator inputs, such as pressing, leaning, or pushing, directly from the PMD body. The system adjusts air pressure according to the operator's estimated weight, allowing operation with minimal force, while continuous speed control is realized through press, lean, and double-push inputs. Experimental results demonstrated that translational and angular speeds could be controlled through embodied body movements, and filter processing effectively mitigated the influence of air pressure fluctuations caused by uneven terrain, ensuring stable operation. Tests conducted on indoor and outdoor courses, including obstacles and uneven surfaces, showed operability comparable to a joystick, though narrower paths required more time to navigate. This study contributes a novel embodied air-pressure-based control paradigm that directly integrates the interface into the PMD itself.
|
| |
| 09:00-10:30, Paper TuI1I.304 | Add to My Program |
| Learning View-Invariant Sign Language Representations Via Dual-Stream Contrastive Learning |
|
| Peng, Yuting | Institute of Computing Technology, Chinese Academy of Sciences |
| Min, Yuecong | Institute of Computing Technology, Chinese Academy of Sciences |
| Chen, Xilin | Institute of Computing Technology, Chinese Academy |
Keywords: Gesture, Posture and Facial Expressions, Deep Learning for Visual Perception, Recognition
Abstract: Viewpoint shifts significantly change how gestures and facial expressions appear and frequently cause occlusions, posing a critical challenge for robust Sign Language Recognition (SLR). To address this challenge, we exploit the spatial flexibility and computational efficiency of skeleton data and propose ViSL, a dual-stream contrastive learning framework to learn View-invariant representations for Sign Language understanding. Specifically, the primary and lifting streams share a common visual feature extractor with different types of input: the primary stream (P-Stream) directly processes frontal-view skeleton data, and the lifting stream (L-Stream) synthesizes skeleton data from arbitrary viewpoints based on 3D estimations. We further propose a view-invariant contrastive loss to align representations across both viewpoints and streams. Experimental results on the challenging cross-view setting of MM-WLAuslan demonstrate that ViSL achieves substantial performance improvements, highlighting its potential for robust real-world SLR applications.
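A view-invariant contrastive alignment of the kind described can be sketched with a generic InfoNCE loss, treating matching (frontal-view, synthesized-view) embedding pairs as positives and other batch samples as negatives. Batch size, dimensionality, and temperature here are arbitrary; the paper's actual loss may differ.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style alignment sketch: row i of z_a should match row i of z_b
    (the same sign seen from two viewpoints/streams); all other rows in the
    batch act as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives lie on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 32))
# Well-aligned streams (second view is a small perturbation of the first).
aligned_loss = info_nce(z, z + 0.01 * rng.normal(size=(16, 32)))
# Unaligned streams (unrelated embeddings).
random_loss = info_nce(z, rng.normal(size=(16, 32)))
```

Minimizing this loss pulls the two streams' embeddings of the same sign together regardless of viewpoint, which is the view-invariance objective.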
|
| |
| 09:00-10:30, Paper TuI1I.305 | Add to My Program |
| Emergent Co-Adaptive Strategies in Heterogeneous Multi-Robot Systems Via Meta-Learning |
|
| Wang, Haocheng | Chinese University of Hong Kong, Shenzhen |
| Wang, Lin | Shenzhen Institute of Artificial Intelligence and Robotics for Society |
| Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
| Zhai, Jianwang | Guilin University of Electronic Technology, School of Computer and Information Security |
| He, Xuchun | Shenzhen Institute of Artificial Intelligence and Robotics for Society |
| Gao, Yuan | Shenzhen Institute of Artificial Intelligence and Robotics for Society |
Keywords: Social HRI, Acceptability and Trust, Multi-Robot Systems
Abstract: As teamed robots increasingly share public spaces with humans, the ability to co-adapt—to mutually adjust behavior in response to one another—becomes essential for safe, efficient, and socially acceptable operation. This paper introduces a socially co-adaptive framework for heterogeneous multi-robot systems (HMRS) that enables real-time adaptation to human behavior while preserving cooperative task execution. Our approach fuses large language models for natural language understanding with model-agnostic meta-learning to allow robots to rapidly generalize across diverse social contexts. We implement and validate the system using a real-world HMRS composed of robots with different roles—workers, a station, and a social robot—interacting with 44 human participants under induced behavioral states (relaxed vs. nervous). Results reveal significant behavioral adaptation: the system dynamically shifts between egoistic and altruistic strategies, improving crowd guidance success by 21%. It also reduces human cognitive load—specifically, physical demands by 39% and temporal demands by 39%—while increasing trust by 16% and perceived anthropomorphism by 21%. This work demonstrates the feasibility of human-robot co-adaptation at scale, laying the groundwork for socially intelligent robotic systems capable of thriving in complex, human-centered environments.
|
| |
| 09:00-10:30, Paper TuI1I.306 | Add to My Program |
| Manifold Geometry-Based Feature Decoupling for Endoscopic Image Analysis |
|
| Wen, Yan | Shanghai Jiao Tong University |
| Wang, Haodong | Shanghai Jiao Tong University |
| Chen, Lingyu | Nanjing University of Aeronautics and Astronautics |
| She, Wenbo | Shanghai Jiao Tong University |
| Han, Dingpei | Ruijin Hospital, Shanghai Jiao Tong University School of Medicine |
| Chen, Fang | Shanghai Jiao Tong University |
| Huang, Tianqi | Shanghai Jiao Tong University |
|
|
| |
| 09:00-10:30, Paper TuI1I.307 | Add to My Program |
| Color-Pair Guided Robust Zero-Shot 6D Pose Estimation and Tracking of Cluttered Objects on Edge Devices |
|
| Yang, Xingjian | University of Washington |
| Banerjee, Ashis | University of Washington |
Keywords: RGB-D Perception, Visual Learning, Visual Tracking
Abstract: Robust 6D pose estimation of novel textured objects under challenging illumination remains a significant challenge, often requiring a trade-off between accurate initial pose estimation and efficient real-time tracking. We present a unified framework explicitly designed for efficient execution on edge devices, which synergizes a robust initial estimation module with a fast motion-based tracker. The key to our approach is a shared, lighting-invariant color-pair feature representation that forms a consistent foundation for both stages. For initial estimation, this feature facilitates robust registration between the live RGB-D view and the object's 3D mesh. For tracking, the same feature logic validates temporal correspondences, enabling a lightweight model to reliably regress the object's motion. Extensive experiments on benchmark datasets demonstrate that our integrated approach is both effective and robust, providing competitive pose estimation accuracy while maintaining high-fidelity tracking even through abrupt pose changes. Code: https://github.com/smartslab/Color-Pair-Guided-Zero-Shot-6D-Pose
|
| |
| 09:00-10:30, Paper TuI1I.308 | Add to My Program |
| Structured Labeling Enables Faster Vision-Language Models for End-To-End Autonomous Driving |
|
| Jiang, Hao | Shanghai Jiao Tong University |
| Hu, Chuan | Shanghai Jiao Tong University |
| Shi, Yukang | KargoBot |
| He, Yuan | KargoBot |
| Wang, Ke | Kargobot |
| Zhang, Xi | Shanghai Jiaotong University |
| Zhang, Zhipeng | Shanghai Jiaotong University |
Keywords: Computer Vision for Transportation
Abstract: Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remain between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, the high computational cost and massive scale of VLMs hinder inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on the structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while surpassing massive-parameter baselines in inference speed with over 10x speedup. Additionally, ablation studies examine the impact of scene annotations (e.g., weather, time of day), demonstrating their importance for decision-making in autonomous driving.
|
| |
| 09:00-10:30, Paper TuI1I.309 | Add to My Program |
| Simultaneous Calibration of Noise Covariance and Kinematics for State Estimation of Legged Robots Via Bi-Level Optimization |
|
| Cheng, Denglin | Johns Hopkins Unversity |
| Kang, Jiarong | University of Wisconsin Madison |
| Xiong, Xiaobin | Shanghai Innovation Institute |
Keywords: Sensor Fusion, Legged Robots
Abstract: Accurate state estimation is critical for legged and aerial robots operating in dynamic, uncertain environments. A key challenge lies in specifying process and measurement noise covariances, which are typically unknown or manually tuned. In this work, we introduce a bi-level optimization framework that jointly calibrates covariance matrices and kinematic parameters in an estimator-in-the-loop manner. The upper level treats noise covariances and model parameters as optimization variables, while the lower level executes a full-information estimator. Differentiating through the estimator allows direct optimization of trajectory-level objectives, resulting in accurate and consistent state estimates. We validate our approach on quadrupedal and humanoid robots, demonstrating significantly improved estimation accuracy and uncertainty calibration compared to hand-tuned baselines. Our method unifies state estimation, sensor, and kinematics calibration into a principled, data-driven framework applicable across diverse robotic platforms.
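The estimator-in-the-loop idea can be illustrated on a toy 1-D Kalman filter, with finite differences standing in for differentiating through the estimator: the upper level tunes the measurement noise covariance R against a trajectory-level error, while the lower level runs the filter. Everything here is a simplified stand-in for the paper's full-information, multi-parameter setup.

```python
import numpy as np

def kf_rmse(R, zs, truth, Q=0.01):
    """Lower level: run a 1-D Kalman filter with candidate measurement noise R
    and score the resulting state trajectory against ground truth."""
    x, P, est = 0.0, 1.0, []
    for z in zs:
        P += Q                 # predict (random-walk process model)
        K = P / (P + R)        # update
        x += K * (z - x)
        P *= (1.0 - K)
        est.append(x)
    return float(np.sqrt(np.mean((np.array(est) - truth) ** 2)))

def calibrate_R(zs, truth, R=5.0, lr=5.0, eps=1e-3, iters=100):
    """Upper level: finite-difference gradient descent on the estimator loss
    (a stand-in for autodiff through the estimator), keeping the best R seen."""
    best_R, best_loss = R, kf_rmse(R, zs, truth)
    for _ in range(iters):
        g = (kf_rmse(R + eps, zs, truth) - kf_rmse(R - eps, zs, truth)) / (2 * eps)
        R = max(1e-4, R - lr * g)
        loss = kf_rmse(R, zs, truth)
        if loss < best_loss:
            best_R, best_loss = R, loss
    return best_R

rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(0, 0.1, 300))   # slowly drifting true state
zs = truth + rng.normal(0, 0.5, 300)         # noisy measurements
R_cal = calibrate_R(zs, truth, R=5.0)        # start from a badly hand-tuned R
```

The point of the sketch is the data flow: the covariance is judged by the accuracy of the trajectory the estimator produces, not by a hand-tuned prior.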
|
| |
| 09:00-10:30, Paper TuI1I.310 | Add to My Program |
| BEV-Patch-PF: Particle Filtering with BEV-Aerial Feature Matching for Off-Road Geo-Localization |
|
| Lee, Dongmyeong | University of Texas at Austin |
| Quattrociocchi, Jesse | U.S. Army Research Laboratory |
| Ellis, Christian | U.S. Army Research Laboratory |
| Rana, Rwik | University of Texas at Austin |
| Adkins, Amanda | University of Texas at Austin |
| Uccello, Adam | U.S. Army Research Laboratory |
| Warnell, Garrett | U.S. Army Research Laboratory |
| Biswas, Joydeep | The University of Texas at Austin |
Keywords: Localization, Field Robots
Abstract: Localizing ground robots against aerial imagery provides a critical capability for autonomous navigation, especially in environments where GPS is unreliable or unavailable. This task is challenging due to large viewpoint differences and substantial environmental variability. Most prior methods localize each frame independently, using either global-descriptor retrieval or spatial feature alignment, which leaves them vulnerable to ambiguity and multi-modal pose hypotheses. While sequential reasoning can mitigate this uncertainty, adapting existing per-frame pipelines for sequential use introduces unfavorable trade-offs among accuracy, memory, and computation that limit their practical deployment. We propose BEV-Patch-PF, a vision-only, GPS-free sequential geo-localization system that integrates particle filtering with learned bird’s-eye-view (BEV) and aerial feature maps. For each 3-DoF particle pose hypothesis, we crop the corresponding patch from an aerial feature map computed from a local aerial image centered on the predicted pose. The resulting BEV–aerial feature match defines a per-particle log-likelihood for particle-filter updates. In addition, we learn a frame-level uncertainty estimate that adaptively flattens the observation likelihood for unreliable observations, preventing overconfident particle collapse in ambiguous regions. On two real-world off-road datasets, our method achieves 9.7× lower absolute trajectory error (ATE) on seen routes and 6.6× lower ATE on unseen routes than a retrieval-based baseline, while remaining robust under partial canopy cover and shadowing. The system runs in real time at 10 Hz on an NVIDIA Tesla T4, enabling practical robot deployment.
|
| |
| 09:00-10:30, Paper TuI1I.311 | Add to My Program |
| Swimming under Constraints: A Safe Reinforcement Learning Framework for Quadrupedal Bio-Inspired Propulsion |
|
| Cui, Xinyu | Institute of Automation, Chinese Academy of Sciences |
| Han, Fei | Westlake University |
| Xu, Hang | Westlake University |
| Zeng, Yongcheng | Institute of Automation, Chinese Academy of Sciences |
| Sun, Luoyang | Institute of Automation, Chinese Academy of Sciences |
| Zhang, RuiZhi | Institute of Automation, Chinese Academy of Sciences |
| Zhao, Jian | Zhongguancun Academy |
| Zhang, Haifeng | CASIA |
| Li, Weikun | Westlake University |
| Chen, Hao | Westlake University |
| Wang, Jun | University College London |
| Fan, Dixia | Westlake University |
Keywords: Reinforcement Learning, Bioinspired Robot Learning, Constrained Motion Planning
Abstract: Bio-inspired aquatic propulsion offers high thrust and maneuverability but is prone to destabilizing forces such as lift fluctuations, which are further amplified by six-degree-of-freedom (6-DoF) fluid coupling. We formulate quadrupedal swimming as a constrained optimization problem that maximizes forward thrust while minimizing destabilizing fluctuations. Our proposed framework, Accelerated Constrained Proximal Policy Optimization with a PID-regulated Lagrange multiplier (ACPPO-PID), enforces constraints through the PID-regulated multiplier, accelerates learning via conditional asymmetric clipping, and stabilizes updates through cycle-wise geometric aggregation. Initialized with imitation learning and refined through on-hardware towing-tank experiments, ACPPO-PID produces control policies that transfer effectively to quadrupedal free-swimming trials. Results demonstrate improved thrust efficiency, reduced destabilizing forces, and faster convergence compared with state-of-the-art baselines, underscoring the importance of constraint-aware safe RL for robust and generalizable bio-inspired locomotion in complex fluid environments.
|
| |
| 09:00-10:30, Paper TuI1I.312 | Add to My Program |
| MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human–Robot Interaction |
|
| Deigmoeller, Joerg | Honda Research Institute Europe |
| Agarwal, Nakul | Honda Research Institute USA |
| Hasler, Stephan | Honda Research Institute Europe |
| Tanneberg, Daniel | Honda Research Institute Europe |
| Belardinelli, Anna | Honda Research Institute Europe |
| Ghoddoosian, Reza | Honda Research Institute USA |
| Wang, Chao | Honda Research Institute Europe GmbH |
| Ocker, Felix | Honda Research Institute Europe |
| Zhang, Fan | Honda Research Institute Europe |
| Dariush, Behzad | Honda Research Institute USA |
| Gienger, Michael | Honda Research Institute Europe |
Keywords: Human-Robot Collaboration, Data Sets for Robotic Vision, Semantic Scene Understanding
Abstract: We introduce MERGE, a system for situational grounding of actors, objects, and events in dynamic human–robot group interactions. Effective collaboration in such settings requires consistent situational awareness, built on persistent representations of people and objects and an episodic abstraction of events. MERGE achieves this by uniquely identifying physical instances of actors (humans or robots) and objects and structuring them into actor–action–object relations, ensuring temporal consistency across interactions. Central to MERGE is the integration of Vision-Language Models (VLMs) guided by a perception pipeline: a lightweight streaming module continuously processes visual input to detect changes and selectively invokes the VLM only when necessary. This decoupled design preserves the reasoning power and zero-shot generalization of VLMs while improving efficiency, avoiding both the high monetary cost and the latency of frame-by-frame captioning that leads to fragmented and delayed outputs. To address the absence of suitable benchmarks for multi-actor collaboration, we introduce the GROUND dataset, which offers fine-grained situational annotations of multi-person and human–robot interactions. On this dataset, our approach improves the average grounding score by a factor of 2 compared to VLM-only baselines--including GPT-4o, GPT-5 and Gemini 2.5 Flash--while also reducing run-time by a factor of 4. The code and data are available at www.github.com/HRI-EU/merge.
|
| |
| 09:00-10:30, Paper TuI1I.313 | Add to My Program |
| KoopCast: Trajectory Forecasting Via Koopman Operators |
|
| Lee, Jungjin | Seoul National University |
| Shin, Jaeuk | Seoul National University |
| Kim, Gihwan | Seoul National University |
| Han, Joon Ho | Seoul National University |
| Yang, Insoon | Seoul National University |
Keywords: Representation Learning, AI-Enabled Robotics, Autonomous Agents
Abstract: We present KoopCast, a lightweight yet efficient model for trajectory forecasting in general dynamic environments. Our approach leverages Koopman operator theory, which enables a linear representation of nonlinear dynamics by lifting trajectories into a higher-dimensional space. The framework follows a two-stage design: first, a probabilistic neural goal estimator predicts plausible long-term targets, specifying where to go; second, a Koopman operator-based refinement module incorporates intention and history into a nonlinear feature space, enabling linear prediction that dictates how to go. This dual structure not only ensures strong predictive accuracy but also inherits the favorable properties of linear operators while faithfully capturing nonlinear dynamics. As a result, our model offers three key advantages: (i) competitive accuracy, (ii) interpretability grounded in Koopman spectral theory, and (iii) low-latency deployment. We validate these benefits on ETH/UCY, the Waymo Open Motion Dataset, and nuScenes, which feature rich multi-agent interactions and map-constrained nonlinear motion. Across benchmarks, KoopCast consistently delivers high predictive accuracy together with mode-level interpretability and practical efficiency.
|
| |
| 09:00-10:30, Paper TuI1I.314 | Add to My Program |
| TetherBot: A Power-Tethered sUAS Platform for Autonomous Vertical Atmospheric Profiling |
|
| Rico, Daniel | The University of Nebraska - Lincoln |
| Munoz-Arriola, Francisco | The University of Nebraska - Lincoln |
| Bradley, Justin | NC State University |
Keywords: Field Robots, Distributed Robot Systems, Aerial Systems: Applications
Abstract: Resolving vertical gradients of atmospheric variables in agroecosystems is essential for understanding surface-atmosphere exchange. It is also critical for emerging carbon monitoring frameworks. Existing methods, such as eddy covariance towers and satellite remote sensing, provide observations with limited spatial resolution, leaving fine-scale structure undersampled. This work introduces TetherBot, a tethered robotic profiler integrated into the Tethered Aircraft Uncrewed System. The robot traverses a hoisted power tether, enabling persistent vertical profiling with synchronized sensing and telemetry. Field experiments across a 40 m transect demonstrated reliable operation. Barometric pressure provided consistent altitude, temperature resolved subtle stratification, and relative humidity revealed surface-layer variability. Carbon dioxide measurements were dominated by sensor noise, highlighting the need for higher-fidelity analyzers. These results demonstrate the feasibility of tethered robotic profiling as a viable approach for atmospheric monitoring. They establish a foundation for future multi-robot arrays and high-precision flux applications.
|
| |
| 09:00-10:30, Paper TuI1I.315 | Add to My Program |
| OTTO: Dynamics and Control of Wheeled Bipedal Jumping Robot |
|
| Buasakorn, Paweekorn | King Mongkut's University of Technology Thonburi |
| Thamrongaphichartkul, Kitti | King Mongkut's University of Technology Thonburi |
| Vongbunyong, Supachai | Institute of Field Robotics, King Mongkut's University of Technology Thonburi |
|
|
| |
| 09:00-10:30, Paper TuI1I.316 | Add to My Program |
| MTRDrive: Memory-Tool Synergistic Reasoning for Robust Autonomous Driving in Corner Cases |
|
| Luo, Ziang | Tsinghua University |
| Qian, Kangan | Tsinghua University |
| Wang, Jiahua | Xiaomi Corporation |
| Miao, Jinyu | Tsinghua University |
| Fu, Zheng | Tsinghua University |
| Luo, Yuechen | Tsinghua University |
| Wang, Yunlong | NIO Inc |
| Jiang, Sicong | McGill University |
| Huang, Zilin | University of Wisconsin-Madison |
| Hu, Yifei | Xiaomi Corporation |
| Yang, Yuhao | Xiaomi Corporation |
| Ye, Hao | Jiangsu University |
| Yang, Mengmeng | Tsinghua University |
| Dong, Xiaojian | Xiaomi Corporation |
| Jiang, Kun | Tsinghua University |
| Yang, Diange | Tsinghua University |
Keywords: Autonomous Vehicle Navigation, Automation Technologies for Smart Cities
Abstract: Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving, yet a substantial gap remains between their current capabilities and the reliability necessary for real-world deployment. A critical challenge is their fragility, characterized by hallucinations and poor generalization in out-of-distribution (OOD) scenarios. To bridge this gap, we introduce MTRDrive, a novel framework that integrates procedural driving experiences with a dynamic toolkit to enhance generalization and proactive decision-making. MTRDrive addresses these limitations through a closed-loop system that combines a memory-based experience retrieval mechanism with dynamic toolkits. This synergy enables the model to interact more effectively with its environment, improving both reasoning and decision-making through memory-tool synergistic reasoning. Additionally, we introduce a new benchmark based on complex roadwork construction scenarios to rigorously evaluate zero-shot generalization. Extensive experiments demonstrate the effectiveness of our approach. On the public NavSim benchmark, MTRDrive achieves state-of-the-art performance with a driving metric score of 79.8% and a planning accuracy of 82.6%. In a zero-shot setting on our new Roadwork-VLM benchmark, a challenging out-of-distribution test, it attains a driving metric score of 80.2% and a planning accuracy of 33.5%, showcasing its strong ability to reason robustly in unseen scenarios. These results highlight the potential of MTRDrive to advance autonomous driving towards safer and more reliable systems.
|
| |
| 09:00-10:30, Paper TuI1I.317 | Add to My Program |
| Mechanomyography-Based Closed-Loop Control of FES Enabling Prolonged Force Assistance by Monitoring Muscle Fatigue |
|
| Liu, Zehao | Imperial College London |
| Huo, Weiguang | Nankai University |
| Vaidyanathan, Ravi | Imperial College London |
Keywords: Medical Robots and Systems, Rehabilitation Robotics
Abstract: Functional Electrical Stimulation (FES) is a critical therapy for motor rehabilitation, yet the rapid onset of muscle fatigue severely limits its efficacy. This paper presents the design, implementation, and validation of a comprehensive, intelligent closed-loop FES system designed to provide effective force assistance by actively sensing FES-induced fatigue. The system integrates a pressure-based Mechanomyography (P_MMG) sensor for real-time feedback of muscle force capacity, a Kalman filter for robust signal estimation, and a fuzzy-logic-based Proportional-Integral-Derivative (PID) controller to modulate FES dynamically. The developed system was first validated in a comprehensive simulation and then tested with four healthy participants. The results demonstrate that the closed-loop fuzzy PID controller yielded a functionally meaningful improvement in performance over an open-loop-controlled protocol. The system substantially extended the duration of effective FES and, critically, delayed the onset of functional failure (indicated by a force drop >50%), with performance improvements showing a strong trend toward statistical significance (Wilcoxon signed-rank test, p = 0.0625). This work delivers a practical and effective solution for managing fatigue during FES therapy, holding the potential to significantly enhance rehabilitation outcomes.
|
| |
| 09:00-10:30, Paper TuI1I.318 | Add to My Program |
| BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands |
|
| Cho, Seongwon | Seoul National University |
| Ahn, Daechul | Seoul National University |
| Shin, Donghyun | Seoul National University |
| Choi, Hyeonbeom | Seoul National University |
| Kim, San | Seoul National University |
| Choi, Jonghyun | Seoul National University |
Keywords: Mobile Manipulation, Task Planning, Deep Learning in Grasping and Manipulation
Abstract: Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation as the environment changes dynamically. However, most prior works update their world representation only at discrete milestones, such as waypoints or the end of an action step. Such sparse updates leave robots with limited awareness between updates, causing missed objects, delayed error detection, and slower replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that separates strategic planning from continuous environmental monitoring. BINDER combines a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a Video-LLM for continuous monitoring). The DRM handles strategic planning through structured 3D scene updates and guides the IRM’s focus, while the IRM processes video streams to update memory, proactively adjust actions, and trigger replanning when needed. This bidirectional coordination ensures continuous awareness without costly updates, enabling reliable and robust operation under dynamic conditions. We evaluate BINDER in three real-world environments where objects are moved during execution and show that it achieves substantially higher success rates and efficiency than state-of-the-art baselines, confirming its effectiveness for real-world deployment.
|
| |
| 09:00-10:30, Paper TuI1I.319 | Add to My Program |
| Towards Personalized Social Robots: Adaptive Prompting for Real-Time Context-Aware Conversations |
|
| Lalwani, Himanshi | New York University Abu Dhabi |
| Salam, Hanan | New York University Abu Dhabi |
Keywords: Education Robotics, Cognitive Modeling
Abstract: Social robots have demonstrated great potential in various domains. Recent advancements in Large Language Models (LLMs) have expanded the conversational capabilities of these robots, enabling more personalized user interactions. However, current systems primarily focus on behavior or task personalization, or they require extensive pre-training and fine-tuning to achieve language personalization. This paper introduces adaptive prompting, a formal framework for real-time linguistic personalization in LLM-driven robots. By structuring interaction as a sequence of interdependent prompts, adaptive prompting enables controllable, efficient, and scalable personalization without additional model training. To validate our approach, we present a system that integrates adaptive prompting in a social robot to dynamically adapt to user attributes and preferences to provide personalized productivity coaching for college students with Attention Deficit Hyperactivity Disorder (ADHD). Our findings demonstrate that personalized coaching via adaptive prompting improves user engagement and overall coaching effectiveness compared to non-personalized coaching. This indicates the effectiveness of the proposed approach for user adaptation and personalization in social robots, particularly in the aforementioned contexts.
|
| |
| 09:00-10:30, Paper TuI1I.320 | Add to My Program |
| Autonomous UAV–Quadruped Docking in Complex Terrains Via Active Posture Alignment and Constraint-Aware Control |
|
| Xu, Haozhe | Tongji University |
| Cheng, Cheng | Tongji University |
| Sang, Hongrui | Shanghai Maritime University |
| Wang, Zhipeng | Tongji University |
| He, Qiyong | Tongji University |
| Li, Xiuxian | Tongji University |
| He, Bin | Tongji University |
Keywords: Cooperating Robots, Legged Robots
Abstract: Autonomous docking between Unmanned Aerial Vehicles (UAVs) and ground robots is essential for heterogeneous systems, yet most existing approaches target wheeled platforms whose limited mobility constrains exploration in complex terrains. Quadruped robots offer superior adaptability but undergo frequent posture variations, making it difficult to provide a stable landing surface for UAVs. To address these challenges, we propose an autonomous UAV-quadruped docking framework for GPS-denied environments. On the quadruped side, a Hybrid Internal Model with Horizontal Alignment (HIM-HA), learned via deep reinforcement learning, actively stabilizes the torso to provide a level platform. On the UAV side, a three-phase strategy is adopted, consisting of long-range acquisition with a median-filtered YOLOv8 detector, close-range tracking with a constraint-aware controller that integrates a Nonsingular Fast Terminal Sliding Mode Controller (NFTSMC) and a logarithmic Barrier Function (BF) to guarantee finite-time error convergence under field-of-view (FOV) constraints, and terminal descent guided by a Safety Period (SP) mechanism that jointly verifies tracking accuracy and platform stability. The proposed framework is validated in both simulation and real-world scenarios, successfully achieving docking on outdoor staircases higher than 17 cm and rough slopes steeper than 30 degrees. Supplementary material and videos are available at: https://uav-quadruped-docking.github.io.
|
| |
| 09:00-10:30, Paper TuI1I.321 | Add to My Program |
| Automated Genomic Interpretation Via Concept Bottleneck Models for Medical Robotics |
|
| Li, Zijun | Binghamton University |
| Zhang, Jinchang | Binghamton University |
| Ming, Zhang | Los Alamos National Laboratory |
| Lu, Guoyu | Binghamton University |
Keywords: Medical Robots and Systems
Abstract: We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior-consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost-aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state-of-the-art classification performance, superior concept prediction fidelity, and more favorable cost–benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.
|
| |
| 09:00-10:30, Paper TuI1I.322 | Add to My Program |
| Learning Dynamical System-Based Robot Motions from Demonstrations Via ODE-Driven Diffeomorphic Mappings |
|
| Zhang, Haoyu | Institute of Automation, Chinese Academy of Sciences |
| Cheng, Long | Chinese Academy of Sciences |
Keywords: Learning from Demonstration, Imitation Learning
Abstract: Learning from Demonstrations (LfD) has emerged as a prominent paradigm for imparting motion skills to robotic systems. Dynamical systems (DS) offer a potent mathematical framework for representing point-to-point motions, a critical requirement for numerous practical applications in robotics. While existing approaches typically construct DS models by employing diffeomorphic mappings to morph stable reference systems toward observed demonstrations, the requirement to preserve strict diffeomorphic properties introduces architectural constraints on neural network design, thereby constraining their expressiveness. To address this limitation, we present a DS-based LfD formulation that relaxes traditional diffeomorphism constraints. Our framework employs bidirectional temporal integration of ordinary differential equations (ODEs) to simultaneously satisfy stability guarantees and trajectory alignment objectives. A key innovation lies in a variational calculus framework for Jacobian estimation, enabling efficient computation of DS vector fields while maintaining numerical stability. Comprehensive evaluations demonstrate that our method achieves 33.7% improvement in trajectory reproduction accuracy compared to state-of-the-art baselines while preserving Lyapunov stability. The proposed methodology significantly expands the representational capacity of DS-based learning systems, enabling robust reproduction of complex motion patterns.
|
| |
| 09:00-10:30, Paper TuI1I.323 | Add to My Program |
| Semantic-Guided Progressive Object Removal with Gaussian Splatting |
|
| Huang, Xianliang | Bytedance |
| Xiao, Chen | Fudan University |
| Ni, Yuanxiang | Southern University of Science and Technology |
| Liu, Guanming | Fudan University |
| Liu, Mingkai | Peking University |
| Fan, Dikai | PICO, ByteDance |
| Liu, Xiao | PICO, ByteDance |
| Zhang, Hao | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
|
|
| |
| 09:00-10:30, Paper TuI1I.324 | Add to My Program |
| Hierarchical Online Learning for Adaptive Sampling of Discrete Species Distributions with an AUV |
|
| Todd, Jessica | Caltech |
| McCammon, Seth | Woods Hole Oceanographic Institution |
| Yoerger, Dana | Woods Hole Oceanographic Institution |
Keywords: Integrated Planning and Learning, Marine Robotics, Probabilistic Inference
Abstract: Autonomous robots are increasingly being used in scientific exploration and data acquisition. In particular, robotic systems for mapping and sampling species are becoming widespread in both aerial and underwater domains; however, choosing where to sample is challenging when the phenomena of interest are discrete and sparsely distributed in space or time, such as when mapping a particular benthic species. In this paper we present a hierarchical online learning framework for reasoning about species distribution in real time, in order to inform sampling decisions. Drawing inspiration from the Species Distribution Modelling community, a hierarchical probabilistic model is developed using the Integrated Nested Laplace Approximation framework, which enables online inference about expected target hotspots using predicted substrate distributions. Model parameters are learned online to build a prediction over the discrete targets, and the model is integrated into an anytime online planner to enable adaptive path planning. The hierarchical learning approach is demonstrated on simulated synthetic environments and shown to consistently outperform baseline methods, such as Gaussian Process regression and boustrophedon coverage approaches, when robot resources are constrained.
|
| |
| 09:00-10:30, Paper TuI1I.325 | Add to My Program |
| An Autonomous and Hardware-Agnostic Vision-Servoed System for Microdevice Injection |
|
| Zheng, Yumin | Nanyang Technological University |
| Tan, Runjia | Nanyang Technological University |
| Jiao, Rui | Nanyang Technological University |
| Lee, Sunwoo | Nanyang Technological University |
Keywords: Robotics and Automation in Life Sciences, Visual Servoing, Manipulation Planning
Abstract: Automated manipulation of nanoliter-scale implantable microdevices (IMDs) typically relies on complex, custom-built robotic setups that are difficult to reproduce and require extensive manual calibration. To address this challenge, this paper proposes an easily deployable and highly reproducible vision-servoed manipulation system for IMDs. Based on standard commercial off-the-shelf devices, the proposed platform is hardware-agnostic and eliminates the need for tedious manual calibration. The automated workflow seamlessly integrates coarse positioning, auto-focus, and marker-aided centering to achieve robust precision. The system is validated using a sub-nanoliter IMD, the microscale optoelectronic tetherless electrode (MOTE). Experimental results demonstrate that the proposed framework requires minimal manual intervention and significantly reduces operating time by 47.2% compared to manual injection performed by an experienced user. These results pave the way for economical, high-throughput, and automated IMD-based in vitro and in vivo experiments, and beyond.
|
| |
| 09:00-10:30, Paper TuI1I.326 | Add to My Program |
| DRUM: Diffusion-Based Raydrop-Aware Unpaired Mapping for Sim2Real LiDAR Segmentation |
|
| Miyawaki, Tomoya | Kyushu University |
| Nakashima, Kazuto | Kyushu University |
| Iwashita, Yumi | NASA / Caltech Jet Propulsion Laboratory |
| Kurazume, Ryo | Kyushu University |
Keywords: Transfer Learning, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at https://miya-tomoya.github.io/drum.
|
| |
| 09:00-10:30, Paper TuI1I.327 | Add to My Program |
| MolmoAct: Action Reasoning Models That Can Reason in Space |
|
| Lee, Jason | Allen Institute for AI, University of Washington |
| Duan, Jiafei | University of Washington |
| Fang, Haoquan | University of Washington |
| Deng, Yuquan | University of Washington |
| Liu, Shuo | University of Washington |
| Li, Boyang | University of Washington |
| Fang, Bohan | University of Washington |
| Zhang, Jieyu | University of Washington |
| Wang, Yi Ru | University of Washington |
| Lee, Sangho | Allen Institute for AI |
| Han, Winson | Allen Institute for Artificial Intelligence |
| Pumacay, Wilbert | Allen Institute for AI |
| Wu, Angelica | University of Washington |
| Hendrix, Rose | Allen Institute for Artificial Intelligence |
| Farley, Karen | Allen Institute for AI |
| VanderBilt, Eli | Allen Institute for Artificial Intelligence |
| Farhadi, Ali | University of Washington |
| Fox, Dieter | University of Washington |
| Krishna, Ranjay | University of Washington |
Keywords: Big Data in Robotics and Automation, Imitation Learning, Representation Learning
Abstract: Reasoning is essential for purposeful action, yet most robotic foundation models map perception and instructions directly to control, limiting adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), which integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth perception tokens, generates 2D spatial plans, and predicts fine-grained actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves 70.5% zero-shot accuracy on SimplerEnv Visual Matching (surpassing Pi-0 and GR00T N1.5), 86.6% average success on LIBERO, and real-world fine-tuning gains of +10% (single-arm) and +22.7% (bimanual) over Pi-0-FAST. It further improves out-of-distribution generalization by +23.3% and ranks highest in human-preference evaluations for open-instruction following and trajectory steering. We also release the MolmoAct Dataset, a dataset of 10k diverse robot trajectories that yields an average +5.5% performance boost when used for training. Together with open model weights and code, this establishes MolmoAct as a state-of-the-art robotic foundation model and an open blueprint for building ARMs that transform perception into grounded, purposeful action. Further experimental details and results with the MolmoAct Dataset and human-preference evaluations are included in the supplementary video.
|
| |
| 09:00-10:30, Paper TuI1I.328 | Add to My Program |
| DiffuDepGrasp: Diffusion-Based Depth Noise Modeling Empowers Sim-To-Real Robotic Grasping |
|
| Zhou, Yingting | Institute of Automation, Chinese Academy of Sciences |
| Cui, Wenbo | Institute of Automation, Chinese Academy of Sciences |
| Liu, Weiheng | Institute of Automation, Chinese Academy of Sciences; Galbot |
| Chen, Guixing | BeiJing Zinovate to Future Co. LTD |
| Li, Haoran | Institute of Automation, Chinese Academy of Sciences |
| Zhao, Dongbin | Chinese Academy of Sciences |
Keywords: Reinforcement Learning, Deep Learning in Grasping and Manipulation, Grasping
Abstract: Accurate spatial-geometric perception remains fundamental to robotic grasping, yet physical artifacts in real depth maps like voids and noise establish a significant sim-to-real gap that critically impedes policy transfer. Training-time strategies like procedural noise injection or learned mappings either suffer from data inefficiency due to unrealistic noise simulation, which is often ineffective for grasping tasks that require fine manipulation, or depend heavily on paired datasets. Furthermore, leveraging foundation models to reduce the sim-to-real gap via intermediate representations fails to fully mitigate the domain shift and adds computational overhead during deployment. This work confronts the dual challenges of data inefficiency and deployment complexity. We propose DiffuDepGrasp, a deploy-efficient sim-to-real framework enabling zero-shot transfer through simulation-exclusive policy training. Its core innovation, the Diffusion Depth Generator, synthesizes geometrically pristine simulation depth with learned sensor-realistic noise via two synergistic modules. The first Diffusion Depth Module leverages temporal geometric priors to enable sample-efficient training of a conditional diffusion model that captures complex sensor noise distributions, while the second Noise Grafting Module preserves metric accuracy during perceptual artifact injection. Policies trained via our framework require only raw depth inputs during deployment, thus eliminating computational overhead. Extensive sim-to-real validation demonstrates 95.7% average success (SOTA) on 12-object grasping with zero-shot transfer and strong generalization to unseen objects, proving data efficiency and practical value.
|
| |
| 09:00-10:30, Paper TuI1I.329 | Add to My Program |
| Wildfire Containment Using Multi-Drone Systems: A Predict-Then-Optimize Approach |
|
| Cheng, Aoran | The University of Hong Kong |
| Pan, Shijie | Johns Hopkins University |
| Sun, Yiqi | The University of Hong Kong |
| Kang, Kai | College of Economics, Shenzhen University |
| Pais, Cristobal | Amazon |
| Zhou, Yulun | The University of Hong Kong |
| Shen, Zuo-Jun Max | The University of Hong Kong |
|
|
| |
| 09:00-10:30, Paper TuI1I.330 | Add to My Program |
| Visual Proactivity: Enhancing Human-Robot Collaboration through Intent Communication |
|
| Bo, Valerio | Institut De Robotica Y Informatica Industrial (CSIC-UPC) |
| Bejarano, Edison | Institut De Robotica Y Informatica Industrial (CSIC-UPC) |
| Garrell, Anais | Institut De Robotica Y Informatica Industrial (CSIC-UPC) |
| Sanfeliu, Alberto | Institut De Robotica Y Informatica Industrial (CSIC-UPC) |
Keywords: Human-Centered Robotics, Intention Recognition, Social HRI
Abstract: As robots transition from performing repetitive tasks to collaborating with humans, understanding human intent becomes crucial to effective interaction. Anticipation enables robots to predict human actions, while proactivity allows them to take initiative and guide human behavior toward optimal outcomes. Although research has largely focused on how robots infer and respond to human intentions, less attention has been paid to how robots communicate their own intent. This paper introduces visual proactivity, a novel, simple yet effective approach that enables robots to communicate their intentions through visual feedback, influencing human behavior and enhancing transparency and fluency. We develop and evaluate proactive robotic behaviors in a human-to-robot handover scenario, where a user study validates human perception of reactive, anticipatory, and proactive behaviors. The results demonstrate that effective visual proactivity fosters better alignment and coordination, paving the way for more intuitive human-robot collaboration.
|
| |
| 09:00-10:30, Paper TuI1I.331 | Add to My Program |
| The QuadSoft: Design, Construction, and Experimental Validation of a Soft and Actuated Quadrotor |
|
| Verdín Monzón, Rodolfo Isaac | Centro de Investigaciones en Óptica |
| Moreno Jimenez, Hugo Alberto | Center for Research in Optics |
| Spong, Mark | University of Texas at Dallas |
| Flores, Gerardo | Texas A&M Int. University |
Keywords: Soft Robot Materials and Design, Tendon/Wire Mechanism, Hardware-Software Integration in Robotics
Abstract: This paper presents QuadSoft, a novel fully actuated quadrotor equipped with continuous-curvature, tendon-driven soft robotic arms. The design combines a semi-rigid central frame with flexible arms, enabling controlled structural reconfiguration during flight without altering the propeller layout. Unlike existing soft aerial platforms that rely on discrete bending joints, QuadSoft utilizes a continuum deformation approach to modulate arm curvature, actively adjusting its thrust vector and aerodynamic characteristics. We characterize the geometric mapping between servomotor input and the resulting constant curvature, validating it experimentally. Outdoor flight tests demonstrate stable take-off, hover, directional maneuvers, and landing, confirming that controlled arm bending can generate horizontal displacement while preserving altitude. Measurements of pitch, roll, and curvature angles show that the platform follows intended actuation patterns with minimal attitude deviations. These results demonstrate that QuadSoft preserves the baseline stability of rigid quadrotors while enabling morphology-driven maneuverability, all under the standard PX4 autopilot without retuning. Beyond a proof of concept, this work establishes a distinctive outdoor validation of a tendon-driven continuum morphing quadrotor, opening a new research avenue toward adaptive aerial systems that combine the safety and versatility of soft robotics with the performance of conventional UAVs.
|
| |
| 09:00-10:30, Paper TuI1I.332 | Add to My Program |
| RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models |
|
| Koo, Jiyeon | Gachon University |
| Cho, Taewan | Gachon University |
| Kang, Hyunjoon | Gachon University |
| Pyo, Eunseom | Gachon University |
| Oh, Taegyun | Gachon University |
| Kim, Taeryang | Gachon University |
| Choi, Andrew Jaeyong | Gachon University |
Keywords: Deep Learning in Grasping and Manipulation, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Vision-Language-Action (VLA) models have demonstrated robust performance across diverse robotic tasks. However, their high memory and computational demands often limit real-time deployment. While existing model compression techniques reduce the parameter footprint, they often degrade 3D spatial reasoning and scene layout understanding. This work introduces RetoVLA, an architecture designed to maintain spatial awareness in lightweight models by repurposing Register Tokens—learnable parameters originally introduced to mitigate attention artifacts in Vision Transformers. While these tokens are generally discarded once used, we repurpose them for their dense representation of global spatial context. RetoVLA integrates these recycled tokens directly into the action-planning module through a dedicated spatial context injection path. Our proposed design enables the recovery of global context without increasing the total parameter count. Real-world experiments using a 7-DOF manipulator show a 17.1%p improvement in average success rates over the baseline. Our results demonstrate that leveraging internal register tokens provides a highly effective mechanism for developing efficient, spatially-aware robotic agents. A video demonstration is available at: https://youtu.be/2CseBR-snZg
|
| |
| 09:00-10:30, Paper TuI1I.333 | Add to My Program |
| Learning Constraint-Aware Dynamical Systems from Human Demonstrations for Constrained Manipulation Tasks |
|
| Sung, Soyoun | POSTECH, Pohang University of Science and Technology |
| Kim, Keehoon | POSTECH, Pohang University of Science and Technology |
Keywords: Learning from Demonstration, Imitation Learning, Machine Learning for Robot Control
Abstract: Learning from demonstration (LfD) enables robots to acquire new skills from human examples without explicit programming. Dynamical system (DS)-based approaches, in particular, have shown robustness to disturbances and adaptability in unstructured environments. However, existing methods often fail to incorporate task-specific constraints—such as grasp locations, execution starting points, or motion restrictions—that are critical for reliable execution. This limitation becomes especially problematic in tool-use scenarios, where both the environment and the grasped tool impose strict restrictions on feasible motions. To address this challenge, we propose a novel constraint-aware DS framework that automatically extracts and encodes task-specific constraints directly from demonstration data. The key idea is that task-critical configurations, repeatedly observed across successful demonstrations, can be identified and modeled as essential regions for task success using Gaussian Process Regression. By embedding these constraints, the proposed method generates motions that remain robust to environmental variations and tool-induced limitations. Experiments with a 7-DoF robotic manipulator demonstrate that our framework significantly improves task success rates over state-of-the-art methods. Real-world evaluations on daily-life tasks, such as dishware collection, further confirm its practicality and potential for real-world robotic applications.
|
| |
| 09:00-10:30, Paper TuI1I.334 | Add to My Program |
| Enhancing Pose Estimation Stability and Accuracy for Fiducial Markers Using Transparent Cylinders |
|
| Tanaka, Hideyuki | National Institute of AIST |
| Ogata, Kunihiro | National Institute of Advanced Industrial Science and Technology (AIST) |
|
|
| |
| 09:00-10:30, Paper TuI1I.335 | Add to My Program |
| MemOcc: Hierarchical Memory for Indoor Continuous Occupancy Mapping |
|
| YIRong, Yang | Beihang University |
| Yuxin, Lin | Beihang University |
| Longteng, Guo | Institute of Automation of the Chinese Academy of Sciences |
| Li, Song | Horizon Robotics |
| Qunbo, Wang | Institute of Automation, Chinese Academy of Sciences |
| Ming-Ming, Yu | Beihang University |
| Wenjun, Wu | Beihang University |
| Jing, Liu | Institute of Automation, Chinese Academy of Science |
Keywords: Semantic Scene Understanding, AI-Enabled Robotics, Learning from Experience
Abstract: Indoor 3D occupancy mapping, crucial for robotic perception, struggles with occlusions and reappearing surfaces in continuous observations. Existing methods either fuse frames without discernment, causing occlusion-induced errors to persist and contaminate global representations, or recompute scenes from scratch, sacrificing efficiency and stability. To address these challenges, we propose MemOcc, a novel memory-augmented framework for continuous occupancy mapping using read–write–retrieve operations. MemOcc employs a hierarchical memory design with cooperative short- and long-term tiers. Its Short-Term Memory Cache module uses visibility-gated writes and confidence maps to stabilize voxel predictions and filter occlusion noise, while the Long-Term Memory Bank stores scene priors for rapid retrieval, accelerating convergence in revisited regions. As a plug-and-play module, MemOcc integrates seamlessly with existing 2D-to-3D pipelines without altering backbones or training. Experiments on indoor benchmarks demonstrate that MemOcc reduces error propagation by 25% and improves mapping speed over state-of-the-art methods, achieving robust, real-time performance. By selectively retaining reliable evidence and enabling efficient retrieval, MemOcc paves the way for scalable indoor perception in robotics and augmented reality.
|
| |
| 09:00-10:30, Paper TuI1I.336 | Add to My Program |
| LLM Trainer: Automated Robotic Data Generating Via Demonstration Augmentation Using LLMs |
|
| George, Abraham | Carnegie Mellon University |
| Barati Farimani, Amir | Carnegie Mellon University |
Keywords: Data Sets for Robot Learning, Learning from Demonstration, Integrated Planning and Learning
Abstract: We present LLM Trainer, a fully automated pipeline that leverages the world knowledge of Large Language Models (LLMs) to transform a small number of human demonstrations (as few as one) into a large robot dataset for imitation learning. Our approach decomposes demonstration generation into two steps: (1) offline demonstration annotation that extracts keyframes, salient objects, and pose–object relations; and (2) online keypose retargeting that adapts those keyframes to a new scene, given an initial observation. Using these modified keypoints, our system warps the original demonstration to generate a new trajectory, which is then executed, and the resulting demo, if successful, is saved. Because the annotation is reusable across scenes, we use Thompson sampling to optimize the annotation, significantly improving generation success rate. We evaluate our method on a range of tasks, and find that our data annotation method consistently outperforms expert-engineered baselines. We further show an ensemble policy that combines the optimized LLM feed-forward plan with a learned feedback imitation learning controller. Finally, we demonstrate hardware feasibility on a Franka Emika Panda robot.
|
| |
| 09:00-10:30, Paper TuI1I.337 | Add to My Program |
| Trustworthy Delayed Teleoperation Via an Imperfect Regolith Model |
|
| Louca, Joe | University of Bristol |
| Vrublevskis, John | Thales Alenia Space (UK) |
| Eder, Kerstin | University of Bristol |
| Tzemanaki, Antonia | University of Bristol |
Keywords: Space Robotics and Automation, Telerobotics and Teleoperation, Acceptability and Trust
Abstract: Long-distance teleoperation will enable forthcoming scientific and commercial developments on the lunar surface such as in-situ resource utilisation. However, the large distances involved in these applications introduce multi-second signal delays, which may impair user performance and lead to reduced trust in the system. This work presents a user study of 26 participants exploring the impact of open-loop model-mediated teleoperation (MMT) in providing real-time feedback alongside a delayed video stream of the remote regolith simulant sample collection task. In this system, an imperfect but computationally efficient model was employed to visuo-haptically render the simulant. Three conditions were examined: MMT with visual feedback, MMT with visuo-haptic feedback, and direct teleoperation with delayed visual feedback. Users reported greater trust scores in the visual and visuo-haptic MMT conditions (+13%, +12%, respectively) compared with delayed direct teleoperation. In addition, they demonstrated more trusting behaviour in the MMT conditions by reducing the duration of ‘wait’ periods. Performance metrics were also improved in the MMT conditions (faster completion time), although no significant differences were observed between the two MMT feedback types. These results suggest that, despite using an approximate representation of a complex environment, MMT is a valuable tool for improving performance and developing trust in delayed teleoperation systems.
|
| |
| 09:00-10:30, Paper TuI1I.338 | Add to My Program |
| Non-Motorized Hand Exoskeleton for Rescue and Beyond: Substantially Elevating Grip Endurance and Strength |
|
| Mai, Xianlong | University of Science and Technology of China |
| Yang, Jian | Anhui University, Hefei, Anhui, 230601, China |
| Li, Lei | University of Science and Technology of China |
| Zi, Bin | Hefei University of Technology |
| Zhang, Shiwu | University of Science and Technology of China |
| Gong, Xinglong | University of Science and Technology of China |
| Li, Weihua | University of Wollongong |
| Yun, Guolin | University of Cambridge |
| Sun, Shuaishuai | University of Science and Technology of China |
Keywords: Wearable Robots, Prosthetics and Exoskeletons, Magnetorheological technology, Human Performance Augmentation
Abstract: Robotic hand exoskeletons hold immense potential for enhancing human hand functionality, addressing the hand’s strength limitations and fatigue during physically demanding tasks. However, most existing hand exoskeletons are motorized and weak in generating the high supporting forces needed for gripping augmentation. We present a nonmotorized hand exoskeleton based on magnetorheological (MR) actuators to provide high gripping support and elevate grip endurance. Meanwhile, it ingeniously harnesses human energy for actuation and energy storage, enhancing grip strength without external power. The MR actuator demonstrates a peak holding force of 1046 N with merely 5 W power input, boasting a force-to-power ratio one order of magnitude higher than conventional approaches, and a 97.7% energy reduction for the same holding force compared to other approaches. Participants wearing the hand exoskeletons experience a 41.8% enhancement in grip strength without external power and reduced hand muscle fatigue during prolonged physical labor. In rescue scenarios such as post-earthquake rescue, debris clearance, and casualty evacuation, our exoskeleton effectively supports gripping and improves working efficiency.
|
| |
| 09:00-10:30, Paper TuI1I.339 | Add to My Program |
| Fast Contact Detection Via Fusion of Joint and Inertial Sensors for Parallel Robots in Human-Robot Collaboration |
|
| Mohammad, Aran | Leibniz University Hannover |
| Piosik, Jan | Leibniz Universität Hannover |
| Lehmann, Dustin | TU Berlin |
| Seel, Thomas | Leibniz Universität Hannover |
| Schappler, Moritz | Institute of Mechatronic Systems, Leibniz Universitaet Hannover |
Keywords: Safety in HRI, Parallel Robots, Sensor Fusion
Abstract: Fast contact detection is crucial for safe human-robot collaboration. Observers based on proprioceptive information can be used for contact detection but have first-order error dynamics, which results in delays. Sensor fusion based on inertial measurement units (IMUs) consisting of accelerometers and gyroscopes is advantageous for reducing delays. The acceleration estimation enables the direct calculation of external forces. For serial robots, the installation of multiple accelerometers and gyroscopes is required for dynamics modeling since the joint coordinates are the minimal coordinates. Alternatively, parallel robots (PRs) offer the potential to use only one IMU on the end-effector platform, which already presents the minimal coordinates of the PR. This work introduces a sensor-fusion method for contact detection using encoders and only one low-cost, consumer-grade IMU for a PR. The end-effector accelerations are estimated by an extended Kalman filter and incorporated into the dynamics to calculate external forces. In real-world experiments with a planar PR, we demonstrate that this approach reduces the detection duration by up to 50% compared to a momentum observer and enables collision and clamping detection within 3–39 ms.
|
| |
| 09:00-10:30, Paper TuI1I.340 | Add to My Program |
| Event-Frame-Inertial Odometry Using Point and Line Features Based on Coarse-To-Fine Motion Compensation |
|
| Choi, Byeongpil | Seoul National University |
| Lee, Hanyeol | Seoul National University |
| Park, Chan Gook | Seoul National University |
Keywords: Localization, Visual-Inertial SLAM, Vision-Based Navigation
Abstract: An event camera is a vision sensor that captures pixel-level brightness changes and outputs this information as asynchronous events. These events are primarily generated from geometric structures such as edges, which are sensitive to variations in brightness. In this letter, we aim to leverage line structure information alongside point features to enhance the robustness and accuracy of localization in indoor or human-made environments. To obtain precise line measurements from events, we propose a novel line detection method that incorporates a coarse-to-fine motion compensation scheme, which generates highly sharp event frames. The extracted line features are paired with point features, eliminating the need for traditional line descriptors. Finally, the event features are effectively fused with frame-based point features within a multi-state constraint Kalman filter-based backend, fully exploiting the complementary advantages of both sensors. The performance of the proposed method is verified through an author-constructed experiment and two public datasets, demonstrating improved accuracy in line detection and pose estimation.
|
| |
| 09:00-10:30, Paper TuI1I.341 | Add to My Program |
| STEM: A Soft Tactile Electromagnetic Actuator for Multimodal Haptic Feedback in Virtual Environments |
|
| Mun, Heeju | Korea Advanced Institute of Science and Technology |
| Jeong, Seung Mo | Korea Advanced Institute of Science and Technology |
| Lim, Sein | Korea Advanced Institute of Science & Technology (KAIST) |
| Jung, Seunggyeom | Korea Advanced Institute of Science and Technology |
| Kyung, Ki-Uk | Korea Advanced Institute of Science & Technology (KAIST) |
Keywords: Haptics and Haptic Interfaces, Virtual Reality and Interfaces, Wearable Robotics
Abstract: This study introduces the soft tactile electromagnetic (STEM) actuator, a compact and wearable haptic device designed to deliver multimodal tactile feedback in virtual environments. The actuator employs soft materials as both an energy-storing and encasing structure, enabling out-of-plane deformations in response to arbitrary input signals while ensuring high wearability. Magnetic reinforcements, including a soft magnetic cap and a ferromagnetic pole piece, minimize magnetic flux leakage, effectively amplifying output force along with protrusion to enable precise and varied haptic feedback. The actuator generates multimodal tactile stimuli, including force, impulse, and vibration, surpassing conventional vibrotactile devices in delivering more varied and dynamic feedback. Experimental evaluation of the actuator's mechanical performance demonstrates its ability to produce both low- and high-frequency tactile feedback. A user study evaluating perception thresholds and signal recognition accuracy found that participants identified eight distinct tactile signals with an average accuracy of 91%, confirming the actuator’s capacity to deliver distinguishable multimodal feedback. These findings underscore the feasibility of the STEM actuator for immersive haptic interactions and highlight its potential applications in virtual reality.
|
| |
| 09:00-10:30, Paper TuI1I.342 | Add to My Program |
| A Bayesian Modeling Framework for Estimation and Ground Segmentation of Cluttered Staircases |
|
| Sriganesh, Prasanna | Carnegie Mellon University |
| Shirose, Burhanuddin | Carnegie Mellon University |
| Travers, Matthew | Carnegie Mellon University |
Keywords: Object Detection, Segmentation and Categorization, Probabilistic Inference, Field Robots
Abstract: Autonomous robot navigation in complex environments requires robust perception as well as high-level scene understanding due to perceptual challenges, such as occlusions, and uncertainty introduced by robot movement. For example, a robot climbing a cluttered staircase can misinterpret clutter as a step, misrepresenting the state and compromising safety. This requires robust state estimation methods capable of inferring the underlying structure of the environment even from incomplete sensor data. In this paper, we introduce a novel method for robust state estimation of staircases. To address the challenge of perceiving occluded staircases extending beyond the robot's field-of-view, our approach combines an infinite-width staircase representation with a finite endpoint state to capture the overall staircase structure. This representation is integrated into a Bayesian inference framework to fuse noisy measurements enabling accurate estimation of staircase location even with partial observations and occlusions. Additionally, we present a segmentation algorithm that works in conjunction with the staircase estimation pipeline to accurately identify clutter-free regions on a staircase. Our method is extensively evaluated on a real robot across diverse staircases, demonstrating significant improvements in estimation accuracy and segmentation performance compared to baseline approaches.
|
| |
| 09:00-10:30, Paper TuI1I.343 | Add to My Program |
| Remote Awareness of Image Quality for Multiweek Shore-Launched AUV Surveys (I) |
|
| Bodenmann, Adrian | University of Southampton |
| Jones, Daniel O. B. | National Oceanography Centre |
| Phillips, Alexander | National Oceanography Centre |
| Templeton, Robert | National Oceanography Centre |
| Sherif, Rashiid | National Oceanography Centre |
| Fanelli, Francesco | National Oceanography Centre |
| Newborough, Darryl | Sonardyne International |
| Thornton, Blair | University of Southampton |
Keywords: Marine Robotics, Human Factors and Human-in-the-Loop, Environment Monitoring and Management
Abstract: Visual seafloor imaging using autonomous underwater vehicles (AUVs) has become an established method for seafloor mapping and monitoring. With AUVs now achieving multiweek endurance and several hundred kilometers of range on a single charge, image quality assessment (IQA) on-board vehicles in the field is necessary for robust data acquisition given the sensitivity of underwater imaging surveys to environmental conditions. This research develops a metric to assess seafloor image quality in situ, and demonstrates its use for quality assurance during a 21-day, shore-launched AUV campaign that visited three sites up to 170 km from shore. The metric was transmitted via satellite communication along with vehicle telemetry to shore-based AUV operators during regular surfacing intervals without relying on vehicle recovery. The method was implemented on the seafloor laser scan and strobed imaging system BioCam, deployed on the Autosub Long Range (ALR) AUV (also known as Boaty McBoatface) in the North Sea. Several tens of hectares of seafloor imagery were collected, and image quality scores were transmitted. This information was used to retask the AUV and maximize the quality of acquired images within operational constraints. Data products generated from the collected imagery show the improvements achieved that would otherwise have been missed. This highlights the importance of remote awareness of data quality to facilitate longer consecutive mapping missions without vehicle recovery.
|
| |
| 09:00-10:30, Paper TuI1I.344 | Add to My Program |
| Opt2Skill: Imitating Dynamically-Feasible Whole-Body Trajectories for Versatile Humanoid Loco-Manipulation |
|
| Liu, Fukang | Georgia Institute of Technology |
| Gu, Zhaoyuan | Georgia Institute of Technology |
| Cai, Yilin | Georgia Institute of Technology |
| Zhou, Ziyi | Georgia Institute of Technology |
| Jung, Hyunyoung | Georgia Institute of Technology |
| Jang, Jaehwi | Georgia Institute of Technology |
| Zhao, Shijie | Georgia Institute of Technology |
| Ha, Sehoon | Georgia Institute of Technology |
| Chen, Yue | Georgia Institute of Technology |
| Xu, Danfei | Georgia Institute of Technology |
| Zhao, Ye | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Whole-Body Motion Planning and Control, Reinforcement Learning
Abstract: Humanoid robots are designed to perform diverse loco-manipulation tasks. However, they face challenges due to their high-dimensional and unstable dynamics, as well as the complex contact-rich nature of the tasks. Model-based optimal control methods offer flexibility to define precise motion but are limited by high computational complexity and accurate contact sensing. On the other hand, reinforcement learning (RL) handles high-dimensional spaces with strong robustness but suffers from inefficient learning, unnatural motion, and sim-to-real gaps. To address these challenges, we introduce Opt2Skill, an end-to-end pipeline that combines model-based trajectory optimization with RL to achieve robust whole-body loco-manipulation. Opt2Skill generates dynamically feasible and contact-consistent reference motions for the Digit humanoid robot using differential dynamic programming (DDP) and trains RL policies to track these optimal trajectories. Our results demonstrate that Opt2Skill outperforms baselines that rely on human demonstrations and inverse kinematics-based references, both in motion tracking and task success rates. Furthermore, we show that incorporating trajectories with torque information improves contact force tracking in contact-involved tasks, such as wiping a table. We have successfully transferred our approach to real-world applications.
|
| |
| 09:00-10:30, Paper TuI1I.345 | Add to My Program |
| Structure-Exploiting Sequential Quadratic Programming for Model-Predictive Control |
|
| Jordana, Armand | New York University |
| Kleff, Sebastien | Inria Center at the University of Bordeaux |
| Meduri, Avadesh | New York University |
| Carpentier, Justin | INRIA |
| Mansard, Nicolas | CNRS |
| Righetti, Ludovic | New York University |
Keywords: Optimization and Optimal Control, Motion and Path Planning, Reactive and Sensor-Based Planning, Model-Predictive Control
Abstract: The promise of model-predictive control (MPC) in robotics has led to extensive development of efficient numerical optimal control solvers in line with differential dynamic programming, which exploits the sparsity induced by time. In this work, we argue that this effervescence has hidden the fact that sparsity can be equally exploited by standard nonlinear optimization. In particular, we show how a tailored implementation of sequential quadratic programming (SQP) achieves state-of-the-art MPC. Then, we clarify the connections between popular algorithms from the robotics community and well-established optimization techniques. Further, the sequential quadratic program formulation naturally encompasses the constrained case, a notoriously difficult problem in the robotics community. Specifically, we show that it only requires a sparsity-exploiting implementation of a state-of-the-art quadratic programming (QP) solver. We illustrate the validity of this approach in a comparative study and experiments on a torque-controlled manipulator. To the best of our knowledge, this is the first demonstration of closed-loop nonlinear MPC with constraints on a real robot.
|
| |
| 09:00-10:30, Paper TuI1I.346 | Add to My Program |
| Depth Transfer: Learning to See Like a Simulator for Real-World Drone Navigation |
|
| Yu, Hang | Delft University of Technology |
| De Wagter, Christophe | Delft University of Technology |
| de Croon, Guido | TU Delft |
Keywords: Aerial Systems: Perception and Autonomy, Collision Avoidance, Vision-Based Navigation
Abstract: Sim-to-real transfer is a fundamental challenge in robot learning. Discrepancies between simulation and reality can significantly impair policy performance, especially if it receives high-dimensional inputs such as dense depth estimates from vision. We propose a novel depth transfer method based on domain adaptation to bridge the visual gap between simulated and real-world depth data. A Variational Autoencoder (VAE) is first trained to encode ground-truth depth images from simulation into a latent space, which serves as input to a reinforcement learning (RL) policy. During deployment, the encoder is refined to align stereo depth images with this latent space, enabling direct policy transfer without fine-tuning. We apply our method to the task of autonomous drone navigation through cluttered environments. Experiments in IsaacGym show that our method nearly doubles the obstacle avoidance success rate when switching from ground-truth to stereo depth input. Furthermore, we demonstrate successful transfer to the photo-realistic simulator AvoidBench using only IsaacGym-generated stereo data, achieving superior performance compared to state-of-the-art baselines. Real-world evaluations in both indoor and outdoor environments confirm the effectiveness of our approach, enabling robust and generalizable depth-based navigation across diverse domains.
|
| |
| 09:00-10:30, Paper TuI1I.347 | Add to My Program |
| Contact-Aware Safety in Soft Robots Using High-Order Control Barrier and Lyapunov Functions |
|
| Wong, Kiwan | MIT |
| Stölzle, Maximilian | Disney Research |
| Xiao, Wei | MIT |
| Della Santina, Cosimo | TU Delft |
| Rus, Daniela | MIT |
| Zardini, Gioele | Massachusetts Institute of Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Robot Safety, Soft Robot Applications
Abstract: Robots operating alongside people, particularly in sensitive scenarios such as aiding the elderly with daily tasks or collaborating with workers in manufacturing, must guarantee safety and cultivate user trust. Continuum soft manipulators promise safety through material compliance, but as designs evolve for greater precision, payload capacity, and speed, and increasingly incorporate rigid elements, their injury risk resurfaces. In this letter, we introduce a comprehensive High-Order Control Barrier Function (HOCBF) + High-Order Control Lyapunov Function (HOCLF) framework that enforces strict contact force limits across the entire soft-robot body during environmental interactions. Our approach combines a differentiable Piecewise Cosserat-Segment (PCS) dynamics model with a convex-polygon distance approximation metric, named Differentiable Conservative Separating Axis Theorem (DCSAT), based on the soft robot geometry to enable real-time, whole-body collision detection, resolution, and enforcement of the safety constraints. By embedding HOCBFs into our optimization routine, we guarantee safety, allowing, for instance, safe navigation in operational space under HOCLF-driven motion objectives. Extensive planar simulations demonstrate that our method maintains safety-bounded contacts while achieving precise shape and task-space regulation. This work thus lays a foundation for the deployment of soft robots in human-centric environments with provable safety and performance.
|
| |
| 09:00-10:30, Paper TuI1I.348 | Add to My Program |
| ViTacGen: Robotic Pushing with Vision-To-Touch Generation |
|
| Wu, Zhiyuan | King's College London |
| Lin, Yijiong | University of Bristol |
| Zhao, Yongqiang | King's College London |
| Zhang, Xuyang | King's College London |
| Chen, Zhuo | King's College London |
| Lepora, Nathan | University of Bristol |
| Luo, Shan | King's College London |
Keywords: Force and Tactile Sensing, Deep Learning in Grasping and Manipulation
Abstract: Robotic pushing is a fundamental manipulation task that requires tactile feedback to capture subtle contact forces and dynamics between the end-effector and the object. However, real tactile sensors often face hardware limitations and deployment challenges, while vision-only policies struggle with satisfactory performance. Inspired by humans' ability to infer tactile states from vision, we propose ViTacGen, a novel robot manipulation framework designed for visual robotic pushing with vision-to-touch generation in reinforcement learning to eliminate the reliance on high-resolution real tactile sensors, enabling effective zero-shot deployment on visual-only robotic systems. Specifically, ViTacGen consists of an encoder-decoder vision-to-touch generation network that generates contact depth images, a standardized tactile representation, directly from visual image sequences, followed by a reinforcement learning policy that fuses visual-tactile data with contrastive learning based on visual and generated tactile observations. We validate the effectiveness of our approach in both simulation and real-world experiments, demonstrating its superior performance and achieving a success rate of up to 86%. Code and data will be open-sourced once the paper is accepted.
|
| |
| 09:00-10:30, Paper TuI1I.349 | Add to My Program |
| Robust Unsupervised Domain Adaptation for 3D Point Cloud Segmentation under Source Adversarial Attacks |
|
| Li, Haosheng | Southern University of Science and Technology |
| Chen, Junjie | Southern University of Science and Technology |
| Xu, Yuecong | National University of Singapore |
| Ding, Kemi | Southern University of Science and Technology |
Keywords: Transfer Learning, Semantic Scene Understanding, Object Detection, Segmentation and Categorization
Abstract: Unsupervised domain adaptation (UDA) frameworks have shown good generalization capabilities for 3D point cloud semantic segmentation models on clean data. However, existing works overlook adversarial robustness when the source domain itself is compromised. To comprehensively explore the robustness of the UDA frameworks, we first design a stealthy adversarial point cloud generation attack that can significantly contaminate datasets with only minor perturbations to the point cloud surface. Based on that, we propose a novel dataset, AdvSynLiDAR, comprising synthesized contaminated LiDAR point clouds. With the generated corrupted data, we further develop the Adversarial Adaptation Framework as the countermeasure. Specifically, by extending the key point sensitive loss towards the Robust Long-Tailed loss and utilizing a decoder branch, our approach enables the model to focus on long-tailed classes during the pre-training phase and leverages high-confidence decoded point cloud information to restore point cloud structures during the adaptation phase. We evaluated our AAF method on the AdvSynLiDAR dataset, where the results demonstrate that our AAF method can mitigate performance degradation under source adversarial perturbations for UDA in the 3D point cloud segmentation application.
|
| |
| 09:00-10:30, Paper TuI1I.350 | Add to My Program |
| Perfectly Undetectable Reflection and Scaling False Data Injection Attacks Via Affine Transformation on Mobile Robot Trajectory Tracking Control |
|
| Ueda, Jun | Georgia Institute of Technology |
| Kwon, Hyukbin | Georgia Institute of Technology |
Keywords: Networked Robots, Telerobotics and Teleoperation, Nonholonomic Mechanisms and Systems, False Data Injection Attack
Abstract: With the increasing integration of cyber-physical systems (CPS) into critical applications, ensuring their resilience against cyberattacks is paramount. A particularly concerning threat is the vulnerability of CPS to deceptive attacks that degrade system performance while remaining undetected. This paper investigates perfectly undetectable false data injection attacks (FDIAs) targeting the trajectory tracking control of a non-holonomic mobile robot. The proposed attack method utilizes affine transformations of intercepted signals, exploiting weaknesses inherent in the partially linear dynamic properties and symmetry of the nonlinear plant. The feasibility and potential impact of these attacks are validated through experiments using a Turtlebot 3 platform, highlighting the urgent need for sophisticated detection mechanisms and resilient control strategies to safeguard CPS against such threats. Furthermore, a novel approach for detecting these attacks, called the state monitoring signature function (SMSF), is introduced. An example SMSF, a carefully designed function resilient to FDIAs, is shown to be able to detect the presence of an FDIA through signatures based on system states.
|
| |
| 09:00-10:30, Paper TuI1I.351 | Add to My Program |
| Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval |
|
| Buoso, Davide | Politecnico Di Torino |
| Robinson, Luke | University of Oxford |
| Averta, Giuseppe | Politecnico Di Torino |
| Torr, Philip | University of Oxford |
| Franzmeyer, Tim | University of Oxford |
| De Martini, Daniele | University of Oxford |
Keywords: Motion and Path Planning, Vision-Based Navigation, Autonomous Agents
Abstract: We introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning that leverages off-the-shelf Vision-Language Models (VLMs) for autonomous navigation. Unlike most learning-based approaches that require extensive task-specific training and large-scale data collection, S2P overcomes the need for fine-tuning by adapting inputs to align with the VLM’s pretraining data. Our method achieves this through a combination of structured Visual Question Answering (VQA) to ground action selection on the image, and In-Context Learning (ICL) to exploit knowledge drawn from relevant examples from a memory bank of (visually) annotated data, which can include diverse, in-the-wild sources. We demonstrate S2P's flexibility by evaluating it in both First-Person View (FPV) and Third-Person View (TPV) navigation. S2P improves the performance of a baseline VLM by 40% in TPV and surpasses end-to-end trained models by approximately 24% in FPV when tasked with navigating towards unseen objects in novel scenes. These results highlight the adaptability, simplicity, and effectiveness of our training-free approach, demonstrating that the use of pre-trained VLMs with structured memory retrieval enables robust high-level robot planning without costly task-specific training. Our experiments also show that retrieving samples from heterogeneous data sources, including online videos of different robots or humans walking, is highly beneficial for navigation. Notably, our method effectively generalizes to novel scenarios, requiring only a handful of demonstrations. Project Page: lambdavi.github.io/select2plan
|
| |
| 09:00-10:30, Paper TuI1I.352 | Add to My Program |
| STAF-Navi: Vision-Based Spatio-Temporal Attention Fusion Navigation Framework |
|
| Zhang, Haowen | Wuhan University |
| Liu, Fanghong | Wuhan University |
| Zhang, Chaoyu | Wuhan University |
| Yu, Qiuze | Wuhan University |
Keywords: Vision-Based Navigation, Reinforcement Learning, Aerial Systems: Perception and Autonomy
Abstract: In cluttered, unknown, and partially observable environments, Unmanned Aerial Vehicle (UAV) navigation encounters formidable challenges. To address these challenges, we propose an innovative spatio-temporal attention fusion navigation framework called STAF-Navi. The framework integrates spatio-temporal attention mechanisms to model sequential dependencies. It captures spatial and temporal correlations from historical observations and actions to improve navigation and obstacle avoidance. STAF-Navi employs deep collision encoding to compress high-dimensional depth images into informative low-dimensional latent states, and a single-site Transformer to model historical sensor inputs and states, enhancing the utility of current observations. By exploiting temporal dependencies, this integration enables early braking and stable hovering. Extensive simulation experiments show that the framework increases the navigation success rate by 10% and improves path efficiency by 7%. Finally, the successful deployment of the proposed strategy in real-world scenarios validates its effectiveness.
|
| |
| 09:00-10:30, Paper TuI1I.353 | Add to My Program |
| Plug-And-Play Shape Matching Module for Zero-Shot Mesh-Free Grasp Refinement on Unknown Objects |
|
| Hong, Juyong | Sungkyunkwan University |
| Son, Yeong Gwang | Sungkyunkwan University |
| Um, Seunghwan | Sungkyunkwan University |
| Choi, Hyouk Ryeol | Sungkyunkwan University |
Keywords: Perception for Grasping and Manipulation, RGB-D Perception, Object Detection, Segmentation and Categorization
Abstract: Reliably grasping unknown objects in logistics automation remains a major challenge. While most approaches rely on 3D CAD models or large-scale training, their applicability to novel items is limited. This paper proposes a plug-and-play geometric refinement module that can be appended to any existing grasp planner. The module operates in a training-free and mesh-free manner, estimating an object's approximate centroid from a single RGB-D image to enhance grasp stability. Its core mechanism involves using an initial grasp candidate as an automatic prompt for segmentation, followed by geometric primitive fitting to the isolated object's point cloud. By rescoring grasp candidates based on proximity to the estimated centroid, our module improves physical stability. Experimental results demonstrate that our module improves the success rate of baseline grasp planners by up to 25 percentage points, enhancing real-world pick-and-place performance without requiring any offline training or prior object models.
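The rescoring step described above, ranking grasp candidates by their proximity to the estimated centroid, can be illustrated with a minimal sketch. The exponential distance penalty and the `alpha` trade-off parameter are hypothetical choices for illustration, not the paper's exact formula:

```python
import numpy as np

def rescore_by_centroid(grasp_positions, planner_scores, centroid, alpha=0.5):
    """Re-rank grasp candidates by proximity to the object's estimated
    centroid. `alpha` (hypothetical) trades the planner's own score
    against stability gained by grasping near the centroid."""
    dist = np.linalg.norm(grasp_positions - centroid, axis=1)
    # exponential penalty: candidates far from the centroid are down-weighted
    return planner_scores * np.exp(-alpha * dist)

# three candidates with equal planner scores; the one at the centroid wins
grasps = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.3, 0.2, 0.0]])
scores = rescore_by_centroid(grasps, np.ones(3), centroid=np.zeros(3))
```

Because the module only re-weights scores, it can wrap any planner's candidate list without retraining, which is what makes it plug-and-play.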
|
| |
| 09:00-10:30, Paper TuI1I.354 | Add to My Program |
| Hi-Drive: Hierarchical POMDP Planning for Safe Autonomous Driving in Diverse Urban Environments |
|
| Jin, Xuanjin | Shanghai Jiaotong University |
| Zeng, Chendong | Shanghai Jiao Tong University |
| Zhu, Shengfa | Sensetime |
| Liu, Chunxiao | Sensetime Research |
| Cai, Panpan | Shanghai Jiao Tong University |
Keywords: Planning under Uncertainty, Autonomous Vehicle Navigation, Path Planning for Multiple Mobile Robots or Agents
Abstract: Uncertainties in dynamic road environments pose significant challenges for behavior and trajectory planning in autonomous driving. This paper introduces Hi-Drive, a hierarchical planning algorithm addressing uncertainties at both behavior and trajectory levels using a hierarchical Partially Observable Markov Decision Process (POMDP) formulation. Hi-Drive employs driver models to represent uncertain behavioral intentions of other vehicles and uses their parameters to infer hidden driving styles. By treating driver models as high-level decision-making actions, our approach effectively manages the exponential complexity inherent in POMDPs. To further enhance safety and robustness, Hi-Drive integrates trajectory optimization based on importance sampling, refining trajectories through a comprehensive analysis of critical agents. Evaluations on real-world urban driving datasets demonstrate that Hi-Drive significantly outperforms state-of-the-art planning-based and learning-based methods across diverse urban driving situations.
|
| |
| 09:00-10:30, Paper TuI1I.355 | Add to My Program |
| Depth-Constrained ASV Navigation with Deep RL and Limited Sensing |
|
| Zhalehmehrabi, Amirhossein | University of Verona |
| Meli, Daniele | University of Verona |
| Dal Santo, Francesco | Universitá Di Verona |
| Trotti, Francesco | University of Verona |
| Farinelli, Alessandro | University of Verona |
Keywords: Reinforcement Learning, Planning under Uncertainty, Marine Robotics
Abstract: Autonomous Surface Vehicles (ASVs) play a crucial role in maritime operations, yet their navigation in shallow-water environments remains challenging due to dynamic disturbances and depth constraints. Traditional navigation strategies struggle with limited sensor information, making safe and efficient navigation difficult. In this paper, we propose a reinforcement learning (RL) framework for ASV navigation under depth constraints, where the vehicle must reach a target while avoiding unsafe areas with only a single depth measurement per timestep from a downward-facing Single Beam Echosounder (SBES). To enhance environmental awareness, we integrate Gaussian Process (GP) regression into the RL framework, enabling the agent to progressively estimate a bathymetric depth map from sparse sonar readings. This approach improves decision-making by providing a richer representation of the environment. Furthermore, we demonstrate effective sim-to-real transfer, ensuring that policies generalize well to real-world aquatic conditions. Experimental results validate our method’s capability to improve ASV navigation performance while maintaining safety in challenging shallow-water environments. The code is available at https://github.com/Isla-lab/depth-constrained-aquatic-navigation
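The core idea of fusing sparse single-beam echosounder readings into a depth map via Gaussian Process regression can be illustrated with a minimal 1D sketch. This uses a standard RBF-kernel GP; the readings, hyperparameters, and 1D setting are all hypothetical (the paper's bathymetric map is over 2D positions):

```python
import numpy as np

def gp_predict(x_train, y_train, x_query, length=2.0, sigma_f=1.0, noise=0.05):
    """Exact GP regression with an RBF kernel: predict depth (mean and
    variance) at query positions from sparse echosounder readings."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return sigma_f**2 * np.exp(-0.5 * (d / length) ** 2)

    K = k(x_train, x_train) + noise**2 * np.eye(len(x_train))
    Ks = k(x_query, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    # posterior variance: high far from measurements, low near them,
    # which is exactly the environmental awareness the RL agent exploits
    var = sigma_f**2 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, var

# sparse depth samples along the vehicle's path (hypothetical values, metres)
x = np.array([0.0, 1.0, 3.0, 4.0])
d = np.array([2.0, 2.2, 1.1, 0.9])
mean, var = gp_predict(x, d, np.linspace(0.0, 4.0, 9))
```

The predictive variance is the useful by-product here: it tells the agent where the bathymetry estimate is still uncertain, which supports safer decisions near potentially shallow areas.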
|
| |
| 09:00-10:30, Paper TuI1I.356 | Add to My Program |
| Keypoint Semantic Integration for Improved Feature Matching in Outdoor Agricultural Environments |
|
| de Silva, Rajitha | University of Lincoln |
| Swindell, Jacob | University of Lincoln |
| Cox, Jonathan | University of Lincoln |
| Popovic, Marija | TU Delft |
| Cadena, Cesar | ETH Zurich |
| Stachniss, Cyrill | University of Bonn |
| Polvara, Riccardo | University of Lincoln |
Keywords: Semantic Scene Understanding, View Planning for SLAM, Agricultural Automation
Abstract: Robust robot navigation in outdoor environments requires accurate perception systems capable of handling visual challenges such as repetitive structures and changing appearances. Visual feature matching is crucial to vision-based pipelines but remains particularly challenging in natural outdoor settings due to perceptual aliasing. We address this issue in vineyards, where repetitive vine trunks and other natural elements generate ambiguous descriptors that hinder reliable feature matching. We hypothesise that semantic information tied to keypoint positions can alleviate perceptual aliasing by enhancing keypoint descriptor distinctiveness. To this end, we introduce a keypoint semantic integration technique that improves the descriptors in semantically meaningful regions within the image, enabling more accurate differentiation even among visually similar local features. We validate this approach in two vineyard perception tasks: (i) relative pose estimation and (ii) visual localisation. Our method improves matching accuracy across all tested keypoint types and descriptors, demonstrating its effectiveness over multiple months in challenging vineyard conditions.
|
| |
| 09:00-10:30, Paper TuI1I.357 | Add to My Program |
| A Static Modelling and Evaluation Framework for Soft Continuum Robots with Reinforced Chambers |
|
| Shi, Jialei | Imperial College London |
| Jin, Hanyu | Carnegie Mellon University |
| Gaozhang, Wenlong | University College London |
| Shi, Ge | CSIRO |
| Abad Guaman, Sara Adela | University College London |
| Wurdemann, Helge Arne | University College London |
Keywords: Soft Robot Materials and Design, Modeling, Control, and Learning for Soft Robots, Soft Sensors and Actuators, Soft Continuum Robots
Abstract: Elastomer-based soft manipulators, featuring fibre-reinforced chambers, represent a prevalent design paradigm in the field of soft robotics. These robots incorporate multiple reinforced actuation chambers, enabling robust elongation and omni-directional bending motions. However, the inherent compliance of materials and the pressurised chambers inevitably introduce significant nonlinearity to these soft robots. Moreover, the design of such robots often relies on a trial-and-error approach. Consequently, a comprehensive robot prototyping framework is of paramount importance. To achieve this, we present a static modelling, design and evaluation framework for soft robots with densely reinforced chambers (i.e., the angle between the reinforcement fibre and the axial direction of soft robots is 90°). We first propose a static analytical modelling framework to achieve both the forward kinematics and tip force generation modelling of soft robots. This modelling framework accommodates the effects of pressurised chambers and (non)linear material behaviours. Furthermore, our design and evaluation framework incorporates an openly accessible simulation toolbox with a user-friendly graphical interface.
|
| |
| 09:00-10:30, Paper TuI1I.358 | Add to My Program |
| Freeze-Frame with StaticNeRF: Uncertainty-Guided NeRF Map Reconstruction in Dynamic Scenes |
|
| Lee, Juhui | Inha University |
| Yang, Geonmo | Inha University |
| Ma, Seungjun | Hyundai Motor Company |
| Cho, Younggun | Inha University |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Recent advances in neural representations have shown great promise for enabling high-fidelity dense mapping in robotics. Given the inherently dynamic nature of real-world environments, many studies have attempted to learn static scene representations from dynamic observations. However, existing methods often fail to remove subtly moving objects and struggle to accurately recover occluded static backgrounds, which leads to critical limitations in practice. Furthermore, when static neural maps are used for localization, dynamic content in query images must be handled effectively. To overcome these challenges, we propose a static neural mapping framework that is robust to diverse dynamic environments and capable of processing dynamic content during localization. We evaluated our approach through extensive experiments on both public and in-house datasets. Our method improves both dynamic object removal and localization robustness under dynamic conditions, and constitutes a significant step toward resilient robot navigation in real-world environments.
|
| |
| 09:00-10:30, Paper TuI1I.359 | Add to My Program |
| Neural Predictor for Flight Control with Payload |
|
| Jin, Ao | Northwestern Polytechnical University |
| Li, Chenhao | Northwestern Polytechnical University |
| Wang, Qinyi | Northwestern Polytechnical University |
| Liu, Ya | Northwestern Polytechnical University |
| Huang, Panfeng | Northwestern Polytechnical University |
| Zhang, Fan | Northwestern Polytechnical Univeristy |
Keywords: Aerial Systems: Mechanics and Control, Model Learning for Control, Machine Learning for Robot Control
Abstract: Aerial robots transporting suspended payloads in the form of freely-floating manipulators have attracted growing interest in recent years. However, the force/torque caused by the payload and residual dynamics introduces unmodeled perturbations to the aerial robot, which negatively affect closed-loop performance. Unlike estimation-based methods, this paper proposes Neural Predictor, a learning-based approach that models the force/torque induced by the payload and residual dynamics as a dynamical system. It yields a hybrid model that combines first-principles dynamics with the learned dynamics. The hybrid model is then integrated into an MPC framework to improve closed-loop performance. The effectiveness of the proposed framework is verified extensively in both numerical simulations and real-world flight experiments. The results indicate that our approach can accurately capture the force/torque caused by the suspended payload and residual dynamics, respond quickly to changes in them, and improve closed-loop performance significantly. In particular, Neural Predictor outperforms a state-of-the-art learning-based estimator, reducing force and torque estimation errors by up to 66.15% and 33.33% respectively while requiring fewer samples.
|
| |
| 09:00-10:30, Paper TuI1I.360 | Add to My Program |
| Torque-Bounded Task-Space Admittance Control for Redundant Manipulators |
|
| Kikuuwe, Ryo | Hiroshima University |
Keywords: Compliance and Impedance Control, Redundant Robots, Force Control, Singular Configuration
Abstract: This paper presents a task-space admittance controller applicable to redundant manipulators equipped with torque sensors. It extends Kikuuwe's (2019) torque-bounded admittance controller (TBAC), which allows for imposing explicit limits on the joint actuator torques without causing unsafe behaviors such as oscillation and overshoots. The proposed controller enforces the end-effector to follow predefined task-space dynamics as long as the joint torques are unsaturated and the configuration is away from singularities. The behavior in the nullspace, which arises from the redundant degrees of freedom and singular configurations, is governed by predefined joint-space dynamics. The task-space and joint-space dynamics are combined through a newly proposed continualized pseudoinverse, which employs the singular value decomposition. Results of experiments using a seven-degree-of-freedom Kinova Gen3 robot illustrate the validity of the proposed admittance controller in various scenarios, including the case where the robot is fully stretched.
|
| |
| 09:00-10:30, Paper TuI1I.361 | Add to My Program |
| Development of Adaptive-Limb Transformable Robot for Portable and Replaceable End-Effectors with Compact Lock-Spin Mechanisms |
|
| Hirai, Jin | The University of Tokyo |
| Hiraoka, Takuma | The University of Tokyo |
| Konishi, Masanori | The University of Tokyo |
| Tada, Hiromi | The University of Tokyo |
| Kojima, Kunio | The University of Tokyo |
| Okada, Kei | The University of Tokyo |
Keywords: Legged Robots, Mechanism Design, Grippers and Other End-Effectors
Abstract: Transformable robots adapt to various environments by changing their shape or functionality. These robots can further expand their task range by replacing their end-effectors (EEs). In this paper, we propose an adaptive-limb transformable robot capable of replacing multiple types of mounted EEs. First, the 7-degree-of-freedom (DoF) limbs can reach multiple types of EEs mounted on the front body surface and replace them using a single limb, without relying on external devices. Second, we develop a compact Lock-Spin mechanism that integrates a locking mechanism into the rotor of the motor to enable continuous rotation. Experimental results demonstrate that the proposed transformable robot can replace EEs on-site and that this replacement enables locomotion and manipulation adapted to the environment.
|
| |
| 09:00-10:30, Paper TuI1I.362 | Add to My Program |
| Single-Instance Sampling for Computationally Efficient and Accurate Real-Time Task Space MPPI Control |
|
| Kim, Dongwhan | LG Electronics |
| Im, Euncheol | Korea Institute of Science and Technology |
| Kim, Yujin | Cornell University |
| Lim, Myo-Taeg | Korea University |
| Lee, Yisoo | Korea Institute of Science and Technology |
Keywords: Motion Control of Manipulators, Optimization and Optimal Control, Motion Control, Model Predictive Control
Abstract: This study presents a model predictive path integral (MPPI) method capable of conducting high-frequency real-time model predictive control (MPC) for robot manipulators. Real-time MPC-based manipulation holds significant potential for controlling an end-effector precisely and reactively while satisfying various constraints in dynamic environments. However, the optimization under a complex robot model and various constraints imposes a heavy computational burden, hindering the realization of high-frequency updates. To address this challenge, we propose a single-instance sampling-based MPPI algorithm and dynamic time horizon to significantly reduce the computational burden while enhancing control performance. The performance and efficacy of the proposed method are verified through experiments conducted on a 7-degree-of-freedom robotic arm, along with comparative simulations and analysis.
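Independent of the single-instance sampling variant proposed in this paper, the generic MPPI update that such methods build on is compact: sample noisy control rollouts, score each rollout's cost, and blend the sampled noise into the nominal control sequence with softmin weights. A minimal NumPy sketch, with all shapes, costs, and the temperature `lam` chosen purely for illustration:

```python
import numpy as np

def mppi_update(u_nom, noise, costs, lam=1.0):
    """One MPPI step: exponentially weight K sampled rollouts by cost
    and shift the nominal controls toward the low-cost samples.
    u_nom: (T,) nominal controls; noise: (K, T) sampled perturbations;
    costs: (K,) rollout costs."""
    beta = costs.min()                 # shift for numerical stability
    w = np.exp(-(costs - beta) / lam)
    w /= w.sum()                       # softmin weights, sum to 1
    return u_nom + w @ noise           # weighted average of perturbations

rng = np.random.default_rng(0)
u = np.zeros(5)
eps = rng.normal(size=(8, 5))
costs = np.array([5.0, 1.0, 4.0, 9.0, 2.0, 7.0, 3.0, 6.0])
u_new = mppi_update(u, eps, costs, lam=1.0)
```

As `lam` shrinks, the weights concentrate on the single cheapest rollout; the paper's computational concern comes from having to simulate many such rollouts per control cycle, which is what its single-instance sampling and dynamic horizon aim to reduce.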
|
| |
| 09:00-10:30, Paper TuI1I.363 | Add to My Program |
| Task-Parameterized Motion Learning with Time-Sensitive Constraints |
|
| Richter, Julian | Technical University of Braunschweig |
| Oliveira, João P. | Instituto Superior Técnico |
| Scheurer, Christian | KUKA Laboratories GmbH |
| Steil, Jochen J. | Technische Universität Braunschweig |
| Dehio, Niels | KUKA |
Keywords: Learning from Demonstration, Constrained Motion Planning
Abstract: Teaching motion skills to robots through demonstrations has become widely popular. However, precise execution of start-, via-, and end-poses at given times is often not guaranteed, limiting the technology transfer to industrial applications. To address this issue, we propose the novel Constrained Expectation Maximization (CEM) algorithm, which enforces time-sensitive constraints (TSC) when learning Gaussian Mixture Models (GMM). Our approach applies to data on Riemannian manifolds and extends to task-parameterized scenarios. We validate CEM against state-of-the-art methods on handwritten data and real robot applications utilizing the KUKA LBR iiwa. By enforcing constraints within the learning process, CEM achieves improved and more efficient reproduction of the demonstration data.
|
| |
| 09:00-10:30, Paper TuI1I.364 | Add to My Program |
| Fine-Grained Classification for Depth Estimation from Monocular Microscopy for Robotic Micromanipulation of Motile Cells |
|
| Yang, Han | The Chinese University of Hong Kong, Shenzhen |
| Jin, Yufei | The Chinese Univiersity of Hong Kong(shenzhen) |
| Jiang, Aojun | University of Toronto |
| Wang, Xinrui | The Chinese University of Hongkong (Shenzhen) |
| Wang, Xibu | Department of Reproductive Medicine, The 3rd Afliated Hospital of Shenzhen University, Shenzhen, China. |
| Yi, Xiaoling | Department of Reproductive Medicine, The 3rd Afliated Hospital of Shenzhen University, Shenzhen, China. |
| Sun, Yu | University of Toronto |
| Zhang, Zhuoran | The Chinese University of Hong Kong, Shenzhen |
|
|
| |
| 09:00-10:30, Paper TuI1I.365 | Add to My Program |
| Anytime Probabilistically Constrained Provably Convergent Online Belief Space Planning |
|
| Zhitnikov, Andrey | Technion – Israel Institute of Technology |
| Indelman, Vadim | Technion - Israel Institute of Technology |
Keywords: Robot Safety, Autonomous Agents, SLAM, Constrained Anytime Belief Space Planning
Abstract: Taking into account future risk is essential for an autonomously operating robot to find online not only the best but also a safe action to execute. In this paper, we build upon the recently introduced formulation of probabilistic belief-dependent constraints. In our methodology, safety can be materialized with any general belief-dependent operator we call payoff. We present an anytime approach employing the Monte Carlo Tree Search (MCTS) method in continuous domains in terms of states, actions, and observations, with general belief-dependent reward and payoff operators. Unlike previous approaches, our method ensures safety anytime with respect to the currently expanded search tree without relying on the convergence of the search. We prove convergence in probability with an exponential rate for a version of our algorithms and study the proposed techniques via extensive simulations. Even with a tiny number of tree queries, the best action found by our approach is much safer than the baseline. Moreover, our approach consistently yields a better action than the baseline in terms of the objective function. This is because we revise the values and statistics maintained in the search tree and r
|
| |
| 09:00-10:30, Paper TuI1I.366 | Add to My Program |
| Autonomous Decentralized Control for Motion Switching in an Intestine-Inspired Peristaltic Mixing Pump Adaptive to Physical Phase Transitions of Mixed Materials |
|
| Tsurumi, Koya | Chuo University |
| Tanno, Takaaki | Chuo University |
| Adachi, Ryosuke | Chuo University |
| Ito, Fumio | Chuo University |
| Hanamura, Tomoki | Shinshu University |
| Umedachi, Takuya | Shinshu University |
| Nakamura, Taro | Chuo University |
Keywords: Biomimetics, Soft Robot Applications, Distributed Robot Systems
Abstract: In this paper, we developed an autonomous decentralized control method that incorporates phase-difference adjustment based on a sigmoid function, enabling the design of both increases and decreases in discrepancy. The method was applied to a peristaltic mixing pump capable of mixing and transporting solid–liquid multiphase fluids. This study aims to realize a soft robotics system that autonomously switches motion modes according to changes in the physical properties of the transported material, thereby integratively mimicking both the motility and motion-switching functions of the intestine. Conventional autonomous decentralized control methods have been applied to the locomotion of amoeba-type and snake-type robots. However, when such control laws are applied to pumps, it is difficult to achieve appropriate motion switching in environments where the contents harden due to mixing. In this paper, we employed a sigmoid function that allows bidirectional control of discrepancy and constructed a new control law based on target phase-difference adjustment without feedback. The control law was implemented in a four-unit pump, and we confirmed that the desired motion patterns could be reproduced according to the preset target phase differences. As a result, the phase differences between all units converged to the target values within approximately 10–30 s after actuation began, producing the intended motion patterns. Furthermore, polyvinyl alcohol solution and borax water were used as contents whose fluidity decreases during mixing. We verified that autonomous motion switching occurred as the discrepancy increased. The results showed that, in units containing hardened material, a conveying motion with a phase difference of π/3 was generated, whereas in units with residual
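The target phase-difference adjustment described above can be illustrated with a small sketch of a chain of oscillator units, where a sigmoid maps each unit's phase discrepancy to a bounded, sign-sensitive correction. The gains, time step, and exact update rule here are hypothetical stand-ins, not the paper's controller:

```python
import numpy as np

def phase_step(phi, target_dphi, k=4.0, gain=0.5, dt=0.1):
    """One update of a chain of pump units: each unit nudges its phase
    so the difference to its predecessor approaches the target. The
    sigmoid maps the discrepancy to a bounded correction in (-1, 1),
    allowing the discrepancy to be driven either up or down."""
    phi = phi.copy()
    for i in range(1, len(phi)):
        err = (phi[i] - phi[i - 1]) - target_dphi
        correction = 2.0 / (1.0 + np.exp(k * err)) - 1.0  # sigmoid, sign-sensitive
        phi[i] += gain * correction * dt
    return phi

phi = np.zeros(4)  # four pump units, all starting in phase
for _ in range(500):
    phi = phase_step(phi, target_dphi=np.pi / 3)
```

Each unit only reads its neighbour's phase, so the rule is fully decentralized; switching the target phase difference (e.g. to the π/3 conveying pattern mentioned above for hardened material) changes the emergent wave without any central coordinator.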
|
| |
| 09:00-10:30, Paper TuI1I.367 | Add to My Program |
| Tightly Coupled Rao-Blackwellized Particle Filter for GNSS-Only Positioning in Urban Environments without Ambiguity Resolution |
|
| Niimi, Daiki | Meijo University |
| Fujino, An | Meijo University |
| Suzuki, Taro | Chiba Institute of Technology |
| Meguro, Junichi | Meijo University |
Keywords: Localization, Autonomous Vehicle Navigation, Probability and Statistical Methods
Abstract: This paper presents a tightly coupled Rao-Blackwellized particle filter (TC-RBPF) for global navigation satellite system (GNSS) positioning that eliminates the need for carrier-phase integer ambiguity resolution. The previously proposed loosely coupled RBPF (LC-RBPF) approach uses carrier-phase residuals to estimate particle likelihoods, enabling positioning without integer ambiguity resolution. However, the position estimation accuracy depends on the performance of the state transition. The previous approach estimates velocity using a Kalman filter (KF) based on least-squares Doppler measurements, which are vulnerable to non-line-of-sight (NLOS) multipath errors. This often leads to complete positioning failure in urban environments. To overcome these limitations, the proposed TC-RBPF tightly integrates raw Doppler measurements into the KF. This enables consistent estimation of both velocity and receiver clock drift within a time-series framework. Furthermore, a robust KF based on Student's t-distribution and particle-wise NLOS rejection using double-differenced pseudorange residuals are introduced to mitigate the impact of outliers. Together, these mechanisms enhance outlier robustness and transition reliability. Experimental evaluations in six challenging urban scenarios demonstrate that the proposed method achieves superior positioning performance compared to existing methods, confirming its effectiveness under degraded GNSS conditions.
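Why a Student's t residual model tempers NLOS outliers can be seen by contrasting its log-likelihood with a Gaussian's; this is a generic sketch of the robustification idea, not the paper's filter:

```python
import math

def gaussian_loglik(r, sigma=1.0):
    # standard Gaussian residual model: quadratic penalty in the residual r
    return -0.5 * (r / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def student_t_loglik(r, nu=4.0, sigma=1.0):
    # Student's t with nu degrees of freedom: heavier tails, so a large
    # NLOS residual is penalized roughly logarithmically, not quadratically,
    # and a single outlier cannot dominate the particle weights
    c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
         - 0.5 * math.log(nu * math.pi * sigma ** 2))
    return c - (nu + 1) / 2 * math.log(1 + (r / sigma) ** 2 / nu)
```

Near zero the two models agree closely, but at a 10-sigma residual the Gaussian log-likelihood collapses while the t-distribution's remains moderate.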
|
| |
| 09:00-10:30, Paper TuI1I.368 | Add to My Program |
| A Minimal Subset Approach for Informed Keyframe Sampling in Large-Scale SLAM |
|
| Stathoulopoulos, Nikolaos | Luleå University of Technology |
| Kanellakis, Christoforos | Luleå University of Technology |
| Nikolakopoulos, George | Luleå University of Technology |
Keywords: Computer Vision for Automation, Recognition
Abstract: Typical LiDAR SLAM architectures feature a front-end for odometry estimation and a back-end for refining and optimizing the trajectory and map, commonly through loop closures. However, loop closure detection in large-scale missions presents significant computational challenges due to the need to identify, verify, and process numerous candidate pairs for pose graph optimization. Keyframe sampling bridges the front-end and back-end by selecting frames for storing and processing during global optimization. This article proposes an online keyframe sampling approach that constructs the pose graph using the most impactful keyframes for loop closure. We introduce the Minimal Subset Approach (MSA), which optimizes two key objectives: redundancy minimization and information preservation, implemented within a sliding window framework. By operating in the feature space rather than 3-D space, MSA efficiently reduces redundant keyframes while retaining essential information. Evaluations on diverse public datasets show that the proposed approach outperforms naive methods in reducing false positive rates in place recognition, while delivering superior ATE and RPE in metric localization, without the need for manual parameter tuning. Additionally, MSA demonstrates efficiency and scalability by reducing memory usage and computational overhead during loop closure detection and pose graph optimization.
|
| |
| 09:00-10:30, Paper TuI1I.369 | Add to My Program |
| Design and Experimental Validation of a Controller for Bowden-Cable Actuators Subject to Friction Variation |
|
| Lu, Yaodong | Sorbonne Université |
| Aoustin, Yannick | CN-Nantes Université |
| Nocito, Pablo | SORBONNE UNIVERSITE - Agence comptable |
| Mick, Sébastien | Sorbonne Université, CNRS, INSERM, Institut des Systemes Intelligents et de Robotique, ISIR, |
| Jarrassé, Nathanael | Sorbonne Université, ISIR UMR 7222 CNRS |
|
|
| |
| 09:00-10:30, Paper TuI1I.370 | Add to My Program |
| ROAR: A Robust Autonomous Aerial Tracking System for Challenging Scenarios |
|
| Zhang, Tong | Northwestern Polytechnical University |
| Li, Chenghao | Northwestern Polytechnical University |
| Zhao, Kezhen | Northwestern Polytechnical University |
| Shen, Hao | Northwestern Polytechnical University |
| Pang, Tao | The 32nd Research Institute of China Electronics Technology Group |
|
|
| |
| 09:00-10:30, Paper TuI1I.371 | Add to My Program |
| Multi-Agent Collaboration for PrSTL Specifications with Temporal Collective Counting Operators |
|
| Quan, Yicheng | Beijing Institute of Technology |
| Yang, Yan | University of Science and Technology Beijing |
| Liu, Zhijie | University of Science and Technology Beijing |
| Shi, Zhongjiao | Beijing Institute of Technology |
Keywords: Formal Methods in Robotics and Automation, Planning under Uncertainty, Path Planning for Multiple Mobile Robots or Agents
Abstract: We address the collaborative path planning problem for multi-agent systems with heterogeneous capabilities, subject to uncertainty and operating under complex task specifications. Conventional Probabilistic Signal Temporal Logic (PrSTL) frameworks exhibit significant limitations in describing multi-agent collaborative tasks with temporally cumulative properties. To address this challenge, we extend the PrSTL framework by introducing a Temporal Collective Counting Operator to characterize such spatio-temporal specifications. We then formulate the multi-agent collaborative planning problem under dynamics uncertainty as a Mixed-Integer Second-Order Cone Program. This formulation leverages PrSTL to specify tasks with cumulative temporal properties, while employing Polynomial Chaos Expansion to propagate uncertainty. Finally, we propose a constraint relaxation mechanism to address the conservatism introduced by formula transformations and the approximation of probabilistic constraints.
|
| |
| 09:00-10:30, Paper TuI1I.372 | Add to My Program |
| Real-World Robot Control by Deep Active Inference with a Temporally Hierarchical World Model |
|
| Fujii, Kentaro | Keio University |
| Murata, Shingo | Keio University |
Keywords: Cognitive Control Architectures, Learning from Experience, Machine Learning for Robot Control
Abstract: Robots in uncertain real-world environments must perform both goal-directed and exploratory actions. However, most deep learning-based control methods neglect exploration and struggle under uncertainty. To address this, we adopt deep active inference, a framework that accounts for human goal-directed and exploratory actions. Yet, conventional deep active inference approaches face challenges due to limited environmental representation capacity and high computational cost in action selection. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. The world model encodes environmental dynamics into hidden state representations at slow and fast timescales. The action model compresses action sequences into abstract actions using vector quantization, and the abstract world model predicts future slow states conditioned on the abstract action, enabling low-cost action selection. We evaluate the framework on object-manipulation tasks with a real-world robot. Results show that it achieves high success rates across diverse manipulation tasks and switches between goal-directed and exploratory actions in uncertain settings, while making action selection computationally tractable. These findings highlight the importance of modeling multiple timescale dynamics and abstracting actions and state transitions.
|
| |
| 09:00-10:30, Paper TuI1I.373 | Add to My Program |
| Enhanced Autonomous Navigation on the Perseverance Mars Rover (I) |
|
| Toupet, Olivier | Jet Propulsion Laboratory, California Institute of Technology |
| Ono, Masahiro | Jet Propulsion Laboratory, California Institute of Technology |
| Del Sesto, Tyler | Jet Propulsion Laboratory, California Institute of Technology |
| Maimone, Mark | Jet Propulsion Laboratory, California Institute of Technology |
| McHenry, Michael | Jet Propulsion Laboratory, California Institute of Technology |
Keywords: Space Robotics and Automation, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: This paper presents Enhanced Autonomous Navigation, or ENav, the autonomous driving algorithm of NASA's Mars rover Perseverance. A unique challenge for the autonomous driving of Perseverance is to meet strict safety and performance requirements in a highly uncertain environment with only a single-core CPU with extremely limited computing resources. ENav overcame this challenge with a novel two-stage path selection approach that balances path optimality and computational efficiency, combined with a unique collision checking algorithm that conservatively approximates computationally expensive kinematic settling. In addition, ENav provides robustness against slip by expanding the bounding boxes for wheels used by the collision check. These new features, together with FPGA-accelerated vision processing, enabled Perseverance to autonomously drive on substantially more rock-dense terrains and increased the average daily driving distance by an order of magnitude compared to its predecessors, the Curiosity, Spirit, and Opportunity rovers. Perseverance has set several new records for autonomous driving on Mars, breaking those previously held by the Opportunity rover. As of the 1312th Martian day since landing, or 28 October 2024 on the Earth calendar, ~90% of the 32.1 km of driving has used ENav to evaluate the terrain. This paper provides detailed documentation of the ENav algorithm, as well as its implementation, testing, deployment, and driving results on Mars.
|
| |
| 09:00-10:30, Paper TuI1I.374 | Add to My Program |
| Kinodynamic Trajectory Planning for Efficient UAV Exploration and Reconstruction of Unknown Environments |
|
| Félix Mendes, João | Instituto Superior Técnico |
| Basiri, Meysam | Instituto Superior Técnico |
| Ventura, Rodrigo | Instituto Superior Técnico |
Keywords: Motion and Path Planning, Autonomous Agents, Aerial Systems: Perception and Autonomy
Abstract: Autonomous exploration of unknown 3D environments requires motion planners that can efficiently identify informative regions to explore while continuously adapting to the evolving map of the environment. While existing sampling-based methods have demonstrated strong real-time performance, they often ignore the robot’s kinodynamic model and constraints. Consequently, they generate only target positions, neglecting kinodynamic considerations in the next-best-view decision process. This results in frequent slowdowns and abrupt maneuvers, reducing coverage speed and exploration efficiency. In this work, we propose a kinodynamic motion planning framework designed for fast and efficient exploration of unknown environments. By incorporating the robot’s kinodynamic model and constraints into a kinodynamic RRT, our approach bridges the gap between dynamically feasible motion and effective viewpoint selection, producing smoother and faster trajectories that improve exploration performance. Additionally, we present an Iterative Minimum Gain (IMG) approach to improve global coverage, and a novel informed yaw optimization method that accelerates optimal yaw selection, achieving more than twice the speed of state-of-the-art methods. We validate our framework through extensive simulation and real-world experiments, demonstrating improved exploration rates, higher average velocities, and better global coverage over existing methods.
|
| |
| 09:00-10:30, Paper TuI1I.375 | Add to My Program |
| Approximating Global Contact-Implicit MPC Via Sampling and Local Complementarity |
|
| Venkatesh, Sharanya | University of Pennsylvania |
| Bianchini, Bibit | University of Pennsylvania |
| Aydinoglu, Alp | University of Pennsylvania |
| Yang, William | Amazon Robotics |
| Posa, Michael | University of Pennsylvania |
Keywords: Dexterous Manipulation, Optimization and Optimal Control, Integrated Planning and Control
Abstract: To achieve general-purpose dexterous manipulation, robots must rapidly devise and execute contact-rich behaviors. Existing model-based controllers are incapable of globally optimizing in real-time over the exponential number of possible contact sequences. Instead, recent progress in contact-implicit control has leveraged simpler models that, while still hybrid, make local approximations. However, the use of local models inherently limits the controller to only exploit nearby interactions, potentially requiring intervention to richly explore the space of possible contacts. We present a novel approach which leverages the strengths of local complementarity-based control in combination with low-dimensional, but global, sampling of possible end-effector locations. Our key insight is to consider a contact-free stage preceding a contact-rich stage at every control loop. Our algorithm, in parallel, samples end effector locations to which the contact-free stage can move the robot, then considers the cost predicted by contact-rich MPC local to each sampled location. The result is a globally-informed, contact-implicit controller capable of real-time dexterous manipulation. We demonstrate our controller on precise, non-prehensile manipulation of non-convex objects using a Franka Panda arm.
|
| |
| 09:00-10:30, Paper TuI1I.376 | Add to My Program |
| L-BIRD: Lightweight Bio-Inspired Rotary-Wing Drone |
|
| Guo, Xuwen | East China Normal University |
| Zhu, Mingxuan | East China Normal University |
| Tian, Yinghong | East China Normal University |
Keywords: Biologically-Inspired Robots, Aerial Systems: Mechanics and Control, Embedded Systems for Robotic and Automation
Abstract: In nature, birds exhibit outstanding attitude control, enabling flexible and efficient takeoff, hovering and landing — capabilities that have not been fully replicated. Thus, we introduce the lightweight bio-inspired rotary-wing drone (L-BIRD). It incorporates a spherical structure, which can imitate birds' attitude variation and land on complex surfaces adaptively. L-BIRD employs a model predictive control (MPC) framework to enable real-time tracking of bird-like attitude trajectories derived from bio-inspired parameter pairs. To facilitate lightweight deployment on resource-constrained hardware platforms, we improve the MPC framework with a multi-path primal-dual neural network (PDNN), matrix sparsity, and multiplicative optimization. Experimental results, both in simulations and real-world deployments, demonstrate that L-BIRD realizes accurate and efficient biomimetic attitude control and diverse environmental adaptability. The attitude trajectory mean-square error (MSE) decreases to 0.0042 rad, and random access memory (RAM) usage is reduced by 39.3%.
|
| |
| 09:00-10:30, Paper TuI1I.377 | Add to My Program |
| RoboPrec: Enabling Reliable Embedded Computing for Robotics by Providing Accuracy Guarantees across Mixed-Precision Datatypes |
|
| Yilmaz, Alp Eren | Boston University |
| Bourgeat, Thomas | EPFL |
| Pentecost, Lillian | Amherst College |
| Plancher, Brian | Dartmouth College |
| Neuman, Sabrina | Boston University |
Keywords: Embedded Systems for Robotic and Automation, Software Tools for Robot Programming, Software-Hardware Integration for Robot Systems
Abstract: Mobile robots demand power efficiency as well as accuracy and high performance in their computations. Embedded microcontrollers and FPGAs, which can consume as much as 1000x less power than large CPUs and GPUs, offer a promising solution to these power needs. However, these power-efficient platforms often lack full floating-point support and rely on fixed-point computations to deliver performance. This is a challenge as most robotics software uses floating-point datatypes (double, float) to conservatively ensure accuracy, and prior works that use fixed-point types employ unreliable ad hoc approaches to select the datatype precision (i.e., quantity and allocation of bits). We address this challenge with the RoboPrec framework, where we: (i) develop a transpiler that integrates code transformations and robot-specific code generation with traditional numerical stability analysis methods (which calculate guaranteed error bounds), and adapts them to be practical and usable for real-world robotics software; and then leverage this to (ii) generate guaranteed-accuracy fixed-point code that is deployable to embedded computing platforms. We use rigid body dynamics, a fundamental robotics workload, as a motivating case study. We find that RoboPrec-generated 32-bit fixed-point code can be up to 8x faster than float and 122x faster than double on embedded processors while, critically, also providing guaranteed accuracy bounds with lower worst-case error than float. This work provides a foundation for practical and reliable low-power embedded robotics computing.
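The flavor of guarantee described above can be illustrated with a toy Q-format multiply and a hand-derived worst-case bound; `F`, `fix_mul`, and `mul_error_bound` are illustrative names, not RoboPrec's API or its analysis:

```python
F = 16                       # fractional bits (an illustrative Q-format choice)
EPS = 2.0 ** (-F - 1)        # worst-case error of one round-to-nearest quantization

def to_fix(x):
    # quantize a real value to an integer with F fractional bits
    return round(x * (1 << F))

def to_float(q):
    return q / (1 << F)

def fix_mul(a, b):
    # full-precision integer product, then a rounding shift back to F bits
    return (a * b + (1 << (F - 1))) >> F

def mul_error_bound(x, y):
    # interval-style worst case: (x+e1)(y+e2) = xy + x*e2 + y*e1 + e1*e2
    # with |e1|,|e2| <= EPS, plus EPS for the rounding shift in fix_mul
    return abs(x) * EPS + abs(y) * EPS + EPS * EPS + EPS
```

The point of this sketch is that the bound is provable before execution; the actual framework automates such bounds across whole robotics kernels rather than a single multiply.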
|
| |
| 09:00-10:30, Paper TuI1I.378 | Add to My Program |
| K-ARC: Adaptive Robot Coordination for Multi-Robot Kinodynamic Planning |
|
| Qin, Mike | University of Illinois Urbana-Champaign |
| Solis Vidana, Juan Irving | University of Illinois Urbana-Champaign |
| Motes, James | University of Illinois Urbana-Champaign |
| Morales, Marco | University of Illinois Urbana-Champaign & Instituto Tecnológico Autónomo De México |
| Amato, Nancy | University of Illinois Urbana-Champaign |
Keywords: Motion and Path Planning, Multi-Robot Systems, Nonholonomic Motion Planning
Abstract: This work presents Kinodynamic Adaptive Robot Coordination (K-ARC), a novel algorithm for multi-robot kinodynamic planning. Our experimental results show the capability of K-ARC to plan for up to 32 planar mobile robots, while achieving up to an order of magnitude of speed-up compared to previous methods in various scenarios. K-ARC is able to achieve this due to its two main properties. First, K-ARC constructs its solution iteratively by planning in segments, where initial kinodynamic paths are found through optimization-based approaches and the inter-robot conflicts are resolved through sampling-based approaches. The interleaving use of sampling-based and optimization-based approaches allows K-ARC to leverage the strengths of both approaches in different sections of the planning process where one is more suited than the other, while previous methods tend to emphasize one over the other. Second, K-ARC builds on a previously proposed multi-robot motion planning framework, Adaptive Robot Coordination (ARC), and inherits its strength of focusing on coordination between robots only when needed, saving computation efforts. We show how the combination of these two properties allows K-ARC to achieve overall better performance in our simulated experiments with increasing numbers of robots, increasing degrees of problem difficulties, and increasing complexities of robot dynamics.
|
| |
| 09:00-10:30, Paper TuI1I.379 | Add to My Program |
| Fourigami: A 4-Degree-Of-Freedom, Force-Controlled, Origami, Finger Pad Haptic Device |
|
| Winston, Crystal | Stanford University |
| Choi, Hojung | Stanford University |
| Jitosho, Rianna | Stanford University |
| Zhakypov, Zhenishbek | Company |
| Palmer, Jasmin Elena | Stanford University |
| Cutkosky, Mark | Stanford University |
| Okamura, Allison M. | Stanford University |
Keywords: Haptics and Haptic Interfaces, Force Control, Parallel Robots, Soft Robot Materials and Design
Abstract: Skin deformation haptic devices worn on the finger pad provide realistic touch feedback during interactions with virtual objects. Two primary challenges in creating such devices are: (1) making a multi-degree-of-freedom device that is small and lightweight so it does not encumber the wearer and (2) providing accurate control of forces displayed to the finger pad. This work presents a 4-degree-of-freedom (DoF) finger pad haptic device, called Fourigami, that addresses these challenges. We address the first challenge using origami manufacturing methods and pneumatic actuation to fabricate a 25 g prototype that displays normal, shear, and twist and can be easily worn on the finger pad. We address the second challenge using a low-profile, 6-DoF, force/torque sensor to control forces displayed to the finger. Fourigami has a bandwidth ranging from 2-4 Hz depending on direction, and when acting on a human finger, it exerts forces ranging from ± 1.0 N in shear, 4.2 N in normal, and ± 4.2 N·mm of twist. Finally, we demonstrate the device’s efficacy when rendering haptic feedback to a user tracking a sinusoidal trajectory and a trajectory representing interactions with a virtual environment.
|
| |
| 09:00-10:30, Paper TuI1I.380 | Add to My Program |
| MAPF-HD: Multi-Agent Path Finding in High-Density Environments |
|
| Makino, Hiroya | Toyota Central R&D Labs., Inc |
| Ito, Seigo | Toyota Central R&D Labs., Inc |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Intelligent Transportation Systems
Abstract: Multi-agent path finding (MAPF) involves planning efficient paths for multiple agents to move simultaneously while avoiding collisions. In typical warehouse environments, agents are often sparsely distributed along aisles; however, increasing the agent density can improve space efficiency. When the agent density is high, it becomes necessary to optimize the paths not only for goal-assigned agents but also for those obstructing them. This study proposes a novel MAPF framework for high-density environments (MAPF-HD). Several studies have explored MAPF in similar settings using integer linear programming (ILP). However, ILP-based methods require substantial computation time to optimize all agent paths simultaneously. Even in small grid-based environments with fewer than 100 cells, these computations can take tens to hundreds of seconds. Such high computational costs render these methods impractical for large-scale applications such as automated warehouses and valet parking. To address these limitations, we introduce the phased null-agent swapping (PHANS) method. PHANS employs a heuristic approach to incrementally swap positions between agents and empty vertices. This method solves the MAPF-HD problem within a few seconds, even in large environments containing more than 700 cells. The proposed method has the potential to improve efficiency in various real-world applications such as warehouse logistics, traffic management, and crowd control.
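The abstract describes PHANS only at a high level; as a rough illustration of the empty-vertex-swapping idea, here is a toy grid where the single empty cell is routed by BFS and agents advance by swapping with it. The names `occ`, `route_empty`, and `slide_empty` are hypothetical, not the paper's implementation:

```python
from collections import deque

def neighbors(cell, w, h):
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < w and 0 <= y + dy < h:
            yield (x + dx, y + dy)

def route_empty(start, target, w, h):
    # BFS route for the empty vertex; stepping onto an occupied cell is
    # realized as a swap with the agent sitting there, so every cell is passable
    prev = {start: None}
    q = deque([start])
    while q:
        c = q.popleft()
        if c == target:
            path = []
            while c is not None:
                path.append(c)
                c = prev[c]
            return path[::-1]
        for n in neighbors(c, w, h):
            if n not in prev:
                prev[n] = c
                q.append(n)
    return None

def slide_empty(occ, path):
    # occ: dict cell -> agent id (None marks the empty vertex);
    # each step swaps the empty with the occupant of the next cell
    for a, b in zip(path, path[1:]):
        occ[a], occ[b] = occ[b], occ[a]
    return occ
```

Even in a fully packed grid minus one cell, this lets any agent reach any position; the heuristic phasing that makes this fast at 700+ cells is the paper's contribution and is not reproduced here.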
|
| |
| 09:00-10:30, Paper TuI1I.381 | Add to My Program |
| LiHi-GS: LiDAR-Supervised Gaussian Splatting for Highway Driving Scene Reconstruction |
|
| Kung, Pou-Chun | University of Michigan, Ann Arbor |
| Zhang, Xianling | Ford Motor Company, Latitude AI |
| Skinner, Katherine | University of Michigan |
| Jaipuria, Nikita | MIT |
Keywords: Deep Learning for Visual Perception, Mapping, Sensor Fusion
Abstract: Photorealistic 3D scene reconstruction plays an important role in autonomous driving, enabling the generation of novel data from existing datasets to simulate safety-critical scenarios and expand training data without additional acquisition costs. Gaussian Splatting (GS) facilitates real-time, photorealistic rendering with an explicit 3D Gaussian representation of the scene, providing faster processing and more intuitive scene editing than the implicit Neural Radiance Fields (NeRFs). While extensive GS research has yielded promising advancements in autonomous driving applications, existing methods overlook two critical aspects: First, they mainly focus on low-speed and feature-rich urban scenes and ignore the fact that highway scenarios play a significant role in autonomous driving. Second, while LiDARs are commonplace in autonomous driving platforms, existing methods learn primarily from images and use LiDAR only for initial estimates or without precise sensor modeling, thus missing out on leveraging the rich depth information LiDAR offers and limiting the ability to synthesize LiDAR data. In this paper, we propose a novel GS method for dynamic scene synthesis and editing with improved scene reconstruction through LiDAR supervision and support for LiDAR rendering. Unlike prior works that are tested mostly on urban datasets, to the best of our knowledge, we are the first to focus on the more challenging and highly relevant highway scenes for autonomous driving, with sparse sensor views and monotone backgrounds.
|
| |
| 09:00-10:30, Paper TuI1I.382 | Add to My Program |
| Lyapunov-Based Control Barrier Functions for Real-Time Safe Navigation in Three-Dimension Complex Environments |
|
| Zhang, Fuwei | Sun Yat - Sen University |
| Hou, Zhiwei | Sun Yat-Sen University |
Keywords: Robot Safety, Motion Control, Collision Avoidance
Abstract: In the field of safe navigation for mobile robots, control barrier functions (CBFs) have garnered significant attention due to their ability to transform complex safety constraints into real-time solvable optimization problems. In this letter, we propose a novel Lyapunov-based CBF framework. It offers the following key advantages: (1) Using a single Control Lyapunov Function (CLF), this method synthesizes spatially shifted CBFs to construct an expansive safe invariant set in obstacle-dense environments. (2) The framework is capable of incorporating existing approaches for constructing quadratic CLF, making it applicable to a wide range of complex nonlinear systems and enhancing its generality and extensibility. (3) It enables real-time synthesis of CBFs, and ensures safety in large-scale 3D environments through efficient CBF-based quadratic programming (CBF-QP). (4) The method ensures safety while inheriting the stability properties of the CLF, allowing the asymptotic convergence of the system state to equilibrium, thus unifying safety and motion stability. To validate efficacy, we rigorously tested the framework in both simulations and hardware experiments.
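In the single-constraint case, the CBF-QP mentioned above reduces to a closed-form projection of the nominal control onto the safe half-space; a minimal sketch of that special case (the paper's Lyapunov-shifted CBF synthesis is more general):

```python
def cbf_qp_filter(u_nom, a, b):
    # closed-form solution of: min ||u - u_nom||^2  s.t.  a . u >= b,
    # where a . u >= b encodes one CBF condition (e.g., h_dot + alpha*h >= 0)
    dot = sum(ai * ui for ai, ui in zip(a, u_nom))
    if dot >= b:
        return list(u_nom)              # nominal control is already safe
    lam = (b - dot) / sum(ai * ai for ai in a)
    return [ui + lam * ai for ui, ai in zip(u_nom, a)]
```

The filter leaves the nominal control untouched whenever it already satisfies the safety condition, which is why CBF-QP schemes can inherit the stability behavior of the underlying controller away from obstacles.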
|
| |
| 09:00-10:30, Paper TuI1I.383 | Add to My Program |
| A Two-Layer Adaptive Assist-As-Needed Control Scheme for Rehabilitation Robotics |
|
| Zhang, Maozeng | Southeast University |
| Li, Huijun | Southeast University |
| Shi, Ke | Southeast University |
| Song, Aiguo | Southeast University |
| |
| 09:00-10:30, Paper TuI1I.384 | Add to My Program |
| Multi-Strategy Enhanced Particle Swarm Optimization for Variable Curvature Path Planning in Flexible Needle Insertion |
|
| Qin, Yanding | Nankai University |
| Teng, Jianing | Nankai University |
| Wen, Chao | Nankai University |
| Fang, Ge | Nankai University |
| Wang, Hongpeng | Nankai University |
| Han, Jianda | Nankai University |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Surgical Robotics: Planning, Motion and Path Planning
Abstract: Flexible needles provide enhanced adaptability for navigating puncture pathways and avoiding obstacles when compared to conventional rigid needles. However, developing a three-dimensional (3D) curved path for a flexible needle is challenging, particularly in achieving both effective obstacle avoidance and precise targeting. To this end, we propose an improved particle swarm optimization-based path planning approach incorporating good point set initialization and a heuristic multi-mutation strategy. This incorporation greatly enhances the algorithm’s global search capability while ensuring fast convergence. In addition, 3D biarc curve fitting was employed to develop a kinematically reachable path for bevel-tip needles. Obstacle-avoidance simulations demonstrate the superior performance of the proposed method over state-of-the-art algorithms in terms of path length, distance to obstacles, repeatability, and avoidance of local minima traps. Needle puncturing experiments performed using duty cycling control achieved a small curvature radius of 49.6 mm and targeting errors of less than 4 mm. This algorithm facilitates efficient variable curvature path planning for flexible needles, ensuring precise targeting while effectively avoiding obstacles.
|
| |
| 09:00-10:30, Paper TuI1I.385 | Add to My Program |
| Real-Time Adaptive Motion Planning Via Point Cloud-Guided, Energy-Based Diffusion and Potential Fields |
|
| Teshome, Wondmgezahu | Northeastern University |
| Behzad, Kian | Northeastern University |
| Camps, Octavia I. | Northeastern University |
| Everett, Michael | Northeastern University |
| Siami, Milad | Northeastern University |
| Sznaier, Mario | Northeastern University |
Keywords: AI-Based Methods, Motion and Path Planning, Planning under Uncertainty
Abstract: Motivated by the problem of pursuit-evasion, we present a motion planning framework that combines energy-based diffusion models with artificial potential fields for robust real time trajectory generation in complex environments. Our approach processes obstacle information directly from point clouds, enabling efficient planning without requiring complete geometric representations. The framework employs classifier-free guidance training and integrates local potential fields during sampling to enhance obstacle avoidance. In dynamic scenarios, the system generates initial trajectories using the diffusion model and continuously refines them through potential field-based adaptation, demonstrating effective performance in pursuit-evasion scenarios with partial pursuer observability.
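The potential-field refinement step can be illustrated with the classic repulsive potential acting directly on point-cloud points; `repulsive_force` and `refine` are hypothetical names for a 2D sketch, not the paper's formulation:

```python
import math

def repulsive_force(p, cloud, d0=1.0, k=1.0):
    # negative gradient of the classic repulsive potential
    # U = 0.5*k*(1/d - 1/d0)^2, summed over cloud points within radius d0;
    # the result pushes the point p away from nearby obstacle points
    fx = fy = 0.0
    for ox, oy in cloud:
        dx, dy = p[0] - ox, p[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d < d0:
            mag = k * (1.0 / d - 1.0 / d0) / (d * d)
            fx += mag * dx / d
            fy += mag * dy / d
    return fx, fy

def refine(p, cloud, step=0.05, iters=10):
    # a few gradient steps nudge a sampled waypoint out of the obstacles'
    # influence region (an illustrative stand-in for the paper's adaptation)
    for _ in range(iters):
        fx, fy = repulsive_force(p, cloud)
        p = (p[0] + step * fx, p[1] + step * fy)
    return p
```

Because the force is computed per point, no mesh or complete geometric model of the obstacle is needed, which matches the point-cloud-driven design the abstract describes.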
|
| |
| 09:00-10:30, Paper TuI1I.386 | Add to My Program |
| Terrain-Aware Probabilistic Search Planning for Unmanned Aerial Vehicles (I) |
|
| Schomer, Nathan | Oregon State University |
| Adams, Julie | Oregon State University |
Keywords: Search and Rescue Robots, Field Robots, Optimization and Optimal Control
Abstract: Mountain search and rescue is a form of emergency response to assist people in austere environments (e.g., extreme terrain, poor weather). Volunteer mountain search and rescue teams in the United States have begun adopting consumer-grade unmanned aerial vehicles to assist with a variety of tasks (e.g., search, resource delivery); however, these tools lack the autonomy necessary for the mountain search and rescue teams to fully realize their potential for wide area, aerial search. The unique and tight constraints of mountain search and rescue (e.g., in situ computation, sensor limitations) greatly limit the applicability of recent robotics research. A two-step coverage path planning algorithm that leverages existing viewpoint and path planning approaches was developed to meet the unique needs of mountain search and rescue. Viewpoints were sampled to meet a minimum coverage ratio and assigned priority from a search probability map. The path planning problem was formulated as a clustered traveling salesperson problem, which is solved with a metaheuristic iterative solver. Simulation results inform parameter selection for a series of field experiments. The field experiments demonstrate how the new algorithm can provide resilience against the many compounding factors that make UAV-based mountain search and rescue challenging.
|
| |
| 09:00-10:30, Paper TuI1I.387 | Add to My Program |
| FW-ORB-SLAM: A Monocular Visual SLAM Algorithm for Flapping-Wing Flying Robots |
|
| Zhong, Zheng | University of Science and Technology Beijing |
| Chen, Shou | University of Science and Technology Beijing |
| Fu, Qiang | University of Science and Technology Beijing |
| Wang, Jiubin | University of Science and Technology Beijing |
| He, Wei | Beijing Information Science and Technology University |
Keywords: Biologically-Inspired Robots, SLAM
Abstract: Visual simultaneous localization and mapping (SLAM) is of great significance for flapping-wing flying robots (FWFRs) to enhance their autonomous navigation capabilities in complex environments. However, during the motion of FWFRs, there are intense image vibrations accompanied by significant illumination changes, which prevent existing visual SLAM algorithms from being directly applied to FWFRs. Therefore, this paper proposes a modified ORB-SLAM3 algorithm called FW-ORB-SLAM for FWFRs. First, we adopt the fast Fourier transform (FFT) method to map the original images to the frequency domain. Then, based on the characteristic flapping motion of the FWFR, we decompose the frequency domain jitter to obtain stabilized images. Moreover, to mitigate the impact of illumination variations on feature point tracking during outdoor flight, a local adaptive contrast enhancement method is proposed, which enhances the stability of feature point tracking and augments the robustness of the SLAM algorithm. Finally, flight experiments carried out using our self-developed FWFR named U-Dove demonstrate that FW-ORB-SLAM outperforms the state-of-the-art ORB-SLAM3 algorithm, which provides insights into performing vision-based SLAM tasks for the FWFR.
|
| |
| 09:00-10:30, Paper TuI1I.388 | Add to My Program |
| REBot: Reflexive Evasion Robot for Instantaneous Dynamic Obstacle Avoidance |
|
| Xu, Zihao | National University of Singapore |
| Hao, Ce | National University of Singapore |
| Wang, Chunzheng | National University of Singapore |
| Sima, Kuankuan | National University of Singapore |
| Shi, Fan | National University of Singapore |
| Dong, Jin Song | National University of Singapore |
Keywords: Legged Robots, Reinforcement Learning
Abstract: Dynamic obstacle avoidance (DOA) is critical for quadrupedal robots operating in environments with moving obstacles or humans. Existing approaches typically rely on navigation-based trajectory replanning, which assumes sufficient reaction time and fails when obstacles approach rapidly. In such scenarios, quadrupedal robots require reflexive evasion capabilities to perform instantaneous, low-latency maneuvers. This paper introduces Reflexive Evasion Robot (REBot), a control framework that enables quadrupedal robots to achieve real-time reflexive obstacle avoidance. REBot integrates an avoidance policy and a recovery policy within a finite-state machine. With carefully designed learning curricula and by incorporating regularization and adaptive rewards, REBot achieves robust evasion and rapid stabilization in instantaneous DOA tasks. We validate REBot through extensive simulations and real-world experiments, demonstrating notable improvements in avoidance success rates, energy efficiency, and robustness to fast-moving obstacles. Paper homepage: https://rebot-2025.github.io/.
|
| |
| 09:00-10:30, Paper TuI1I.389 | Add to My Program |
| Heterogeneous Multirobot Team: Maritime Inspection and Intervention in Global Navigation Satellite System-Denied Scenarios |
|
| Arbanas Ferreira, Barbara | Centre of Excellence MARBLE |
| Ivanovic, Antun | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Peti, Marijana | Faculty of electrical engineering and computing |
| Mandic, Luka | FER, University of Zagreb |
| Batoš, Matko | University of Zagreb Faculty of Electrical Engineering and Computing |
| Domislovic, Jakob | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Vasiljevic, Goran | Faculty of Electrical Engineering and Computing, Zagreb, Croatia |
| Obradovic, Juraj | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Ferreira, Fausto | Faculty of Electrical Engineering and Computing (FER), University of Zagreb |
| Petric, Frano | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Barisic, Antonella | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Car, Marko | Faculty of Electrical Engineering and Computing |
| Goričanec, Jurica | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Kraševac, Natko | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Krizmancic, Marko | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Loncar, Ivan | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Markovic, Lovro | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Milijas, Robert | CoE MARBLE - Centre of Excellence in Maritime Robotics and Technologies for Sustainable Blue Economy |
| Stuhne, Dario | Faculty of Electrical Engineering and Computing, University of Zagreb |
| Orsag, Matko | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Bogdan, Stjepan | University of Zagreb |
| Miskovic, Nikola | University of Zagreb, Faculty of Electrical Engineering and |
|
|
| |
| 09:00-10:30, Paper TuI1I.390 | Add to My Program |
| Force Allocation Control for Redundant-Torso Biomimetic Quadrupeds Via Virtual Force and Fully Actuated System Approach |
|
| Zhuo, Zhiqin | Sun Yat-Sen University |
| Huang, Ke | Sun Yat-Sen University |
| Wu, Zhigang | Sun Yat-Sen University |
| Jiang, Jianping | Sun Yat-Sen University |
Keywords: Redundant Robots, Legged Robots, Force Control
Abstract: The fully actuated system approach (FASA) provides a promising control framework for robots with redundant actuation, offering simplified controller design and increased design freedom. However, its application to legged robots remains challenging due to hybrid actuation from intermittent ground contact and redundant inputs. To address this, we propose Virtual Force-based FASA (VF-FASA), which introduces virtual forces as intermediaries to construct the full-actuation conditions required by FASA. FASA generates virtual control laws based on a simplified torso dynamics model, and a matrix-weighted pseudoinverse optimization is employed to map these virtual inputs into actual torso joint torques and foot contact forces. This method achieves coordinated control of both the floating base and redundant torso, effectively leveraging joint redundancy for improved whole-body motion. Simulation results on a redundant-torso quadruped robot demonstrate robust trajectory tracking and effective whole-body coordination under dynamic locomotion. The framework extends FASA to legged systems, providing an effective approach for controlling quadruped robots.
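The matrix-weighted pseudoinverse used to map virtual inputs to actuator efforts has a standard closed form. The sketch below uses placeholder A and W matrices, not the paper's torso dynamics:

```python
import numpy as np

def weighted_pinv(A, W):
    """Minimum-weighted-norm right inverse:
    x = W^-1 A^T (A W^-1 A^T)^-1 b solves A x = b while minimizing
    x^T W x (standard closed form; A and W are illustrative here)."""
    Winv = np.linalg.inv(W)
    return Winv @ A.T @ np.linalg.inv(A @ Winv @ A.T)

# placeholder map from 3 actuator efforts to 2 virtual forces
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
W = np.diag([1.0, 2.0, 1.0])   # penalize the middle actuator more
b = np.array([1.0, 1.0])       # desired virtual force
x = weighted_pinv(A, W) @ b    # effort allocation reproducing b exactly
```

The weight matrix W is the design freedom the abstract alludes to: it shapes how the redundant joints share the load while the virtual control law is reproduced exactly.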
|
| |
| 09:00-10:30, Paper TuI1I.391 | Add to My Program |
| A Joint Learning of Force Feedback of Robotic Manipulation and Textual Cues for Granular Materials Classification |
|
| Zhang, Zeqing | The University of Hong Kong |
| Chen, Guanqi | The University of Hong Kong |
| Chen, Wentao | The University of Hong Kong |
| Jia, Ruixing | The University of Hong Kong |
| Chen, Guanhua | Southern University of Science and Technology |
| Zhang, Liangjun | GM |
| Pan, Jia | University of Hong Kong |
| Zhou, Peng | Great Bay University |
| |
| 09:00-10:30, Paper TuI1I.392 | Add to My Program |
| Probing Multimodal LLMs As World Models for Driving |
|
| Sreeram, Shiva | MIT |
| Wang, Tsun-Hsuan | Massachusetts Institute of Technology |
| Maalouf, Alaa | MIT |
| Rosman, Guy | Massachusetts Institute of Technology |
| Karaman, Sertac | Massachusetts Institute of Technology |
| Rus, Daniela | MIT |
Keywords: Performance Evaluation and Benchmarking, Data Sets for Robotic Vision, Autonomous Vehicle Navigation
Abstract: We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.
|
| |
| 09:00-10:30, Paper TuI1I.393 | Add to My Program |
| Unveiling SO(3) Parallel Robot Variants: Application of the Optimal Robot to a Humanoid Eye |
|
| Hassen, Nigatu | Zhejiang university, robotics institute |
| Li, Jihao | Zhejiang University |
| Shi, Gaokun | Yuyao Robot Research Center |
| Wang, Jianguo | Zhejiang University |
| Lu, GuoDong | Zhejiang University |
| Li, Howard | University of New Brunswick |
| Dong, Huixu | Zhejiang University |
| |
| 09:00-10:30, Paper TuI1I.394 | Add to My Program |
| Lyapunov Stability-Driven Control Algorithm for Heterogeneous Multi-Robot Coordination (I) |
|
| Rekabi Bana, Fatemeh | Durham University |
| Bahaidarah, Mazen | The University of Manchester |
| Marjanovic, Ognjen | University of Manchester |
| Arvin, Farshad | Durham University |
Keywords: Swarm Robotics, Distributed Robot Systems, Multi-Robot Systems
Abstract: Recent advancements in autonomous swarm systems mark a pivotal point in robotic science. Using a large-scale swarm of simple robots for complex tasks offers efficient, robust, and reliable solutions, inspired by natural phenomena. While bio-inspired methods are effective, approaches inspired by physical interactions in viscoelastic materials offer more structured ways to prove stability and robust performance mathematically. This paper proposes a new viscoelastic swarm algorithm applicable to heterogeneous swarm systems. The algorithm's development utilises the Lyapunov method to determine stability criteria and conditions, thereby avoiding reliance on complex optimisation to ensure stable performance parameters. A series of Monte Carlo simulations assessed the algorithm's performance and sensitivity to key variables. Furthermore, experiments with real robots evaluated the effects of variables like neighbourhood conditions and the stiffness coefficient on the algorithm's output. The results from simulations and experiments demonstrate the algorithm's stable, bounded performance and show how key variables, such as the stiffness coefficient and the number of neighbours, influence swarm performance. In real-world experiments, the proposed framework significantly reduces the robots' control effort while improving swarm behaviour, compared with a state-of-the-art algorithm.
|
| |
| 09:00-10:30, Paper TuI1I.395 | Add to My Program |
| CaFe-TeleVision: A Coarse-To-Fine Teleoperation System with Immersive Situated Visualization for Enhanced Ergonomics |
|
| Tang, Zixin | The Chinese University of Hong Kong |
| Chen, Yiming | The Chinese Univesity of Hong Kong |
| Rouxel, Quentin | The Chinese University of Hong Kong (CUHK) |
| Li, Dianxi | The Chinese University of Hong Kong |
| Wu, Shuang | Huawei |
| Chen, Fei | T-Stone Robotics Institute, the Chinese University of Hong Kong |
Keywords: Telerobotics and Teleoperation, Virtual Reality and Interfaces, Bimanual Manipulation
Abstract: Teleoperation presents a promising paradigm for remote control and robot proprioceptive data collection. Despite recent progress, current teleoperation systems still suffer from limitations in efficiency and ergonomics, particularly in challenging scenarios. In this paper, we propose CaFe-TeleVision, a coarse-to-fine teleoperation system with immersive situated visualization for enhanced ergonomics. At its core, a coarse-to-fine control mechanism is proposed in the retargeting module to bridge workspace disparities, jointly optimizing efficiency and physical ergonomics. To stream immersive feedback with adequate visual cues for human vision systems, an on-demand situated visualization technique is integrated in the perception module, which reduces the cognitive load for multi-view processing. The system is built on a humanoid collaborative robot and validated with six challenging bimanual manipulation tasks. A user study among 24 participants confirms that CaFe-TeleVision enhances ergonomics with statistical significance, indicating a lower task load and a higher user acceptance during teleoperation. Quantitative results also validate superior performance of our system across six tasks, surpassing comparative methods by up to 28.89% in success rate and reducing completion time by 26.81%. The system will be open-sourced later.
|
| |
| 09:00-10:30, Paper TuI1I.396 | Add to My Program |
| ID(O): Mapping Data Quantization for Bathymetric Collaborative SLAM |
|
| Zhang, Qianyi | Korea Advanced Institute of Science and Technology |
| Kim, Jinwhan | KAIST |
Keywords: Marine Robotics, Autonomous Vehicle Navigation, Multi-Robot Systems, Bathymetric SLAM
Abstract: Underwater acoustic communication, characterized by limited bandwidth, high latency, and low reliability, poses significant challenges for data exchange in bathymetric collaborative simultaneous localization and mapping (CSLAM). In this article, we introduce a novel vector quantization (VQ) method called ID(O) for mapping data compression in bathymetric CSLAM. ID(O) encodes the map into an index map (I), a central depth map (D), and an orientation map (O). To accommodate strict communication constraints, orientations can be partially or fully excluded from transmission, and we propose a method to estimate these orientations during map restoration. Moreover, we integrate ID(O) within a feature-based bathymetric CSLAM framework named TTT CSLAM. Extensive experiments on two large-scale sea trial datasets demonstrate that ID(O) achieves about 40% higher restoration accuracy than the baseline method using principal component analysis. In mapping accuracy and efficiency, TTT CSLAM with ID(O) matches its lossless-compression counterpart, and it is robust against 40% packet loss and large dead reckoning drift errors across diverse environments. To the best of the authors' knowledge, ID(O) is the first VQ method for bathymetric data compression, and TTT CSLAM with ID(O) is the first bathymetric CSLAM tested within an underwater communication network employed by acoustic modems.
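A toy analogue of the index-map/central-depth split can illustrate the encoding idea. The patch size, codebook, and depth values below are invented, and the real ID(O) additionally encodes orientations (O):

```python
import numpy as np

def vq_encode(depth, codebook, patch=2):
    """Encode a depth grid as a central-depth map D plus an index map I
    into a codebook of mean-removed patch patterns (toy sketch of the
    I/D part of ID(O); the hand-picked codebook is illustrative)."""
    h, w = depth.shape
    I = np.zeros((h // patch, w // patch), dtype=int)
    D = np.zeros_like(I, dtype=float)
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            block = depth[r:r + patch, c:c + patch]
            center = block.mean()
            D[r // patch, c // patch] = center
            residual = (block - center).ravel()
            # nearest codeword in squared-error sense
            I[r // patch, c // patch] = int(
                np.argmin(((codebook - residual) ** 2).sum(axis=1)))
    return I, D

def vq_decode(I, D, codebook, patch=2):
    """Restore the depth grid from the transmitted I and D maps."""
    out = np.zeros((I.shape[0] * patch, I.shape[1] * patch))
    for r in range(I.shape[0]):
        for c in range(I.shape[1]):
            out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = (
                codebook[I[r, c]].reshape(patch, patch) + D[r, c])
    return out

codebook = np.array([[0., 0., 0., 0.],      # flat patch
                     [-1., -1., 1., 1.]])   # sloped patch
depth = np.array([[5., 5.], [5., 5.]])
I, D = vq_encode(depth, codebook)
restored = vq_decode(I, D, codebook)
```

Transmitting only small index and central-depth arrays instead of raw grids is what makes such schemes attractive under acoustic bandwidth limits.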
|
| |
| 09:00-10:30, Paper TuI1I.397 | Add to My Program |
| Learning 6D Object Pose Estimation with Event Cameras Using Synthetic Data and Domain Randomization |
|
| Abdul Hay, Oussama | Khalifa University |
| Huang, Xiaoqian | Khalifa University |
| Humais, Muhammad Ahmed | Khalifa University |
| Ayyad, Abdulla | Khalifa University |
| Almadhoun, Randa | University of Sunderland |
| Zweiri, Yahya | Khalifa University |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Estimating the 6D pose of rigid objects is a critical upstream task in many robotics applications. Most existing methods rely on RGB or RGB-D sensing modalities, which suffer from limitations under challenging lighting conditions and high-speed motion. In contrast, event-based cameras offer unique advantages such as high temporal resolution and high dynamic range, making them well-suited for such scenarios. However, current event-based pose estimation methods are typically optimization-based, designed for relatively simple objects, and require hand-crafted parameters. In this work, we introduce the first learning-based approach for 6D object pose estimation using event cameras, employing an Augmented Event Encoder (AEE) trained entirely on synthetic data and validated on the E-POSE dataset. Our model leverages an augmented autoencoder with domain randomization to map synthetic templates into a latent space, enabling accurate matching with real event query images. The method demonstrates robust performance across various scenarios, including changes in illumination and camera speeds, and achieves strong results on the ADD-S (Rotation) metric.
|
| |
| 09:00-10:30, Paper TuI1I.398 | Add to My Program |
| Factor Graph-Based Ground Truth Trajectory Estimation by Fusing Robotic Total Station and Inertial Measurements |
|
| Mittelstedt, Manuel | Universität Bonn |
| Esser, Felix | University of Bonn |
| Tombrink, Gereon | University of Bonn |
| Klingbeil, Lasse | University of Bonn |
| Kuhlmann, Heiner | University of Bonn |
Keywords: Localization, Sensor Fusion
Abstract: The application of mobile mapping systems (MMS) has increased continuously in the last decades in fields like infrastructure or ecosystem monitoring. Equipped with multiple laser scanners and cameras, these systems can generate high-resolution 3D point clouds of the environment in a short time. In this process, the accuracy of the trajectory of the system is of central importance as it directly affects the accuracy of the resulting point cloud. However, since the trajectory estimation depends on sensor observations that are often affected by unknown systematic errors, the actual accuracy of the trajectory remains mainly unknown. To uncover the gap in the trajectory accuracy assessment, we present a method to create ground truth trajectories for mobile mapping systems by integrating millimeter-accurate total station measurements. We mount two 360-degree prisms on a mobile platform, track them with two Robotic Total Stations (RTS) during motion, and fuse these prism measurements with the readings from an Inertial Measurement Unit (IMU) using a factor graph-based trajectory estimation approach. To evaluate the quality of this ground truth trajectory, we record repeated measurements on a closed-loop rail track close to Bonn, Germany. The results show that the generated ground truth trajectory estimated with RTS and IMU data achieves a precision of around 1 mm in position and 0.05° in orientation. To show the potential of the method, we detect systematic deviations of an example MMS that uses Real-Time Kinematic Global Navigation Satellite System (RTK-GNSS) and IMU data for trajectory estimation. The results show that even under good GNSS conditions, the ground truth trajectory has significantly better precision and fewer systematic errors than the trajectory based on RTK-GNSS and IMU data.
|
| |
| 09:00-10:30, Paper TuI1I.399 | Add to My Program |
| Real-Time Velocity Profile Optimization for Time-Optimal Maneuvering with Generic Acceleration Constraints |
|
| Piazza, Mattia | University of Trento |
| Piccinini, Mattia | Technical University of Munich |
| Taddei, Sebastiano | University of Trento, Politecnico Di Bari |
| Biral, Francesco | University of Trento |
| Bertolazzi, Enrico | University of Trento |
Keywords: Optimization and Optimal Control, Constrained Motion Planning, Wheeled Robots
Abstract: The computation of time-optimal velocity profiles along prescribed paths, subject to generic acceleration constraints, is a crucial problem in robot trajectory planning, with particular relevance to autonomous racing. However, the existing methods either support arbitrary acceleration constraints at high computational cost or use conservative box constraints for computational efficiency. We propose FBGA, a new Forward-Backward algorithm with Generic Acceleration constraints, which achieves both high accuracy and low computation time. FBGA performs forward and backward passes to maximize the velocity profile over short, discretized path segments, while satisfying user-defined performance limits. Tested on five racetracks and two vehicle classes, FBGA handles complex, non-convex acceleration constraints with custom formulations. Its maneuvers and lap times closely match optimal control baselines (within 0.11% to 0.36%), while being up to three orders of magnitude faster. FBGA maintains high accuracy even with coarse discretization, making it well suited for online multi-query trajectory planning.
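A minimal forward-backward pass under a simple box acceleration limit shows the mechanism; FBGA itself supports generic, non-convex constraints, so this is only an illustrative baseline with invented numbers:

```python
import math

def forward_backward(ds, v_max, a_max):
    """Classic forward-backward speed-profile pass over a discretized
    path with spacing ds, pointwise speed limits v_max, and a box
    acceleration bound a_max (illustrative baseline, not FBGA)."""
    n = len(v_max)
    v = [0.0] * n
    # forward pass: accelerate as hard as the limit allows from a standstill
    for i in range(1, n):
        v[i] = min(v_max[i], math.sqrt(v[i - 1] ** 2 + 2 * a_max * ds))
    # backward pass: enforce braking feasibility toward the final stop
    v[-1] = 0.0
    for i in range(n - 2, -1, -1):
        v[i] = min(v[i], math.sqrt(v[i + 1] ** 2 + 2 * a_max * ds))
    return v

profile = forward_backward(ds=1.0, v_max=[5.0] * 6, a_max=2.0)
```

Each pass only tightens speeds, so the result is the pointwise maximum profile satisfying both the speed limits and the acceleration bound, which is what makes the two-pass scheme so cheap.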
|
| |
| 09:00-10:30, Paper TuI1I.400 | Add to My Program |
| Real-Time Multi-Level Terrain-Aware Path Planning for Ground Mobile Robots in Large-Scale Rough Terrains |
|
| Li, Yuxiang | Harbin Institute of Technology, Shenzhen |
| Chen, Kun | Harbin Institute of Technology, Shenzhen |
| Wang, Yifei | Harbin Institute of Technology Shenzhen |
| Zhang, Weifan | Harbin Institute of Technology, Shenzhen |
| Wang, Jiancheng | Harbin Institute of Technology, Shenzhen |
| Chen, Haoyao | Harbin Institute of Technology, Shenzhen |
| Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Robotics in Hazardous Fields, Configuration Stability Estimation
Abstract: Autonomous ground mobile robots rely on their configuration characteristics to prevent tip-overs and collisions, ensuring safe navigation in complex environments. However, complex configurations with specially designed links and joints produce a higher dimensional workspace and bring significant challenges for path planning, especially in large-scale rough terrains. To address this, we propose a real-time multilevel terrain-aware path planning framework that integrates different levels of terrain awareness into the global and local layers. An implicit map representation is introduced at the global layer to enable efficient terrain analysis and path planning, while an iterative geometric evaluation is designed at the local layer to estimate configuration stability and improve path smoothness. By sharing the global layer information with the local layer, the framework enhances path planning efficiency and adaptability in complex environments. Its modular design supports diverse robot configurations and pathfinding algorithms, enabling effective autonomous navigation in large-scale 3-D terrains with online or offline maps. Simulations and real-world experiments demonstrated that our approach outperforms the state of the art across diverse environments, including uneven terrains, multilayered structures, and complex debris fields. The results highlighted that our approach provides faster and safer path planning, more accurate and robust configuration-stability estimation, and higher success rates in traversing complex 3-D environments.
|
| |
| 09:00-10:30, Paper TuI1I.401 | Add to My Program |
| Learning on the Fly: Rapid Policy Adaptation Via Differentiable Simulation |
|
| Pan, Jiahe | ETH Zurich |
| Xing, Jiaxu | University of Zurich |
| Reiter, Rudolf | University of Zurich |
| Zhai, Yifan | Robotics and Perception Group |
| Aljalbout, Elie | NVIDIA |
| Scaramuzza, Davide | University of Zurich |
Keywords: Machine Learning for Robot Control, Aerial Systems: Perception and Autonomy, Continual Learning
Abstract: Learning control policies in simulation enables rapid, safe, and cost-effective development of advanced robotic capabilities. However, transferring these policies to the real world remains difficult due to the sim-to-real gap, where unmodeled dynamics and environmental disturbances can degrade policy performance. Existing approaches, such as domain randomization and Real2Sim2Real pipelines, can improve policy robustness, but either struggle under out-of-distribution conditions or require costly offline retraining. In this work, we approach these problems from a different perspective. Instead of relying on diverse training conditions before deployment, we focus on rapidly adapting the learned policy in the real world in an online fashion. To achieve this, we propose a novel online adaptive learning framework that unifies residual dynamics learning with real-time policy adaptation inside a differentiable simulation. Starting from a simple dynamics model, our framework continuously refines the model using real-world data to capture unmodeled effects and disturbances, such as payload changes and wind. The refined dynamics model is embedded in a differentiable simulation framework, enabling gradient backpropagation through the dynamics and thus rapid, sample-efficient policy updates. All components of our system are designed for rapid adaptation, enabling the policy to adjust to unseen disturbances within 5 seconds of training. We validate the approach on agile quadrotor control under various disturbances in both simulation and the real world. Our framework reduces hovering error by up to 81% compared to L1-MPC and 55% compared to DATT, while also demonstrating robustness in vision-based control without explicit state estimation.
|
| |
| 09:00-10:30, Paper TuI1I.402 | Add to My Program |
| Soft Robotic Delivery of Coiled Anchors for Cardiac Interventions |
|
| Zamora Yañez, Leonardo | Boston University |
| Rogatinsky, Jacob | Boston University |
| Recco, Dominic | Boston Children's Hospital |
| Lee, Sang-Yoep | Boston University |
| Matthews, Grace | Boston University |
| Sabelhaus, Andrew | Boston University |
| Hoganson, David | Boston Children's Hospital |
| Ranzani, Tommaso | Boston University |
Keywords: Medical Robots and Systems, Soft Robot Applications
Abstract: Trans-catheter cardiac intervention has become an increasingly available option for high-risk patients without the complications of open heart surgery. However, current catheter-based platforms suffer from a lack of dexterity, force application, and compliance required to perform complex intracardiac procedures. An exemplary task that would significantly ease minimally invasive intracardiac procedures is the implantation of anchor coils, which can be used to fix and implant various devices. We introduce a robotic platform capable of delivering anchor coils. We develop a kineto-statics model of the robotic platform and demonstrate low positional error. We leverage the passive compliance and high force output of the actuator in a multi-anchor delivery procedure against a motile in-vitro simulator with millimeter-level accuracy.
|
| |
| 09:00-10:30, Paper TuI1I.403 | Add to My Program |
| Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation |
|
| Korekata, Ryosuke | Keio University |
| Xie, Quanting | Carnegie Mellon University |
| Bisk, Yonatan | Carnegie Mellon University |
| Sugiura, Komei | Keio University |
Keywords: Deep Learning Methods, Deep Learning for Visual Perception
Abstract: In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
|
| |
| 09:00-10:30, Paper TuI1I.404 | Add to My Program |
| VOCALoco: Viability-Optimized Cost-Aware Adaptive Locomotion |
|
| Wu, Stanley | McGill University |
| Danesh, Mohamad Hosein | McGill University |
| Li, Simon | McGill University |
| Yurchyk, Hanna | McGill University, Mila |
| Abyaneh, Amin | McGill University |
| El Houssaini, Anas | McGill University |
| Meger, David Paul | McGill University |
| Lin, Hsiu-Chin | McGill University |
Keywords: Legged Robots, Integrated Planning and Learning, Reinforcement Learning
Abstract: Recent advancements in legged robot locomotion have facilitated traversal over increasingly complex terrains. Despite this progress, many existing approaches rely on end-to-end deep reinforcement learning (DRL), which poses limitations in terms of safety and interpretability, especially when generalizing to novel terrains. To overcome these challenges, we introduce VOCALoco, a modular skill-selection framework that dynamically adapts locomotion strategies based on perceptual input. Given a set of pre-trained locomotion policies, VOCALoco evaluates their viability and energy-consumption by predicting both the safety of execution and the anticipated cost of transport over a fixed planning horizon. This joint assessment enables the selection of policies that are both safe and energy-efficient, given the observed local terrain. We evaluate our approach on staircase locomotion tasks, demonstrating its performance in both simulated and real-world scenarios using a quadrupedal robot. Empirical results show that VOCALoco achieves improved robustness and safety during stair ascent and descent compared to a conventional end-to-end DRL policy.
|
| |
| 09:00-10:30, Paper TuI1I.405 | Add to My Program |
| RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation |
|
| Patel, Naman | New York University Tandon School of Engineering |
| Krishnamurthy, Prashanth | New York University Tandon School of Engineering |
| Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Semantic Scene Understanding, RGB-D Perception, Recognition, SLAM
Abstract: Mapping and understanding complex 3D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build semantic maps with open-vocabulary during online operation. Although recent vision-language models have enabled open-vocabulary object recognition in 2D images, they have not yet bridged the gap to 3D spatial understanding. The critical challenge lies in developing a training-free unified system that can simultaneously construct accurate 3D maps while maintaining semantic consistency and supporting natural language interactions in real time. In this paper, we develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing.
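Instance-level semantic embedding fusion of the kind described can be sketched as cosine-similarity association plus a running average. The threshold and data structures below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fuse(memory, emb, thresh=0.8):
    """Associate a new instance embedding with stored instances by
    cosine similarity and fuse by running average; otherwise register
    a new instance (generic sketch with an assumed threshold)."""
    emb = emb / np.linalg.norm(emb)
    for m in memory:
        if float(m["vec"] @ emb) > thresh:       # cosine match
            v = m["vec"] * m["count"] + emb      # running-average fuse
            m["vec"] = v / np.linalg.norm(v)
            m["count"] += 1
            return m
    m = {"vec": emb, "count": 1}                 # new instance
    memory.append(m)
    return m

mem = []
fuse(mem, np.array([1.0, 0.0]))
fuse(mem, np.array([0.98, 0.05]))   # close to the first observation
fuse(mem, np.array([0.0, 1.0]))    # dissimilar: a new instance
```

Keeping one fused embedding per instance is what lets open-vocabulary queries be answered against the map without retraining.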
|
| |
| 09:00-10:30, Paper TuI1I.406 | Add to My Program |
| Filament Sliding Linear Potentiometer-Based Data Glove (FLiPo) for Precisely Annotating Human Finger Poses |
|
| Xia, Zhisheng | Huazhong University of Science and Technology |
| Yong, Haochen | Huazhong University of Science and Technology |
| Liu, Qilong | Huazhong University of Science and Technology |
| Ke, Zhenghao | Huazhong University of Science and Technology |
| Ding, Han | Huazhong University of Science and Technology |
| Wu, Zhigang | Huazhong University of Science and Technology |
Keywords: Gesture, Posture and Facial Expressions, Wearable Robotics, Human-Centered Automation
Abstract: Data gloves offer excellent portability and a strong ability to handle occluded movements, making them more advantageous over other methods for capturing complex hand motions in unstructured environments. However, the majority of existing hand-motion-capture gloves do not preserve visual features of the hand, which critically hinders their applicability for automatic pose annotation in RGB images. Here, we propose a data glove based on filament-sliding linear potentiometers (FLiPo), which can maintain finger appearance and ensure high accuracy as well as robustness, paving the way for automatic annotation. In FLiPo, fine filaments (diameter 0.1 mm) are deployed on finger skin to transmit joint arc length variations as well as preserve the hand's visual features, while linear potentiometers used to capture filament length changes are positioned on the arm. Simultaneously, a quantitative occlusion scoring metric is proposed to evaluate the degree of finger occlusion caused by the device. Further, we experimentally analyze the nonlinearities induced by biaxial joint coupling and skin tissue artifact (STA)-related hysteresis, and employ a fully connected neural network to map arc length to joint angles with a joint-angle MAE of 2.15°. Meanwhile, tests under challenging environmental conditions, including heat, moisture, and magnetic interference, are conducted to evaluate its stability. Finally, the system's capability for real-time pose capture with high accuracy, robustness, and low occlusion was demonstrated.
|
| |
| 09:00-10:30, Paper TuI1I.407 | Add to My Program |
| Back from the Dead: Self-Recovery Strategy for Modular Planetary Exploration Robots (I) |
|
| Zhao, Ning | Harbin Institute of Technology |
| Liang, Dawei | Harbin Institute of Technology |
| Yang, Zhiyuan | Harbin Institute of Technology |
| Qi, Jian | Harbin Institute of Technology |
| Zhao, Jie | Harbin Institute of Technology |
| Zhu, Yanhe | Harbin Institute of Technology |
| |
| 09:00-10:30, Paper TuI1I.408 | Add to My Program |
| Fitts' List Revisited: An Empirical Study on Function Allocation in a Two-Agent Physical Human-Robot Collaborative Position/Force Task |
|
| Mol, Nicky | Delft University of Technology |
| Prendergast, J. Micah | Delft University of Technology |
| Abbink, David A. | Delft University of Technology |
| Peternel, Luka | Delft University of Technology |
Keywords: Physical Human-Robot Interaction, Human Factors and Human-in-the-Loop, Human-Centered Robotics
Abstract: In this letter, we investigate whether classical function allocation--the principle of assigning tasks to either a human or a machine--holds for physical Human-Robot Collaboration, which is important for providing insights for Industry 5.0 to guide how to best augment rather than replace workers. This study empirically tests the applicability of Fitts' List within physical Human-Robot Collaboration, by conducting a user study (N=26, within-subject design) to evaluate four distinct allocations of position/force control between human and robot in an abstract blending task. We hypothesize that the allocation in which humans control the position achieves better performance and receives higher user ratings. When allocating position control to the human and force control to the robot, compared to the opposite case, we observed a significant improvement in preventing overblending. This allocation was also perceived as better in terms of physical demand and overall system acceptance, while participants experienced greater autonomy, more engagement and less frustration. An interesting insight was that the supervisory role (when the robot controls both position and force) was rated second best in terms of subjective acceptance. Another surprising insight was that if position control was delegated to the robot, the participants perceived much lower autonomy than when the force control was delegated to the robot. These findings empirically support applying Fitts' principles to static function allocation for physical collaboration, while also revealing important nuanced user experience trade-offs, particularly regarding perceived autonomy when delegating position control.
|
| |
| 09:00-10:30, Paper TuI1I.409 | Add to My Program |
| SIS: Seam-Informed Strategy for T-Shirt Unfolding |
|
| Huang, Xuzhao | The University of Hong Kong |
| Seino, Akira | Centre for Transformative Garment Production |
| Tokuda, Fuyuki | Tohoku University |
| Kobayashi, Akinari | Centre for Transformative Garment Production |
| Chen, Dayuan | Tohoku University |
| Hirata, Yasuhisa | Tohoku University |
| Tien, Norman | University of Hong Kong |
| Kosuge, Kazuhiro | The University of Hong Kong |
Keywords: Perception for Grasping and Manipulation, Bimanual Manipulation, Manipulation Planning
Abstract: Seams are information-rich components of garments. The presence of different types of seams and their combinations helps to select grasping points for garment handling. In this paper, we propose a new Seam-Informed Strategy (SIS) for finding actions for handling a garment, such as grasping and unfolding a T-shirt. Candidates for a pair of grasping points for a dual-arm manipulator system are extracted using the proposed Seam Feature Extraction Method (SFEM). A pair of grasping points for the robot system is selected by the proposed Decision Matrix Iteration Method (DMIM). The decision matrix is first computed by multiple human demonstrations and updated by the robot execution results to improve the grasping and unfolding performance of the robot. Note that the proposed scheme is trained on real data without relying on simulation. Experimental results demonstrate the effectiveness and generalization ability of the proposed strategy.
|
| |
| 09:00-10:30, Paper TuI1I.410 | Add to My Program |
| Safe and Stable Neural Network Dynamical Systems for Robot Motion Planning |
|
| Binny, Allen Emmanuel | Indian Institute of Technology Kharagpur |
| Anand, Mahathi | Technical University of Munich |
| Kussaba, Hugo Tadashi | University of Brasília |
| Chen, Lingyun | Technical University of Munich |
| Agrawal, Shreenabh | Indian Institute of Science, Bangalore |
| Abu-Dakka, Fares | New York University Abu Dhabi |
| Swikir, Abdalla | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Learning from Demonstration, Robot Safety, Formal Methods in Robotics and Automation
Abstract: Learning safe and stable robot motions from demonstrations remains a challenge, especially in complex, nonlinear tasks involving dynamic, obstacle-rich environments. In this paper, we propose Safe and Stable Neural Network Dynamical Systems S²-NNDS, a learning-from-demonstration framework that simultaneously learns expressive neural dynamical systems alongside neural Lyapunov stability and barrier safety certificates. Unlike traditional approaches with restrictive polynomial parameterizations, S²-NNDS leverages neural networks to capture complex robot motions providing probabilistic guarantees through split conformal prediction in learned certificates. Experimental results on various 2D and 3D datasets—including LASA handwriting and demonstrations recorded kinesthetically from the Franka Emika Panda robot—validate S²-NNDS effectiveness in learning robust, safe, and stable motions from potentially unsafe demonstrations.
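The probabilistic guarantees via split conformal prediction mentioned in the S²-NNDS abstract follow a standard recipe. As a generic, hedged illustration (not the authors' implementation; all values below are toy data), the calibration step can be sketched as:

```python
import math

def conformal_quantile(residuals, alpha):
    """Split conformal prediction: return the (1 - alpha) quantile of
    held-out calibration residuals, with the standard finite-sample
    correction ceil((n + 1)(1 - alpha))."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]

# Toy calibration set of |y - model(x)| residuals
residuals = [0.1, 0.3, 0.2, 0.5, 0.4, 0.25, 0.15, 0.35, 0.45, 0.05]
q = conformal_quantile(residuals, alpha=0.2)
# Any new prediction y_hat then carries the interval [y_hat - q, y_hat + q],
# which covers the true value with probability >= 1 - alpha.
```

The appeal of this construction is that the coverage guarantee is distribution-free: it holds for any underlying model, which is why it can wrap a learned certificate.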
|
| |
| 09:00-10:30, Paper TuI1I.411 | Add to My Program |
| Game-KFS: Game-Theory-Inspired Keyframe Selection for Hybrid Representation Visual SLAM |
|
| Chen, Shilang | Guangdong University of Technology |
| Yang, Bo | Nanjing University of Information Science & Technology |
| Wang, Chaoqun | Shandong University |
| Fang, Peidong | Guangdong University of Technology |
| Zhu, Haifei | Guangdong University of Technology |
| Chen, Weinan | Guangdong University of Technology |
| Guan, Yisheng | Guangdong University of Technology |
Keywords: SLAM, Mapping, Localization
Abstract: Hybrid representation Visual Simultaneous Localization and Mapping (VSLAM) systems combine the inherent strengths of both discrete and field representations. They promise high-precision tracking and photo-realistic dense mapping. However, current keyframe selection methods in hybrid representation VSLAM struggle to satisfy both the high-precision tracking requirements of discrete representations and the high-quality rendering requirements of field representations. In this paper, we propose a game-theory-inspired keyframe selection approach that addresses the requirements of both representation types. We introduce two objective functions to comprehensively assess discrete point tracking and radiance field model rendering. By employing a game-theory-inspired framework, our method effectively balances these objectives to achieve improved keyframe selection. Experimental results demonstrate that integrating our approach into a hybrid representation VSLAM system significantly enhances tracking accuracy and rendering quality, outperforming existing keyframe selection methods.
|
| |
| 09:00-10:30, Paper TuI1I.412 | Add to My Program |
| On Your Own - Pro-Level Autonomous Drone Racing in Uninstrumented Arenas |
|
| Bosello, Michael | TII - Technology Innovation Institute |
| Pinzarrone, Flavio | TII - Technology Innovation Institute |
| Kiade, Sara | TII - Technology Innovation Institute |
| Aguiari, Davide | TII - Technology Innovation Institute |
| Keuter, Yvo | TII - Technology Innovation Institute |
| AlShehhi, Aaesha | TII - Technology Innovation Institute |
| Caminati, Gyordan | TII - Technology Innovation Institute |
| Wong, Kei Long | TII - Technology Innovation Institute |
| Chou, Ka Seng | TII - Technology Innovation Institute |
| Halepota, Junaid | TII - Technology Innovation Institute |
| Alneyadi, Fares | TII - Technology Innovation Institute |
| Panerati, Jacopo | NRC |
| Pau, Giovanni | TII - Technology Innovation Institute |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Applications, Data Sets for Robot Learning
Abstract: Drone technology is proliferating in many industries, including agriculture, logistics, defense, infrastructure, and environmental monitoring. Vision-based autonomy is one of its key enablers, particularly for real-world applications. This is essential for operating in novel, unstructured environments where traditional navigation methods may be unavailable. Autonomous drone racing has become the de facto benchmark for such systems. State-of-the-art research has shown that autonomous systems can surpass human-level performance in racing arenas. However, direct applicability to commercial and field operations is still limited as current systems are often trained and evaluated in highly controlled environments. In our contribution, the system's capabilities are analyzed within a controlled environment---where external tracking is available for ground-truth comparison---but also demonstrated in a challenging, uninstrumented environment---where ground-truth measurements were never available. We show that our approach can match the performance of professional human pilots in both scenarios. We also publicly release the data from the flights carried out by our approach and a world-class human pilot.
|
| |
| 09:00-10:30, Paper TuI1I.413 | Add to My Program |
| Progressive-Resolution Policy Distillation: Leveraging Coarse-Resolution Simulations for Time-Efficient Fine-Resolution Policy Learning (I) |
|
| Kadokawa, Yuki | Nara Institute of Science and Technology |
| Tahara, Hirotaka | Nara Institute of Science and Technology |
| Matsubara, Takamitsu | Nara Institute of Science and Technology |
Keywords: Mining Robotics, Robotics and Automation in Construction, Reinforcement Learning
Abstract: In earthwork and construction, excavators often encounter large rocks mixed with various soil conditions, requiring skilled operators. This paper presents a framework for achieving autonomous excavation using reinforcement learning (RL) through a rock excavation simulator. In the simulation, resolution can be defined by the particle size/number in the whole soil space. Fine-resolution simulations closely mimic real-world behavior but demand significant computation time and make sample collection challenging, while coarse-resolution simulations enable faster sample collection but deviate from real-world behavior. To combine the advantages of both resolutions, we explore using policies developed in coarse-resolution simulations for pre-training in fine-resolution simulations. To this end, we propose a novel policy learning framework called Progressive-Resolution Policy Distillation (PRPD), which progressively transfers policies through a series of intermediate-resolution simulations with conservative policy transfer to avoid domain gaps that could lead to policy transfer failure. Validation in a rock excavation simulator and nine real-world rock environments demonstrated that PRPD reduced sampling time to less than 1/7 while maintaining task success rates comparable to those achieved through policy learning in a fine-resolution simulation.
|
| |
| 09:00-10:30, Paper TuI1I.414 | Add to My Program |
| DAPPER: Discriminability-Aware Policy-To-Policy Preference-Based Reinforcement Learning for Query-Efficient Robot Skill Acquisition |
|
| Kadokawa, Yuki | Nara Institute of Science and Technology |
| Frey, Jonas | ETH Zurich |
| Miki, Takahiro | ETH Zurich |
| Matsubara, Takamitsu | Nara Institute of Science and Technology |
| Hutter, Marco | ETH Zurich |
Keywords: Reinforcement Learning, Human Factors and Human-in-the-Loop, Legged Robots
Abstract: Preference-based Reinforcement Learning (PbRL) enables policy learning through simple queries comparing trajectories from a single policy, yet suffers from low query efficiency as policy bias limits trajectory diversity and reduces discriminable queries for learning human preferences. This paper identifies preference discriminability, which quantifies how easily a human can judge which trajectory is closer to their ideal behavior, as a key metric for improving query efficiency. To address this, we move beyond single-policy sampling and generate queries by comparing trajectories from different policies, as learning multiple policies from scratch promotes trajectory diversity without policy bias. We propose Discriminability-Aware Policy-to-Policy Preference-Based Efficient Reinforcement Learning (DAPPER), which integrates preference discriminability with trajectory diversification achieved by multiple policies. DAPPER trains new policies from scratch after each reward update and employs a discriminator that learns to estimate preference discriminability, enabling the prioritized sampling of more discriminable queries. During training, it jointly maximizes the preference reward and pr
|
| |
| 09:00-10:30, Paper TuI1I.415 | Add to My Program |
| TARAD: Task-Aware Robot Affordance-Centric Diffusion Policy Learned from LLM-Generated Demonstrations |
|
| Hu, Site | The University of Osaka |
| Nagai, Takayuki | The University of Osaka |
| Horii, Takato | The University of Osaka |
Keywords: AI-Enabled Robotics, Learning from Demonstration, Manipulation Planning
Abstract: In open-ended task settings, the ability of a robot to execute diverse tasks accurately by following language instructions is critical. Methods based on traditional imitation learning typically depend on extensive expert demonstrations and often struggle to generalize in the case of unseen scenarios or tasks. Recently, approaches leveraging large foundation models have demonstrated improved generalization by enhancing task comprehension in novel scenarios based on the intrinsic world knowledge embedded in these models. However, these methods rely on predefined motion primitives and lack a detailed understanding of the environment, which is essential for successful execution. Herein, we introduce Task-Aware Robot Affordance-Centric Diffusion Policy (TARAD), a novel framework for robot manipulation. TARAD leverages LLMs and VLMs to perform high-level planning from natural language instructions and extract affordance information from the robot’s observations. A heuristic motion planner is employed for low-level motion planning, enabling zero-shot trajectory synthesis and the fully automatic generation of a dataset with language labels and affordances. By incorporating affordances into the observation space, our approach integrates the intrinsic commonsense and reasoning capabilities of foundation models into imitation learning, enabling the training of an affordance-centric, multi-task 3D diffusion policy. Empirical evaluations in both the RLBench simulated environments and real-world experiments with UR5e demonstrate that TARAD effectively combines the precise control of imitation learning with the strong generalization capabilities of foundation models, all without relying on expert demonstrations or predefined motion primitives.
|
| |
| 09:00-10:30, Paper TuI1I.416 | Add to My Program |
| Risk-Aware Routing for a Robot in a Shared Dynamic Environment |
|
| Stracca, Elena | University of Pisa |
| Grioli, Giorgio | Istituto Italiano Di Tecnologia |
| Pallottino, Lucia | Università Di Pisa |
| Salaris, Paolo | University of Pisa |
Keywords: Autonomous Vehicle Navigation, Reactive and Sensor-Based Planning, Motion and Path Planning, Stochastic Shortest Path With Recourse
Abstract: This paper explores the challenge of optimal routing for a mobile robot navigating a dynamic and shared human environment. The primary goal is to minimize the risk of performance degradation during motion, such as delays in completing tasks due to the need for safe or acceptable human-robot encounters. The problem is formulated as a graph whose edge costs become progressively known only as the robot moves through the environment. We model this problem as a Markov Decision Process (MDP), enabling an offline evaluation of the expected cost of alternative routes based on statistical information about human spatial distributions and possible observations at each intersection. This compact state representation scales linearly with the number of intersections in the map. Since the memoryless property of the MDP may induce loops during online execution, we compute an offline policy and introduce an online policy adaptation mechanism to prevent cyclic behaviors. Extensive simulations across environments of different complexity, and using data collected from real-world experiments, demonstrate that our approach outperforms reactive and advanced state-of-the-art planners in terms of either performance or scalability.
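The offline evaluation of expected route costs that this abstract describes rests on standard MDP machinery. As a generic, hedged sketch (a toy three-node map with made-up costs, not the paper's model), value iteration over a routing MDP looks like:

```python
def value_iteration(states, actions, transition, cost, gamma=1.0, iters=200):
    """Generic value iteration:
    V(s) = min_a [ cost(s, a) + gamma * sum_s' P(s'|s, a) V(s') ].
    `transition(s, a)` returns a list of (next_state, prob) pairs."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: min(
                cost(s, a) + gamma * sum(p * V[s2] for s2, p in transition(s, a))
                for a in actions(s)
            ) if actions(s) else 0.0
            for s in states
        }
    return V

# Toy map: from intersection A the robot picks a corridor whose outcome
# depends stochastically on whether humans occupy it.
states = ["A", "B", "goal"]
def actions(s):
    return {"A": ["short", "long"], "B": ["go"], "goal": []}[s]
def transition(s, a):
    if s == "A" and a == "short":
        return [("goal", 0.5), ("B", 0.5)]  # 50% chance the short way is blocked
    return [("goal", 1.0)]
def cost(s, a):
    return {"short": 1.0, "long": 4.0, "go": 2.0}[a]

V = value_iteration(states, actions, transition, cost)
```

Here the risky short corridor still wins in expectation (1 + 0.5·2 = 2 < 4), which mirrors the kind of offline trade-off the routing policy encodes; the paper's online adaptation layer to break loops is not reproduced here.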
|
| |
| 09:00-10:30, Paper TuI1I.417 | Add to My Program |
| Lie Group Implicit Kinematics for Redundant Parallel Manipulators: Left-Trivialized Extended Jacobians and Gradient-Based Online Redundancy Flows for Singularity Avoidance |
|
| Liu, Yifei | The University of British Columbia |
| Wen, Kefei | The University of British Columbia |
Keywords: Kinematics, Parallel Robots, Redundant Robots
Abstract: We present a Lie group implicit formulation for kinematically redundant parallel manipulators that yields left-trivialized extended Jacobians for the extended task variable x = (g, ρ) ∈ SE(3) × R. On top of this model we design a gradient-based redundancy flow on the redundancy manifold that empirically maintains a positive manipulability margin along prescribed SE(3) trajectories. The framework uses right-multiplicative state updates, remains compatible with automatic differentiation, and avoids mechanism-specific analytic Jacobians; it works with either direct inverse kinematics or a numeric solver. A specialization to SO(2)³ provides computation-friendly first- and second-order steps. We validate the approach on two representative mechanisms: a (6+3)-degree-of-freedom (DoF) Stewart platform and a Spherical–Revolute platform. Across dense-coverage orientation trajectories and interactive gamepad commands, the extended Jacobian remained well conditioned while the redundancy planner ran at approximately 2 kHz in software-in-the-loop on a laptop-class CPU. The method integrates cleanly with existing kinematic stacks and is suitable for real-time deployment.
|
| |
| 09:00-10:30, Paper TuI1I.418 | Add to My Program |
| Deep Learning–Driven Tumor Boundary Estimation Using Robotic Palpation in Minimally Invasive Surgery |
|
| Ryu, Youngjun | DGIST |
| Hong, Jeongbin | DGIST |
| Kee, Hyeonwoo | DGIST |
| Park, Sukho | DGIST |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Object Detection, Segmentation and Categorization
Abstract: Accurate estimation of tumor boundaries is critical for ensuring adequate surgical margins in robot-assisted minimally invasive surgery (RMIS). In this study, we present a method that estimates tumor boundaries in RMIS using sweeping palpation data acquired with a single force/torque (F/T) sensor. From the reconstructed surface, tissue displacement and normal force were derived to calculate stiffness, which was then used to construct a stiffness map. To reduce noise and enhance feature representations, we employed a sparse autoencoder (SAE). The SAE outputs were subsequently clustered with a Gaussian mixture model (GMM) and K-means to segment the tumor from normal tissue. Experiments with phantom models and an ex vivo model demonstrated that the SAE-based approach significantly improved the Dice similarity coefficient (DSC) and sensitivity while maintaining specificity, and reduced the Hausdorff distance (HD and HD95) and average symmetric surface distance (ASSD), compared with results from raw data. Importantly, when evaluated under clinically relevant surgical margin conditions, the estimated HD consistently remained below threshold across all models. These results indicate that the proposed method achieves both high accuracy and clinical feasibility without additional imaging devices or displacement sensors, highlighting its potential to support margin minimization and organ function preservation in RMIS.
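The segmentation of a stiffness map into tumor and normal tissue described above uses GMM and K-means clustering. As a generic, hedged illustration of the clustering idea only (a hand-rolled 1-D 2-means on toy stiffness values, not the authors' SAE+GMM pipeline), the tumor/normal split can be sketched as:

```python
def two_means_1d(values, iters=20):
    """Minimal 1-D 2-means: split stiffness samples into a softer and a
    stiffer cluster; the stiffer cluster is taken as the tumor label (1)."""
    c = [min(values), max(values)]  # initialize centers at the extremes
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            # assign each sample to its nearest center
            groups[abs(v - c[1]) < abs(v - c[0])].append(v)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    threshold = (c[0] + c[1]) / 2
    return [int(v > threshold) for v in values], c

# Toy stiffness map (arbitrary units): soft background with a stiff inclusion
stiffness = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 1.0]
labels, centers = two_means_1d(stiffness)
```

In the paper the clustering operates on SAE-denoised features rather than raw stiffness, which is what drives the reported gains in Dice and Hausdorff metrics.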
|
| |
| 09:00-10:30, Paper TuI1I.419 | Add to My Program |
| Augmented Tank-Based Control Guarantees Passive Individual Interaction Environment for Multiuser Haptic-Enabled Robotic Systems |
|
| Wang, Cui | Southern University of Science and Technology |
| Liu, Yudong | Southern University of Science and Technology |
| Sun, Chenyang | Southern University of Science and Technology |
| Li, Ping | Southern University of Science and Technology |
| Chen, Yi-Feng | Southern University of Science and Technology |
| Dong, Mingjie | Beijing University of Technology |
| Li, Zhenhong | University of Manchester |
| Liu, Lu | City University of Hong Kong |
| Zhang, Mingming | Southern University of Science and Technology |
Keywords: Multi-Robot Systems, Physical Human-Robot Interaction, Motion Control, Energy Tank
Abstract: Despite extensive investigations into the multiuser haptic-enabled robotic system (M-Hers), achieving scalable control design in the presence of nonpassive human operators remains a key challenge. This is primarily due to the increasing complexity of stability conditions and interaction coupling as the number of operators grows. In this study, we address this challenge in two steps. First, we introduce the individual interaction environment (IIE) to isolate the passivity violations, which facilitates the independent control design for each human–robot subsystem, thereby enhancing the scalability with respect to the number of subsystems. Second, within the IIE framework, we identify passivity-violating components caused by partners’ active behaviors and propose a novel augmented tank-based controller (ATBC) to guarantee passive IIE while maintaining high rendering accuracy. Specifically, the ATBC employs an energy-related power regulation strategy to enhance interaction safety and a time-varying control gain to mitigate the negative effects of power regulation on rendering fidelity. We validated the proposed method through collaborative haptic tasks on a customized M-Hers composed of three robots in four different scenarios. Comparative studies demonstrate that our approach effectively ensures IIE passivity in the presence of active human behaviors, while ensuring high reproducibility and achieving a favorable balance between passivity and rendering accuracy.
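The tank-based passivation underlying the ATBC builds on the standard energy-tank idea: active (passivity-violating) power is granted only while a virtual energy reservoir stays above a floor. As a generic, hedged sketch (a scalar tank with made-up power levels, not the augmented controller of the paper):

```python
def tank_step(E, P_in, P_out_req, dt, E_min=0.1, E_max=10.0):
    """One update of a scalar energy tank.
    E: current tank level; P_in: power dissipated by the controller
    (refills the tank); P_out_req: power an active behavior would inject.
    The request is granted only while the tank stays above E_min, so the
    overall interaction remains passive."""
    P_out = P_out_req if E - P_out_req * dt >= E_min else 0.0
    E = min(E_max, E + (P_in - P_out) * dt)
    return E, P_out

E = 1.0
granted = []
# An active partner requests 2 W for 0.5 s (10 ms steps); damping refills 0.2 W
for _ in range(50):
    E, P = tank_step(E, P_in=0.2, P_out_req=2.0, dt=0.01)
    granted.append(P)
```

In this toy run the tank drains and eventually denies the request, which is exactly the rendering-fidelity degradation that the paper's time-varying gain is designed to soften.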
|
| |
| 09:00-10:30, Paper TuI1I.420 | Add to My Program |
| SCANS: A Soft Gripper with Curvature and Spectroscopy Sensors for In-Hand Material Differentiation |
|
| Hanson, Nathaniel | Massachusetts Institute of Technology |
| Allison, Austin | Northeastern University |
| DiMarzio, Charles A | Northeastern University |
| Padir, Taskin | Northeastern University |
| Dorsey, Kristen | Northeastern University |
Keywords: Soft Sensors and Actuators, Soft Robot Applications, Grippers and Other End-Effectors
Abstract: We introduce the soft curvature and spectroscopy (SCANS) system: a versatile, electronics-free, fluidically actuated soft manipulator capable of assessing the spectral properties of objects either in hand or through pre-touch caging. This platform offers a wider spectral sensing capability than previous soft robotic counterparts. We perform a material analysis to explore optimal soft substrates for spectral sensing, and evaluate both pre-touch and in-hand performance. Experiments demonstrate explainable, statistical separation across diverse object classes and sizes (metal, wood, plastic, organic, paper, foam), with large spectral angle differences between items. Through linear discriminant analysis, we show that sensitivity in the near-infrared wavelengths is critical to distinguishing visually similar objects. These capabilities advance the potential of optics as a multi-functional sensory modality for soft robots.
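The "spectral angle differences" the SCANS abstract reports refer to the standard spectral angle metric. As a generic, hedged sketch (toy spectra, not the paper's measurements), the angle between two spectra is:

```python
import math

def spectral_angle(a, b):
    """Spectral angle (radians) between two spectra treated as vectors:
    theta = arccos(<a, b> / (||a|| * ||b||)). Scale-invariant, so it
    isolates material-dependent spectral shape from overall brightness."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

# Toy reflectance spectra sampled at a few near-infrared wavelengths
wood = [0.42, 0.45, 0.50, 0.55]
plastic = [0.60, 0.40, 0.30, 0.20]
wood_dim = [0.5 * x for x in wood]  # same material, half the illumination

theta = spectral_angle(wood, plastic)
```

The brightness invariance is the point: a dimly lit object keeps a near-zero angle to its brighter self, while a different material stands apart regardless of lighting.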
|
| |
| 09:00-10:30, Paper TuI1I.421 | Add to My Program |
| Equivalence of Closed Chains to Open Chains: Virtual Decomposition Control Combined with Adaptive RBF Neural Network for Hydraulic Robot Legs |
|
| Zhang, Kun | Zhejiang University |
| Zong, Huaizhi | Zhejiang University |
| Li, Yong | Zhejiang University |
| Ai, Jikun | Zhejiang University |
| Zhang, Junhui | Zhejiang University |
| Xu, Bing | ZheJiang University |
| |
| 09:00-10:30, Paper TuI1I.422 | Add to My Program |
| A Shared Control Architecture for Vitreoretinal Surgery with Safety Guarantee Using Control Barrier Functions |
|
| Piccinelli, Nicola | University of Verona |
| Vesentini, Federico | University of Verona |
| Briel, Marius | Carl Zeiss AG |
| Haide, Ludwig | Carl Zeiss AG |
| Tagliabue, Eleonora | Carl Zeiss AG |
| Pellegrini, Marco | University of Ferrara |
| Kronreif, Gernot | ACMIT Gmbh |
| Muradore, Riccardo | University of Verona |
Keywords: Surgical Robotics: Laparoscopy, Optimization and Optimal Control, Telerobotics and Teleoperation
Abstract: Control Barrier Functions (CBFs) provide a powerful framework for enforcing real-time safety in control systems and have seen increasing applications in safety-critical domains, such as surgical robotics. In vitreoretinal microsurgery, where precision and tissue protection are crucial, we propose a shared control approach that leverages CBFs to maintain the robot's end-effector within a safe zone above the retina. Using real-time 3D reconstruction from an instrument-integrated Optical Coherence Tomography (iiOCT) system mounted on the surgical tool, we define a safety band between two offset surfaces derived from the reconstructed retina. A hybrid controller drives the tool into the band when outside and then enforces forward invariance using a CBF-based quadratic program. Concurrently, haptic feedback proportional to the deviation from the band centre guides the surgeon toward the optimal working distance. We validate our method in ex vivo pig eye experiments, performing a simulated Vitreous Shaving (VS), showing improved safety and operator awareness.
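The CBF-based quadratic program used above to enforce forward invariance has a closed form in the simplest setting. As a generic, hedged illustration (a scalar single integrator with an illustrative barrier, not the paper's surgical controller):

```python
def cbf_safe_input(u_des, h, alpha=1.0):
    """Minimal scalar CBF filter for a single integrator x_dot = u with
    barrier h(x) (h >= 0 means safe) and dh/dx = 1.
    QP: min (u - u_des)^2  s.t.  dh/dt >= -alpha * h, i.e. u >= -alpha * h.
    The closed-form solution simply clamps the desired input."""
    return max(u_des, -alpha * h)

# Operator commands a fast motion toward the boundary (u_des = -3.0) while
# the tool is only h = 0.5 above the safety surface: the filter slows it so
# the safe set stays forward invariant.
u_near = cbf_safe_input(u_des=-3.0, h=0.5, alpha=1.0)
u_far = cbf_safe_input(u_des=-3.0, h=10.0)  # far from boundary: untouched
```

The key property is minimal invasiveness: the filter only intervenes when the commanded motion would violate the barrier condition, which is what makes CBF filters attractive for shared control.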
|
| |
| 09:00-10:30, Paper TuI1I.423 | Add to My Program |
| Virtual-Force Based Visual Servo for Multiple Peg-In-Hole Assembly with Tightly Coupled Multi-Manipulator |
|
| Zhang, Jiawei | Harbin Institute of Technology |
| Bai, Chengchao | Harbin Institute of Technology |
| Pan, Wei | The University of Manchester |
| Guo, Jifeng | Harbin Institute of Technology |
| Liu, Tianhang | Harbin Institute of Technology |
Keywords: Deep Learning in Grasping and Manipulation, Mobile Manipulation
Abstract: Multiple Peg-in-Hole (MPiH) assembly is one of the fundamental tasks in robotic assembly. In the MPiH tasks for large-size parts, it is challenging for a single manipulator to simultaneously align multiple distant pegs and holes, necessitating tightly coupled multi-manipulator systems. For such MPiH tasks using tightly coupled multiple manipulators, we propose a collaborative visual servo control framework that uses only the monocular in-hand cameras of each manipulator to reduce positioning errors. Initially, we train a state classification neural network and a positioning neural network. The former divides the states of the peg and hole in the image into three categories: obscured, separated, and overlapped, while the latter determines the position of the peg and hole in the image. Based on these findings, we propose a method to integrate the visual features of multiple manipulators using virtual forces, which can naturally combine with the cooperative controller of the multi-manipulator system. To generalize our approach to holes of different appearances, we varied the appearance of the holes during the dataset generation process. The results confirm that by considering the appearance of the holes, classification accuracy and positioning precision can be improved. Finally, the results show that our method achieves a success rate close to 100% in dual-manipulator dual peg-in-hole tasks with a clearance of 0.2 mm, while robust to camera calibration errors.
|
| |
| 09:00-10:30, Paper TuI1I.424 | Add to My Program |
| Motion Control and Power Distribution of H-Shaped Multi-Modal Transformable Rotorcraft |
|
| Wang, Xuqiao | Civil Aviation University of China |
| Guo, Da | Civil Aviation University of China |
| Zhao, Changli | Tianjin Sino-German University of Applied Sciences |
| Duan, Menghao | Civil Aviation University of China |
| Luo, Qijun | Civil Aviation University of China |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Motion Control
Abstract: Multilink transformable rotorcraft demonstrate exceptional flexibility when navigating confined spaces, yet face critical challenges including time-varying center of gravity, body misalignment, and the absence of a unified control strategy during dynamic reconfiguration, which severely restrict motion continuity and operational capability. To address these limitations, we propose an H-shaped multi-modal transformable rotorcraft. Its novelty lies in the co-design of a specialized mechanical architecture with 6 controllable degrees of freedom (CDOF) and an integrated control allocation framework, enabling the aircraft to achieve stable and continuous aerial transitions between high-passability, high fault-tolerance, and high-torque configurations. A dynamic PID control law based on motion characteristic values ensures system robustness against uncertainties, while a novel competency-based power distribution strategy uniquely constrains propeller thrust and lever arms to generate feasible control commands for each configuration. Experimental results demonstrate that our platform successfully overcomes stability challenges, maintaining positional deviation within 0.04 m during traversal through constrained spaces. The aircraft can reduce its footprint by up to 55.8%, sustain flight under single-propeller failure, and switch to a fault-tolerant configuration within 1.2 seconds, while exhibiting high-torque output capability sufficient for rotational operations. This work provides a comprehensive ontological platform that effectively bridges the technological gap between transformable reconfiguration and fault-tolerant control, enabling multi-scenario operational capabilities.
|
| |
| 09:00-10:30, Paper TuI1I.425 | Add to My Program |
| HEAPGrasp: Hand-Eye Active Perception to Grasp Objects with Diverse Optical Properties |
|
| Kennis, Ginga | Tokyo University of Science |
| Arai, Shogo | Tokyo University of Science |
Keywords: Grasping, Perception for Grasping and Manipulation
Abstract: Autonomous robotic handling requires accurate 3-D scene measurement followed by grasp planning. Conventional systems struggle with transparent or specular objects. Additionally, in hand–eye setups, moving through multiple viewpoints increases handling execution time. In this paper, we propose HEAPGrasp—Hand-Eye Active Perception to Grasp objects with diverse optical properties. To measure such objects, we focus on the ability to segment objects regardless of their optical properties in RGB images. We employ Shape from Silhouette based on the segmented images for 3-D measurement. To shorten the time required for multi-view capture with a hand-eye camera, we plan its trajectory using a cost function that balances 3-D measurement accuracy against its trajectory length. Real-robot experiments achieve a 94.3% grasp success rate on transparent, specular, and opaque objects, while reducing the hand-eye camera’s trajectory length by 52% and handling execution time by 18% relative to baselines that circle around the scene for 3-D measurement.
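Shape from Silhouette, the measurement principle behind HEAPGrasp, carves away every cell that falls outside any view's silhouette. As a generic, hedged toy (a 2-D grid with one orthographic view per axis, far simpler than the paper's multi-view 3-D setup):

```python
def carve(row_sil, col_sil):
    """Toy 2-D shape from silhouette: a cell survives only if it projects
    into the silhouette of every view (here, one view per axis)."""
    return [[1 if row_sil[r] and col_sil[c] else 0
             for c in range(len(col_sil))]
            for r in range(len(row_sil))]

# Object occupies rows 1-2 and columns 2-3 of a 4x4 workspace
grid = carve(row_sil=[0, 1, 1, 0], col_sil=[0, 0, 1, 1])
```

Because carving needs only binary silhouettes, it sidesteps the depth-sensing failures on transparent and specular surfaces that the abstract targets; the trade-off is that the result is the visual hull, an over-approximation of the true shape.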
|
| |
| 09:00-10:30, Paper TuI1I.426 | Add to My Program |
| Following to Talk: Effects of Mobile Robot Guidance on Dialogue Motivation in a Field Trial |
|
| Iwasaki, Masaya | Osaka University |
| Masuda, Kazuki | Osaka University |
| Sakai, Kazuki | Osaka University |
| Ban, Midori | Osaka University |
| Chi, Zihao | Osaka University |
| Kawata, Megumi | Osaka University |
| Meneses, Alexis | Pontificia Universidad Católica Del Perú |
| Ishiguro, Hiroshi | Osaka University |
| Yoshikawa, Yuichiro | Osaka University |
Keywords: Social HRI, Wheeled Robots, Service Robotics
Abstract: Since robots are often disregarded in public interactions, many studies have examined how nonverbal cues and dialogue strategies encourage users to initiate engagement. However, the impact of robot “movement” remains insufficiently investigated. This study examined the psychological effects of movement behavior on willingness to engage in dialogue in a scenario where a mobile guide robot leads people to a stationary robot. A field experiment in a shopping mall showed that guidance by a mobile robot significantly increased dialogue duration, whereas no correlation was found between moving distance and willingness to engage. These results suggest that physical commitment induced by guided movement may enhance user motivation, and that interaction designs leveraging movement behavior may be important for advancing the social implementation of interactive robots.
|
| |
| 09:00-10:30, Paper TuI1I.427 | Add to My Program |
| Toward Open-Source and Modular Space Systems with ATMOS (I) |
|
| Roque, Pedro | Caltech |
| Phodapol, Sujet | KTH Royal Institute of Technology |
| Krantz, Elias | KTH Royal Institute of Technology |
| Lim, Jaeyoung | University of California, Berkeley |
| Verhagen, Joris Petrus Martinus | KTH Royal Institute of Technology |
| Jiang, Frank J. | KTH Royal Institute of Technology in Stockholm |
| Dörner, David | KTH Royal Institute of Technology |
| Mao, Huina | KTH Royal Institute of Technology |
| Tibert, Gunnar | KTH Royal Institute of Technology |
| Siegwart, Roland | ETH Zurich |
| Stenius, Ivan | KTH |
| Tumova, Jana | KTH Royal Institute of Technology |
| Fuglesang, Christer | KTH Royal Institute of Technology |
| Dimarogonas, Dimos V. | KTH Royal Institute of Technology |
Keywords: Space Robotics and Automation, Hardware-Software Integration in Robotics, Software-Hardware Integration for Robot Systems
Abstract: In the near future, most deployed spacecraft will be autonomous. Their tasks will involve autonomous rendezvous and proximity operations (RPOs) with large structures, such as inspection, assembly, and maintenance of orbiting space stations, as well as human-assistance tasks over shared workspaces. Yet, testing these capabilities remains challenging since microgravity conditions are difficult to simulate on Earth. Free-flying platforms, which replicate microgravity environments through nearly frictionless planar motion, offer a way to test and experiment on these systems without being in orbit. To promote replicable and reliable scientific results for autonomous control of spacecraft, we present the design of a space robotics platform based on open-source and modular software and hardware—the autonomy testbed for multipurpose orbiting systems (ATMOS). ATMOS uses thrusters and air bearings to achieve near-frictionless motion, thereby simulating spacecraft dynamics in two dimensions and enabling realistic testing of navigation and control algorithms on Earth. The simulation software provides a software-in-the-loop architecture that seamlessly transfers simulated results to the hardware. Our results provide insight into the performance of such a system, including comparisons of hardware and software results, as well as control and planning methodologies for controlling free-flying platforms.
|
| |
| 09:00-10:30, Paper TuI1I.428 | Add to My Program |
| Offline-Trained GAN-Augmented Highly Adaptive Control with Multi-DoF Fusion for Pneumatic Soft Surgical Robots (I) |
|
| Lu, Yuxi | Tongji University |
| Zhou, Zhongchao | Graduate School of Science and Engineering, Chiba University |
| Zheng, Dongliang | Georgia Tech |
| Zhou, Yanmin | Tongji University |
| Wang, Zhipeng | Tongji University |
| Jiang, Shuo | Tongji University |
| Yu, Wenwei | Chiba University |
| He, Bin | Tongji University |
Keywords: Modeling, Control, and Learning for Soft Robots, Machine Learning for Robot Control, Soft Robot Applications
Abstract: Pneumatic soft robots are well-suited for minimally invasive surgery owing to their compliance and safe interaction with tissues. However, achieving highly adaptive control is difficult owing to modeling inaccuracies, inter-chamber coupling, and disturbances from surgical instruments. Non-learning adaptive methods depend on simplified models and perform poorly in unstructured settings. Conversely, learning-based methods often impose high computational costs in multi-degree-of-freedom (multi-DoF) pneumatic systems. A previous study proposed a generative adversarial network (GAN)-based proportional–integral–derivative (G-PID) controller that combined PID stability with learning-based adaptability by aligning system behavior with a reference model. However, its performance in highly coupled multi-DoF pneumatic soft robots was unverified, and its online adversarial training was computationally intensive. We addressed these limitations by developing an offline-trained G-PID controller, shifting adversarial training offline to reduce computational overhead, achieving 23-fold faster convergence, and enabling real-time, model-free control with balanced adaptability and efficiency. We evaluated three multi-DoF data fusion strategies, showing effective coordination of DoF coupling while maintaining individual control fidelity. Validation on a multi-DoF soft robotic mechatronic system for single-port transvesical prostatectomy revealed tip errors below 0.16 mm across surgical instrument
|
| |
| 09:00-10:30, Paper TuI1I.429 | Add to My Program |
| Unleashing Humanoid Reaching Potential Via Real-World-Ready Skill Space |
|
| Zhang, Zhikai | Tsinghua University |
| Chen, Chao | Peking University |
| Xue, Han | Tsinghua University |
| Wang, Jilong | Galaxy General Robot Co., Ltd |
| Liang, Sikai | Tongji University |
| Liu, Yun | Tsinghua University |
| Zhang, Zongzhang | Nanjing University |
| Wang, He | Peking University |
| Yi, Li | Tsinghua University |
Keywords: Humanoid Robot Systems, Whole-Body Motion Planning and Control, Legged Robots
Abstract: Humans possess a large reachable space in the 3D world, enabling interactions with objects at varying heights and distances. However, realizing such large-space reaching on humanoids is a complex whole-body control (WBC) problem. Learning from scratch often leads to optimization difficulty and poor sim2real transferability. To address these challenges, we present Real-world-Ready Skill Space (R2S2), a structural skill prior that enables efficient autonomous whole-body-control task execution while maintaining sim2real transferability. Inheriting knowledge from a set of real-world-ready primitive skills to ease multi-skill learning, R2S2 further expands the capability of the primitive skills and learns a unified structural skill representation. By sampling from R2S2, we unleash humanoid reaching potential in many real-world tasks. As a beneficial side effect, R2S2 can also support humanoid whole-body teleoperation with a large reachable space. We validate the generalizability of R2S2 in various challenging goal-reaching tasks across different robot platforms, in both simulation and the real world.
|
| |
| 09:00-10:30, Paper TuI1I.430 | Add to My Program |
| A Single Hydraulic Bellows-Based MRI-Safe Robotic Needle Driver Capable of Independent and Coupled Needle Translation and Rotation |
|
| Qiu, Yufu | The Chinese University of Hong Kong |
| Fang, Haiyang | The Chinese University of Hong Kong |
| Lin, Kwan Kit | The Chinese University of Hong Kong |
| Cheng, Shing Shin | The Chinese University of Hong Kong |
Keywords: Mechanism Design, Medical Robots and Systems, MRI-compatible robot, Hydraulic/Pneumatic Actuators
Abstract: Despite decades of research in magnetic resonance imaging (MRI)-compatible robotic technologies, the existing MRI-safe needle drivers rarely feature simultaneously high compactness, large insertion force, and motion versatility, all of which are critical to facilitate clinical translation in intraoperative MRI-guided percutaneous procedures. The paper presents an MR-safe needle driver that for the first time offers all these desired qualities. It measures only 2.2 × 5.3 × 3.8 cm (length × width × height), facilitating its adoption in in-bore skull-mounted or body-mounted MRI-guided procedures. It is driven by a single hydraulic bellows-based actuator, which provides good water sealing, smooth motion, and high expansion ratio, and a pre-clamped gripper design that offers large insertion force (>10 N). A compact passive rotation mechanism, together with a motion decoupling and switching mechanism, was introduced, allowing the needle to move with three motion types: independent translation, translation with passive rotation, and independent rotation. The passive rotation motion reduces needle deformation and tissue resistance during insertion, while the combination of independent tran
|
| |
| 09:00-10:30, Paper TuI1I.431 | Add to My Program |
| Active-Perceptive Language-Oriented Grasp Policy for Heavily Cluttered Scenes |
|
| Dai, Yixiang | Tsinghua University |
| Chen, Siang | Tsinghua University |
| Yang, Kaiqin | Tsinghua University |
| Hu, Dingchang | Tsinghua University |
| Xie, Pengwei | Tsinghua University |
| Li, Guosheng | Tsinghua University |
| Shen, Yuan | Tsinghua University |
| Wang, Guijin | Tsinghua University |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, RGB-D Perception
Abstract: Language-guided robotic grasping in cluttered environments presents significant challenges due to severe occlusions and complex scene structures, which often hinder accurate target localization. Existing approaches typically suffer from limited observational capabilities, resulting in suboptimal exploration of the target object. In this paper, we propose a novel Active-Perceptive Language-Oriented Grasp Policy (APeG) for heavily cluttered scenes. APeG develops an active perception scheme in the grasp pipeline via an occlusion-aware, semantic-guided viewpoint optimization strategy, enabling efficient exploration of cluttered scenes. In addition, a grasp-wise Reinforcement Learning (RL) policy is proposed to select robust grasp poses. Extensive real-world experiments validate the effectiveness of APeG, demonstrating significant improvements in both task success rate and operational efficiency over existing baselines, highlighting its potential for practical deployment in language-conditioned robotic manipulation.
|
| |
| 09:00-10:30, Paper TuI1I.432 | Add to My Program |
| Event-Triggered Indirect Herding Control of a Cooperative Agent |
|
| Amy, Patrick | University of Florida |
| Fallin, Brandon | University of Florida |
| Philor, Jhyv | University of Florida |
| Dixon, Warren | University of Florida |
Keywords: Multi-Robot Systems, Distributed Robot Systems
Abstract: This work explores the indirect herding control problem for a single pursuer agent regulating a single target agent to a goal location. To accommodate the constraints of sensing hardware, an event-triggered inter-agent influence model between the pursuer agent and target agent is considered. Motivated by fielded sensing systems, we present an event-triggered controller and trigger mechanism that satisfies a user-selected minimum inter-event time. The combined pursuer-target system is presented as a switched system that alternates between stable and unstable modes. A dwell-time analysis is completed to develop a closed-form solution for the maximum time the pursuer agent can allow the target agent to evolve in the unstable mode before requiring a control input update. The presented trigger function is designed to produce inter-event times that are upper-bounded by the maximum dwell time. The effectiveness of the proposed approach is demonstrated through both simulated and experimental studies, where a pursuer agent successfully regulates a target agent to a desired goal location.
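The dwell-time reasoning can be illustrated with the standard switched-system bound (a generic sketch, not the paper's closed-form result): if a Lyapunov-like function V decays at rate \(\lambda_s\) in the stable mode and grows at rate \(\lambda_u\) in the unstable mode, then over one stable/unstable cycle of durations \(\tau_s\) and \(\tau_u\),

```latex
V(t + \tau_s + \tau_u) \le e^{\lambda_u \tau_u - \lambda_s \tau_s}\, V(t)
\quad\Longrightarrow\quad
\tau_u < \frac{\lambda_s}{\lambda_u}\, \tau_s ,
```

so the cycle is contractive only if the unstable dwell time stays below this bound, which is the kind of maximum dwell time the paper derives in closed form for the pursuer-target dynamics.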
|
| |
| 09:00-10:30, Paper TuI1I.433 | Add to My Program |
| A&B-LO: Continuous-Time LiDAR Odometry with Adaptive Non-Uniform B-Spline Trajectory Representation |
|
| Lu, Yuchu | Tongji University |
| Yao, Chenpeng | Tongji University |
| Du, Jiayuan | Tongji University |
| Liu, Chengju | Tongji University |
| Chen, Qijun | Tongji University |
Keywords: SLAM, Range Sensing, Localization
Abstract: LiDAR odometry, often fused with inertial measurement units (IMUs), is an essential task in robot navigation. Unlike mainstream methods, which compensate for the motion distortion of LiDAR data using high-frequency inertial sensors, this paper handles the distortion with a continuous-time trajectory representation and achieves competitive performance against the state of the art. We propose a compact LiDAR odometry framework with an adaptive non-uniform B-spline trajectory representation that formulates odometry as a continuous-time estimation problem. We deploy point-to-plane registration and pseudo-velocity smoothing constraints to fully utilize the geometric and kinematic information of the odometry. For faster convergence of the optimization, the analytical Jacobians of the constraints are derived to solve the non-linear least-squares minimization. For a more efficient B-spline representation, an adaptive knot-spacing technique is proposed to adjust the time interval between the control poses of the spline. Extensive experiments on public and real-world datasets demonstrate the validity and efficiency of our system compared with other LiDAR and LiDAR-inertial methods.
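For intuition, evaluating a continuous-time translation trajectory from control poses looks as follows for the uniform cubic B-spline case; the paper's contribution is the adaptive, non-uniform knot spacing, which this fixed-interval sketch does not model:

```python
import numpy as np

# Standard uniform cubic B-spline basis matrix (rows weight P0..P3).
M = (1.0 / 6.0) * np.array([
    [ 1,  4,  1, 0],
    [-3,  0,  3, 0],
    [ 3, -6,  3, 0],
    [-1,  3, -3, 1],
], dtype=float)

def spline_position(ctrl, t, dt):
    """Evaluate the translation trajectory at time t from control
    positions `ctrl` spaced `dt` apart (uniform sketch; names ours)."""
    i = int(t / dt)                    # segment index
    u = t / dt - i                     # normalized time in [0, 1)
    U = np.array([1.0, u, u * u, u ** 3])
    P = ctrl[i:i + 4]                  # four control positions
    return U @ M @ P
```

Any query time inside a segment yields a pose, which is what lets each LiDAR point be undistorted at its own timestamp without an IMU.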
|
| |
| 09:00-10:30, Paper TuI1I.434 | Add to My Program |
| Haptics of Pulse Palpation: Simulation and Validation through Novel Sensor-Actuator System (I) |
|
| Subudhi, Debadutta | Indian Institute of Technology Madras, India |
| M, Manivanna | IIT Madras |
| Deepak, K K | IIT Delhi |
|
|
| |
| 09:00-10:30, Paper TuI1I.435 | Add to My Program |
| Perceived Intensity of Pneumatic Vibrotactile Stimuli: Effects of Pressure, Frequency, and Stiffness (I) |
|
| Kommuri, Krishna Dheeraj | Eindhoven University of Technology |
| van Beek, Femke | Technical University of Eindhoven |
| Kuling, Irene | Eindhoven University of Technology |
Keywords: Haptics and Haptic Interfaces, Soft Sensors and Actuators
Abstract: Vibrotactile actuators are used in many different haptic devices, e.g., game controllers. These vibrotactile actuators are typically made of rigid materials. In this paper, we use soft pneumatic actuators known as Pneumatic Unit Cells (PUCs) to characterize the perceived intensity of vibrotactile stimuli presented at the tip of the index finger. This study investigates how three parameters—stimulus pressure (4 to 30 kPa), inflation-deflation frequency (20 to 100 Hz), and actuator stiffness (determined by top-layer thicknesses of 0.9 mm and 1.2 mm)—influence the perceived intensity of the stimuli. Psychophysical experiments involving 16 participants were conducted using the AEPsych toolbox. These reveal that all three parameters (pressure, frequency, and actuator stiffness) significantly affect perceived intensity. The findings indicate that both pressure and frequency exhibit a positive main effect and a positive interaction effect on perceived vibrotactile intensity. Additionally, the results show that, for a given frequency, pressure variations produce more perceptually distinct stimuli than frequency variations for a given pressure. Finally, vibrotactile stimuli presented on a less stiff PUC actuator were perceived as less intense than the same stimuli presented on a stiffer PUC actuator. Overall, this study provides key insights into the combined influence of pressure, frequency, and actuator stiffness on perceived vibrotactile intensity.
|
| |
| 09:00-10:30, Paper TuI1I.436 | Add to My Program |
| Targetless LiDAR-Camera Calibration with Neural Gaussian Splatting |
|
| Jung, Haebeom | Seoul National University |
| Kim, Namtae | Seoul National University |
| Kim, Jungwoo | Yonsei University |
| Park, Jaesik | Seoul National University |
Keywords: Sensor Fusion, Calibration and Identification, Computer Vision for Transportation
Abstract: Accurate LiDAR-camera calibration is crucial for multi-sensor systems. However, traditional methods often rely on physical targets, which are impractical for real-world deployment. Moreover, even carefully calibrated extrinsics can degrade over time due to sensor drift or external disturbances, necessitating periodic recalibration. To address these challenges, we present a Targetless LiDAR–Camera Calibration (TLC-Calib) that jointly optimizes sensor poses with a neural Gaussian–based scene representation. Reliable LiDAR points are frozen as anchor Gaussians to preserve global structure, while auxiliary Gaussians prevent local overfitting under noisy initialization. Our fully differentiable pipeline with photometric and geometric regularization achieves robust and generalizable calibration, consistently outperforming existing targetless methods on the KITTI-360, Waymo, and Fast-LIVO2 datasets. In addition, it yields more consistent Novel View Synthesis results, reflecting improved extrinsic alignment. The project page is available at: https://www.haebeom.com/tlc-calib-site/.
|
| |
| TuI1LB Late Breaking Results Session, Hall C |
Add to My Program |
| Late Breaking Results 1 |
|
| |
| |
| 09:00-10:30, Paper TuI1LB.1 | Add to My Program |
| Vision-Guided Robotic Grinding with Deep Learning-Based Bead Segmentation and Digital Twin Verification |
|
| Kim, Seong Hyeon | Korea Institute of Industrial Technology |
| Kim, Hyo-Young | Tech University of Korea (TUK) |
Keywords: Computer Vision for Manufacturing, Industrial Robots, RGB-D Perception
Abstract: Weld bead grinding is a critical post-processing step in metal fabrication, yet conventional robotic grinding based on teach-pendant programming lacks adaptability to variations in bead geometry and position. This paper presents a vision-guided robotic grinding system that combines deep learning-based weld bead segmentation, automated grinding path generation, and digital twin-based pre-verification. A U-Net model with a ResNet34 encoder and ImageNet pre-training segments weld bead regions from RGB images captured by an Intel RealSense D415 camera mounted on a Staubli RX160 manipulator, achieving a mean Intersection over Union (IoU) of 0.9311 and a Dice coefficient of 0.9641. The segmented bead contours are transformed into the robot coordinate frame through hand-eye calibration and forward kinematics, enabling automated generation of grinding waypoints along the bead centerline. The CHOMP algorithm plans collision-free trajectories within MoveIt, and all planned motions are validated in a digital twin environment built on NVIDIA Isaac Sim 5.0, integrated with ROS through a distributed multi-container architecture. Experimental results demonstrate that the proposed system effectively generates adaptive grinding paths for varying weld bead geometries and verifies them in simulation before physical deployment.
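The step from segmented bead contour to robot waypoints chains the camera intrinsics with the hand-eye extrinsics. A minimal sketch under a standard pinhole model (function name and arguments are illustrative, not the authors' code):

```python
import numpy as np

def pixel_to_robot(u, v, depth, K, T_base_cam):
    """Back-project a segmented bead pixel (u, v) with measured depth
    (meters) into the robot base frame.

    K:          3x3 camera intrinsic matrix
    T_base_cam: 4x4 camera-to-base transform from hand-eye calibration
                composed with forward kinematics
    """
    # Ray through the pixel, scaled by depth -> point in camera frame.
    xyz_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Homogeneous transform into the robot base frame.
    return (T_base_cam @ np.append(xyz_cam, 1.0))[:3]
```

Applying this per contour pixel and resampling along the bead centerline yields the grinding waypoints that the CHOMP planner then connects.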
|
| |
| 09:00-10:30, Paper TuI1LB.2 | Add to My Program |
| Learning Contact Tasks Skills Based on DMP and Affordance Templates |
|
| Lee, Hunjo | Korea University of Science and Technology, Korea Institute of Industrial Technology |
| Yang, Gi-Hun | KITECH |
Keywords: Imitation Learning, Telerobotics and Teleoperation, Motion Control
Abstract: Learning from demonstration (LfD) enables robots to acquire experts' skills from human demonstrations. Recently, LfD has been developed for learning and performing skills in contact-rich tasks. However, task performance has not been generalized to unknown poses in contact-rich tasks. In this paper, we propose a teleoperation-based LfD framework for performing contact-rich tasks in unknown poses. Expert demonstrations are collected via a bilateral teleoperation system, with an orientation synchronization algorithm aiding intuitive manipulation. From the demonstrations, position and wrench profiles are recorded. Task trajectories are learned using dynamic movement primitives (DMPs), while strategy learning allocates input and compliance spaces based on affordance templates to adapt motion during contact. By combining trajectory and strategy learning, the framework successfully reproduces manipulation behaviors in novel configurations. Experiments on valve-turning and peg-in-hole insertion validate the method, showing improved success rates and robustness to pose variations.
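A one-dimensional discrete DMP in the standard Ijspeert form can be rolled out in a few lines; all gains and basis parameters below are illustrative defaults, not values from the paper:

```python
import numpy as np

def dmp_rollout(y0, g, weights, centers, widths,
                tau=1.0, dt=0.001, alpha=25.0, beta=6.25, alpha_x=8.0):
    """Integrate a 1-D discrete dynamic movement primitive from start
    y0 to goal g. With zero weights it is a critically damped spring;
    learned weights shape the trajectory via the forcing term."""
    y, dy, x = float(y0), 0.0, 1.0
    traj = [y]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)          # RBF basis
        forcing = (psi @ weights) / (psi.sum() + 1e-10) * x * (g - y0)
        ddy = (alpha * (beta * (g - y) - tau * dy) + forcing) / tau ** 2
        dy += ddy * dt
        y += dy * dt
        x += -alpha_x * x / tau * dt                        # canonical system
        traj.append(y)
    return np.array(traj)
```

Because the goal g enters the dynamics explicitly, the same learned weights generalize to shifted start and goal poses, which is what makes DMPs attractive for the unknown-pose setting above.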
|
| |
| 09:00-10:30, Paper TuI1LB.3 | Add to My Program |
| Learning Traversability Cost Maps with Decomposed Uncertainties Via Continuous-State MEDIRL |
|
| Song, Gwanhyeong | Seoul National University |
| Lee, Dongjae | Seoul National University |
| Kim, Ayoung | Seoul National University |
Keywords: Reinforcement Learning, Imitation Learning, Field Robots
Abstract: Accurate traversability assessment is critical for mobile robot motion planning, yet sensor occlusions and model limitations often compromise cost map reliability. Therefore, analyzing spatial uncertainty is essential for robust risk management. We propose a novel Maximum Entropy Deep Inverse Reinforcement Learning (MEDIRL) framework that learns a traversability cost map while explicitly disentangling aleatoric and epistemic uncertainties. Aleatoric uncertainty is captured via latent sampling in a Conditional Variational Autoencoder, while epistemic uncertainty is estimated using a decoder ensemble. For kinematic fidelity, we introduce efficient continuous-state rollouts utilizing precomputed transition grids and bilinear interpolation. Fusing camera and LiDAR features, our model achieves stable convergence guided by a novel margin loss. Results demonstrate that learned state visitation frequencies match expert trajectories, and the decomposed uncertainties effectively identify high-risk terrains, providing a crucial foundation for safer autonomous navigation.
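The aleatoric/epistemic split over a decoder ensemble, each drawing several latent samples from the CVAE, follows the usual law-of-total-variance decomposition. A hypothetical numpy sketch (the paper's exact estimator may differ):

```python
import numpy as np

def decompose_uncertainty(cost_samples):
    """Split predictive uncertainty of an ensemble of cost-map decoders.

    cost_samples: (E, S, H, W) array — E ensemble decoders, each with
                  S latent samples of the per-cell traversability cost.
    Aleatoric: average over decoders of the variance across latent
               samples (data noise the model cannot reduce).
    Epistemic: variance across the decoder means (model disagreement).
    """
    aleatoric = cost_samples.var(axis=1).mean(axis=0)   # (H, W)
    epistemic = cost_samples.mean(axis=1).var(axis=0)   # (H, W)
    return aleatoric, epistemic
```

High epistemic cells mark terrain the model has simply not seen, a natural trigger for conservative planning.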
|
| |
| 09:00-10:30, Paper TuI1LB.4 | Add to My Program |
| A Wire-Driven Robotic Hand with Mode-Switchable Planetary Transmission for Dynamic Manipulation |
|
| Choi, Jeongseok | Hanyang University |
| Lee, Minsu | Hanyang University |
| Shin, WooSeong | Hanyang University |
| Seo, TaeWon | Hanyang University |
Keywords: Multifingered Hands, In-Hand Manipulation, Grippers and Other End-Effectors
Abstract: Dynamic hand manipulation requires both precise motion control and rapid actuation capability, yet most existing robotic hands are primarily optimized for dexterity and accuracy, often sacrificing speed performance. To address this limitation, this work presents a mode-switchable wire-driven robotic hand incorporating a planetary transmission mechanism capable of continuously varying output speed and torque characteristics. The proposed system operates in two distinct modes: a torque-enhancing mode for stable and precise grasping, and a speed-amplifying mode for agile dynamic motions such as flicking and throwing. A dedicated mechanical switching mechanism enables real-time transition between the two transmission modes according to task requirements. A full prototype of the five-finger robotic hand is currently under fabrication and system integration, and preliminary analytical results demonstrate the feasibility of both precise grasping and enhanced high-speed manipulation capability. These results validate the proposed transmission architecture as a promising solution for robotic hands requiring both dexterous and dynamic manipulation.
|
| |
| 09:00-10:30, Paper TuI1LB.5 | Add to My Program |
| Stable Worker Intention Recognition Via Transformer and CRF-Ontology Decoding for Human–Robot Collaboration |
|
| Park, Hwijin | Kyungpook National University |
| Kwon, HyunBin | Kyungpook National University |
| Lee, Hyundo | Kyungpook National University |
| Park, Cheol Woo | KNU |
| Yi, Hak | Kyungpook National University |
Keywords: Intention Recognition, Multi-Modal Perception for HRI, Human-Robot Collaboration
Abstract: This paper proposes a transformer-based single-stream model with CRF–ontology decoding for stable worker intention recognition in human–robot collaboration (HRC). Although existing intention recognition methods achieve high accuracy, they often suffer from temporal prediction instability and logically inconsistent combinations among actions, tools, parts, and intentions. To address these issues, the proposed approach employs a transformer encoder to integrate worker actions and part-related information, thereby capturing the task context and jointly predicting actions, tools, parts, and intentions. For intention prediction, a conditional random field (CRF) is applied to enforce temporal consistency and improve prediction stability. In addition, an ontology-based post-processing step removes infeasible combinations under a given task intention and reselects predictions that satisfy structural constraints. Experimental results show that the CRF reduces the intention change rate from 7.9% to 3.0%, improving temporal stability, while ontology-based decoding decreases the violation rate from 26.5% to 6.9% by eliminating inconsistent predictions. When combined, the proposed method achieves both a low change rate (3.0%) and a low violation rate (3.7%), demonstrating its effectiveness for reliable intention recognition in HRC.
|
| |
| 09:00-10:30, Paper TuI1LB.6 | Add to My Program |
| Agile Collision Avoidance for Deformable-Tethered Multi-Robot Systems Via Zone-Aware Hierarchical Learning and VLM-Guided Control |
|
| Zhou, Zeyu | The Hong Kong Polytechnic University |
| Zhang, Jingwei | University of Chinese Academy of Science |
| Zhi, Hui | The Hong Kong Polytechnic University |
| Hao, Yun | The Hong Kong Polytechnic University |
| Tang, Wei | Northwestern Polytechnical University |
| Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Reinforcement Learning, Flexible Robotics, Hybrid Logical/Dynamical Planning and Verification
Abstract: Navigating Linked Multi-Component Robotic Systems (L-MCRS)---robot pairs tethered by passive flexible hoses---through dynamic pedestrian environments is fundamentally harder than rigid multi-robot coordination, as the uncontrollable hose creates a variable-geometry collision footprint spanning 118 pairwise combinations. We propose H-SEPID, unifying zone-aware Hierarchical Reinforcement Learning grounded in Kinematic Flow Theory with VLM-guided cascaded optimization. A phase-aware dual attention value network performs C0-continuous topological policy switching, while a Vision-Language Model infers strategic intent and quantifies action-space constraints governing hose geometry. A seven-category safety shield with ORCA fallback and a threading reward band produce emergent gap-threading maneuvers. H-SEPID achieves a 94% success rate and a 4% collision rate in an 8-robot, 5-pedestrian, 4-hose scenario, outperforming five baselines, and is validated on real e-puck2 robots across 12 configurations.
|
| |
| 09:00-10:30, Paper TuI1LB.7 | Add to My Program |
| Rapid Robot Manipulation Policy Learning Via Hierarchical Foundation-Model Prior Distillation |
|
| Dong, Qingwei | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Zhang, Jiyuan | University of Chinese Academy of Sciences |
| Wan, Guangxi | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Liu, Ruikai | University of Chinese Academy of Sciences |
| Zeng, Peng | Shenyang Institute of Automation Chinese Academy of Sciences |
Keywords: Reinforcement Learning, Deep Learning Methods, Learning from Experience
Abstract: In robotic skill acquisition, rapid policy learning remains challenging due to high-dimensional state-action spaces and inefficient exploration in the early stage of training. Although the pre-trained OpenVLA model exhibits cross-task generalization and can generate goal-directed actions for unseen tasks under suitable prompts, its direct application to novel manipulation tasks remains limited, while full fine-tuning is computationally expensive. To address this issue, we propose a hierarchical framework that combines OpenVLA with reinforcement learning for efficient skill acquisition. Specifically, OpenVLA is used to generate diverse task-related prior trajectories through prompt engineering, and reinforcement learning leverages these priors to fit local dynamics and constrain policy exploration. In this way, the proposed method improves adaptation efficiency and accelerates policy learning on new tasks. We evaluate the framework on multiple manipulation tasks in the LIBERO environment.
|
| |
| 09:00-10:30, Paper TuI1LB.8 | Add to My Program |
| Uncertainty-Aware Stereo Grasp Point Selection for Deformable Linear Objects |
|
| Saccani, Cristina | University of Bologna, DEI |
| Caporali, Alessio | University of Bologna |
| Palli, Gianluca | University of Bologna |
Keywords: Planning under Uncertainty, Perception for Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: Reliable grasp point selection on deformable linear objects, such as cables, requires not only accurate depth estimation but also awareness of prediction reliability. We present a five-stage stereo network for joint disparity, semantic, and uncertainty estimation, and use the predicted uncertainty to filter grasp candidates before geometric ranking. Disparity uncertainty is modeled via a Laplace negative log-likelihood, semantic uncertainty via the entropy of semantic predictions, with an alignment term enforcing consistency between them. Experiments on a synthetic stereo dataset show that uncertainty-aware selection reduces the mean grasp-point depth error from 4.19 mm to 1.55 mm, increases the success rate within a 3 mm tolerance from 74.2% to 88.6%, and lowers the 90th percentile of the failure exceedance above 3 mm from 29.47 mm to 6.77 mm. These results show that uncertainty is an effective cue for safer grasp selection on deformable linear objects.
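The two uncertainty heads can be written compactly: a Laplace negative log-likelihood with a predicted log-scale for disparity, and the entropy of the class probabilities for semantics. A numpy sketch with illustrative names (the network-side details are the paper's):

```python
import numpy as np

def laplace_nll(pred, target, log_b):
    """Laplace negative log-likelihood for disparity with a predicted
    log-scale log_b (the aleatoric-uncertainty head). Predicting the
    log of the scale keeps the loss numerically stable."""
    return np.mean(np.abs(pred - target) * np.exp(-log_b)
                   + log_b + np.log(2.0))

def semantic_entropy(probs, eps=1e-8):
    """Entropy of per-pixel class probabilities (semantic uncertainty);
    high entropy marks pixels where the segmentation is unsure."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)
```

Thresholding these two maps before geometric ranking is what filters out grasp candidates whose depth or semantics cannot be trusted.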
|
| |
| 09:00-10:30, Paper TuI1LB.9 | Add to My Program |
| Learning What Matters: Task Tailored Dynamics Models through Differentiable MPC |
|
| Węgrzynowski, Jan | Poznan University of Technology |
| Kicki, Piotr | Poznan University of Technology |
| Czechmanowski, Grzegorz | Poznan University of Technology, IDEAS Research Institute, IDEAS NCBR |
| Walas, Krzysztof, Tadeusz | Poznan University of Technology |
Keywords: Optimization and Optimal Control, Machine Learning for Robot Control, Calibration and Identification
Abstract: In model-based control, dynamics models are typically trained by minimizing open-loop prediction errors uniformly across all states. However, due to finite model capacity, this misallocates representational power, as not all prediction errors impact the downstream closed-loop performance equally. In this extended abstract, we propose a task-aware training methodology for a prediction model used in the context of Model Predictive Control (MPC). By extracting analytical sensitivities via differentiable MPC, we construct a loss function that weights multi-step dynamics model prediction errors based on their impact on the closed-loop task cost. Experimental results on a simulated 7DoF manipulator demonstrate that our sensitivity-weighted loss significantly improves closed-loop tracking performance compared to standard Mean Squared Error (MSE) or variance-based state standardization.
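The core idea, weighting multi-step prediction errors by closed-loop cost sensitivities extracted from the differentiable MPC, reduces to a weighted mean squared error. An illustrative sketch in which the sensitivity extraction itself (the paper's machinery) is assumed given:

```python
import numpy as np

def sensitivity_weighted_loss(pred_states, true_states, sensitivities):
    """Task-aware dynamics training loss.

    pred_states, true_states: (T, n) predicted vs. ground-truth rollout
    sensitivities:            (T, n) magnitudes of d(task cost)/d(state)
                              obtained by differentiating the MPC solver
    States that barely influence the closed-loop cost get little weight,
    so model capacity is spent where it matters."""
    err = (pred_states - true_states) ** 2
    return np.mean(sensitivities * err)
```

With unit sensitivities this degenerates to the standard MSE baseline the abstract compares against.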
|
| |
| 09:00-10:30, Paper TuI1LB.10 | Add to My Program |
| Intent Recognition in Gait Transition Using Muscle Volume Sensors with Deep Learning |
|
| Park, Geonwoo | Kyungpook National University |
| Woo, MooJin | Kyungpook National University |
| Oh, Jiwoo | Kyungpook National University |
| Chei, Hyeon Chan | Kyungpook National University |
| Oh, Keonyoung | Kyungpook National University |
Keywords: Intention Recognition, Wearable Robotics, Deep Learning Methods
Abstract: Intention recognition is essential for wearable robotics and assistive systems. However, conventional approaches often suffer from cumbersome sensor setups or sensitivity to external disturbances. To address these limitations, this study proposes an LSTM-based intention recognition method using lower-limb Muscle-Volume (MV) sensors. An insole-type pressure sensor, an IMU sensor, and a cuff-type MV sensor were used to record a series of motions, including sitting, standing, walking, and running. Deep learning techniques were then applied for classification and transition detection. Accuracies of the predicted movement states based on data from the IMU, insole-type pressure, and cuff-type MV sensors were 93.04%, 97.65%, and 93.08%, respectively. The average transition detection latencies for the IMU, insole, and MV sensor model were 0.135 s, 0.377 s, and 0.455 s, respectively. Results show that the proposed MV sensor achieves performance comparable to insole pressure sensors, demonstrating its potential as a practical and robust alternative for intention recognition in wearable systems.
|
| |
| 09:00-10:30, Paper TuI1LB.11 | Add to My Program |
| Suppressing Initial Force Overshoot Using Admittance Filter and ASMC under Contact Location Uncertainty |
|
| Oh, Sejik | Yeungnam University |
| Kim, Yi Gyeom | Yeungnam University |
| Jo, Hyojin | Yeungnam University |
| Park, Dogyun | Yeungnam University |
| Kwon, Nam Kyu | Yeungnam University |
Keywords: Force Control, Compliance and Impedance Control, Robust/Adaptive Control
Abstract: This paper proposes a control method to mitigate initial force overshoot caused by contact surface position estimation errors. The proposed method adds a compensation term based on Adaptive Sliding Mode Control (ASMC) to a conventional admittance control structure. In addition, to maintain position tracking performance under model uncertainty, a hierarchical control structure is designed by incorporating a Time-Delay Control (TDC)-based internal position controller. Simulation results show that the proposed method reduces the maximum overshoot from 76.6% to 32.1% compared to conventional admittance control, and shortens the peak time from 0.254 s to 0.126 s. Furthermore, the settling time is reduced from 3.51 s to 1.467 s at the 2% criterion and from 2.113 s to 0.832 s at the 5% criterion, improving transient response stability and convergence speed.
|
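The admittance-filter part of the structure above is a standard second-order law, M·ẍ + D·ẋ + K·x = f_ext − f_ref; a discrete-time sketch (the scalar setting and all gain values are illustrative, not the paper's) is:

```python
import numpy as np

def admittance_step(x, xd, f_ext, f_ref, M=1.0, D=20.0, K=100.0, dt=0.001):
    """One semi-implicit Euler step of M*xdd + D*xd + K*x = f_ext - f_ref.
    Returns the updated position offset and velocity of the compliant
    reference trajectory."""
    xdd = (f_ext - f_ref - D * xd - K * x) / M
    xd_new = xd + dt * xdd
    x_new = x + dt * xd_new
    return x_new, xd_new

# A sustained contact force pushes the compliant reference toward the
# steady-state offset (f_ext - f_ref) / K = 0.1.
x, xd = 0.0, 0.0
for _ in range(20000):
    x, xd = admittance_step(x, xd, f_ext=10.0, f_ref=0.0)
```

Lower stiffness K (or higher damping D) softens the transient, which is the knob the ASMC compensation term effectively adapts.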
| |
| 09:00-10:30, Paper TuI1LB.12 | Add to My Program |
| Reinforcement Learning for Stair Locomotion of a Wheeled Bipedal Robot with Contact-Guided Behavior Cloning |
|
| Kim, Yi Gyeom | Yeungnam University |
| Oh, Sejik | Yeungnam University |
| Jo, Hyojin | Yeungnam University |
| Park, Dogyun | Yeungnam University |
| Kwon, Nam Kyu | Yeungnam University |
Keywords: Reinforcement Learning, Field Robots
Abstract: This paper proposes a contact event-guided PPO with Behavior Cloning (PPO-BC) framework for stair locomotion of a 2-wheel 2-leg (2W2L) wheeled bipedal robot. Stair traversal is difficult because successful climbing depends on brief and sparse wheel-stair contact events that require precise leg lifting and posture stabilization. To address this issue, the proposed method trains a student policy using a combined objective of PPO-based reinforcement learning and behavior cloning from a pretrained frozen teacher policy. The teacher learns leg-centered climbing behaviors, while the student learns full 8-DoF control. A soft contact gate detects stair interaction directly from wheel contact forces and increases the BC contribution during critical contact phases without external terrain sensors. The method is validated under a minimal reward structure based on velocity tracking and postural stability, without stair-specific shaping rewards. Experiments in Isaac Lab simulation show that the proposed method outperforms both pure PPO and uniform PPO-BC in stair-crossing performance while maintaining stable locomotion after traversal.
|
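The soft contact gate and the PPO-BC blend described above can be sketched as follows (the threshold, sharpness, and weight values are made-up placeholders, not values from the paper):

```python
import numpy as np

def soft_contact_gate(contact_force, threshold=5.0, sharpness=1.0):
    """Sigmoid gate in [0, 1] that opens when the wheel contact force
    indicates a stair-interaction event."""
    return 1.0 / (1.0 + np.exp(-sharpness * (contact_force - threshold)))

def combined_loss(ppo_loss, bc_loss, gate, bc_weight=1.0):
    """Blend the RL objective with behavior cloning from the frozen
    teacher; the BC term is up-weighted during contact-critical phases."""
    return ppo_loss + bc_weight * gate * bc_loss

# Away from the stair the gate is nearly closed...
g_free = soft_contact_gate(contact_force=0.0)
# ...and nearly open during a contact event.
g_contact = soft_contact_gate(contact_force=10.0)
```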
| |
| 09:00-10:30, Paper TuI1LB.14 | Add to My Program |
| Distributed AI for Robotics |
|
| Singh, Satyabhama | Fraunhofer Institute for Microelectronic Circuits and Systems |
| Wulfert, Lars | Fraunhofer Institute for Microelectronic Circuits and Systems IMS |
| Woehrle, Hendrik | Fraunhofer Institute for Microelectronic Circuits and Systems |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, AI-Based Methods
Abstract: Robot learning primarily relies on centralized training. While it provides the infrastructure, centralization limits parallel and collaborative learning among robots and places a significant computational load on the central server, indicating the need for federated learning (FL) in the context of multi-robot training. However, robots trained in a federated setup are subject to non-independent and identically distributed (non-IID) data, resulting in degraded model performance. This extended abstract presents the current state of research aimed at improving robot learning under non-IID conditions in FL. In this regard, this work provides an initial comparative analysis of robot learning methods in centralized and federated training setups, with an emphasis on the impact of non-IID data on learning behaviour in a simulation environment. The results highlight the differences in learning stability across algorithms and present the influence of non-IID goal distributions on performance.
|
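The abstract above does not specify the aggregation rule; assuming the common FedAvg baseline, the server-side step of a federated round looks roughly like:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated averaging: aggregate client model parameters, weighted
    by each client's local dataset size."""
    total = sum(client_sizes)
    agg = np.zeros_like(client_weights[0])
    for w, n in zip(client_weights, client_sizes):
        agg += (n / total) * w
    return agg

# Two robots with different amounts of local experience (toy parameters).
w_a = np.array([1.0, 1.0])
w_b = np.array([3.0, 3.0])
global_w = fedavg([w_a, w_b], client_sizes=[1, 3])
```

Under non-IID goal distributions, the per-client updates pull in different directions, which is exactly the degradation the abstract studies.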
| |
| 09:00-10:30, Paper TuI1LB.15 | Add to My Program |
| Exploring History-Aware Online Actor-Critic for Smart Manufacturing Tasks in the RICAIP Testbed |
|
| Horelican, Tomas | Brno University of Technology |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Autonomous Agents
Abstract: As manufacturing capabilities advance to greater autonomy, interest is increasingly directed toward versatile agents capable of performing complex tasks. Recently, learning-based approaches have shown more rapid progress compared to classical methods. While these advancements are enabled by the offline setting of Imitation Learning (IL), transfer to pure online exploration Reinforcement Learning (RL) remains less explored. This work experiments with a simple extension to the standard Markovian MLP policy by explicitly encoding a history of states using a tiny transformer model.
|
| |
| 09:00-10:30, Paper TuI1LB.16 | Add to My Program |
| Development of a Testbed and Quantitative Evaluation Framework for Characterization of Robot Actuator Dynamics |
|
| Kim, Deokgyu | Yeungnam University |
| Kangwagye, Samuel | Aalborg University |
| Heo, Young Jin | POSTECH |
| Oh, Sehoon | DGIST |
| Lee, Chan | Yeungnam University |
Keywords: Performance Evaluation and Benchmarking, Actuation and Joint Mechanisms, Calibration and Identification
Abstract: The performance of robot actuators is still primarily evaluated using manufacturer-provided static specifications such as maximum torque and rated speed. However, these metrics are insufficient for assessing dynamic behaviors that are essential for physical interaction, including backdrivability, transparency, and disturbance response. This paper presents HYPERDYNE, a novel proof-of-concept test platform and evaluation framework for dynamic characterization and quantitative benchmarking of robot actuators. A reconfigurable testbed is developed, enabling three test configurations (no-load, fixed-load, and interaction) within a single hardware setup. In addition, an evaluation protocol is proposed that includes system identification, control performance, load robustness, and disturbance rejection. Experimental validation on a QDD actuator demonstrates that the proposed framework enables the extraction of key dynamic parameters such as backlash, friction, inertia, and frequency response characteristics, while also providing performance indices for objective comparison. The results show that actuator performance can be quantitatively assessed beyond conventional static specifications, supporting the development of robots with improved physical interaction capabilities.
|
| |
| 09:00-10:30, Paper TuI1LB.17 | Add to My Program |
| ManiMorph: Object Representations in Robot Manipulators Morphology for Improving Multi-Task Manipulation Performance |
|
| Abdalla, Ali | African Institute for Mathematical Sciences in South Africa (AIMS SA) |
| Przystupa, Michael | University of Alberta |
| Zu, Xinrui | Vrije University Amsterdam |
| Luck, Kevin Sebastian | Vrije Universiteit Amsterdam |
| Berseth, Glen | Université De Montréal |
Keywords: Reinforcement Learning, Deep Learning in Grasping and Manipulation, Deep Learning Methods
Abstract: Robot manipulation tasks involve direct interactions with objects, which can be viewed as dynamic changes to the robot’s kinematic chain. Morphology-aware learning frameworks, in which robot embodiment is explicitly modeled, do not account for these object-induced changes in their architectures. We address this gap by proposing ManiMorph, a multi-task, morphology-aware manipulation-learning framework in which object features are integrated into the robot’s morphological graph. We demonstrate that this node-centric representation, combined with a Feature-wise Linear Modulation (FiLM) task component, enhances the performance of the morphology-aware frameworks for robotic manipulation and generalizes effectively to new object variations.
|
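FiLM itself, as used above for the task component, is a one-line channel-wise operation; a sketch with hand-picked (purely illustrative) modulation parameters:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each feature
    channel with task-conditioned parameters gamma and beta."""
    return gamma * features + beta

# A task embedding (here hand-picked) modulates a feature vector
# channel-wise; in practice gamma/beta come from a small conditioning net.
feats = np.array([1.0, -2.0, 0.5])
gamma = np.array([2.0, 0.0, 1.0])   # channel 1 is gated off entirely
beta = np.array([0.0, 1.0, -0.5])
out = film(feats, gamma, beta)
```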
| |
| 09:00-10:30, Paper TuI1LB.18 | Add to My Program |
| Beyond Reactive Adaptation: Long-Horizon Memory for Autonomous Racing Via State Space Models |
|
| Czechmanowski, Grzegorz | Poznan University of Technology, IDEAS Research Institute, IDEAS NCBR |
| Węgrzynowski, Jan | Poznan University of Technology |
| Kicki, Piotr | Poznan University of Technology |
| Walas, Krzysztof, Tadeusz | Poznan University of Technology |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Autonomous Vehicle Navigation
Abstract: Autonomous racing pushes vehicles to their physical limits, requiring control policies that can rapidly adapt to localized changes in track conditions, such as varying surface friction. Current Reinforcement Learning (RL) approaches rely either on ground-truth system identification, which is impractical in the real world, or short-horizon reactive adaptations (e.g., Rapid Motor Adaptation (RMA)) that cannot remember spatial disturbances across multiple laps. In this extended abstract, we propose a novel RL architecture based on Mamba, a structured State Space Model (SSM), for autonomous racing. By fusing vehicle state with Fourier features of vehicle position on the racetrack, our Mamba-based policy builds a long-horizon episodic memory. This allows the policy not only to adapt to unknown friction online but also to map and memorize slippery zones for future laps. Evaluated in a simulated F1Tenth environment, our approach demonstrates continuous lap-to-lap improvement, approaching the performance of an "oracle" policy trained on exact ground-truth friction, whereas standard Multi-Layer Perceptron (MLP) and Recurrent Neural Network (RNN) baselines plateau at inferior performance levels.
|
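The Fourier encoding of track position mentioned above can be sketched as follows (the exact frequency schedule is an assumption, not taken from the paper):

```python
import numpy as np

def fourier_features(s, num_freqs=4):
    """Encode scalar track progress s in [0, 1) with sin/cos pairs at
    doubling frequencies, giving the policy a smooth, lap-periodic
    representation of where it is on the track."""
    freqs = 2.0 ** np.arange(num_freqs) * 2.0 * np.pi
    return np.concatenate([np.sin(freqs * s), np.cos(freqs * s)])

# The encoding is periodic: the same track position maps to the same
# features one lap later, so memories attach to places, not to time.
f0 = fourier_features(0.25)
f1 = fourier_features(1.25)
```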
| |
| 09:00-10:30, Paper TuI1LB.19 | Add to My Program |
| Low-Dimensional Tactile Glove for Visuo-Tactile Robot Hand Control: A Preliminary Study |
|
| Baek, Gi-gwang | DGIST |
| Kim, DongWook | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Dexterous Manipulation, Perception for Grasping and Manipulation, Force and Tactile Sensing
Abstract: Dexterous control of multi-joint manipulators such as humanoid robot hands remains challenging when relying solely on visual feedback. Cameras are fundamentally limited in measuring contact forces, slip, and surface deformation during physical interaction, and are susceptible to occlusion in contact-rich scenarios. Tactile sensing is therefore widely considered essential for robust dexterous manipulation. However, existing approaches predominantly rely on high-cost sensors, imposing non-trivial burdens on data collection and robot deployment, which constrains the scalability of tactile sensing in practical robotic systems. To address these limitations, we present a low-dimensional wearable tactile glove as a scalable platform for visuo-tactile robot hand control, and propose a two-level learning framework built upon it. The glove incorporates 20 FSR400 sensors and achieves stable 300 Hz acquisition through dual multiplexing and WiFi TCP communication with clock synchronization. Hardware validation confirms low inter-frame jitter and noise-free signal acquisition across all channels. The proposed framework first investigates whether binary tactile signals are sufficient to recover meaningful force distributions through learning, and subsequently extends to visuo-tactile representation learning, where tactile and visual modalities are jointly leveraged to learn shared cross-modal representations for downstream manipulation tasks.
|
| |
| 09:00-10:30, Paper TuI1LB.20 | Add to My Program |
| Towards Massively Parallel Motion Planning with Inverse Dynamics |
|
| Tsikelis, Ioannis | Inria |
| Mingo Hoffman, Enrico | INRIA Nancy - Grand Est |
Keywords: Optimization and Optimal Control, Computer Architecture for Robotic and Automation, Underactuated Robots
Abstract: Parallel evaluation of robotic system environments is becoming increasingly popular in modern robotics applications for machine learning and stochastic control. At the same time, the field of model-based control has matured enough to provide solutions that cover the needs of sophisticated robotics platforms. However, few works address the parallelization of such solvers to be combined with the above approaches and accelerate research in robot planning and control. We present preliminary results towards a novel implementation of a batched SQP solver for equality-constrained optimal control. After linearizing the dynamics in the SQP step, we employ a state-control equality constrained LQR solver. The additional equality constraints yield a structured system at each stage that can be solved via a Riccati-recursion-based block elimination. We evaluate our approach on an inverse-dynamics-based optimal control problem, in contrast to the forward-dynamics formulations typical of related works. Our results demonstrate computational efficiency and structural advantages for massively parallel environments. Our implementation, available here, is developed in PyTorch, taking advantage of the library's batched linear algebra suite for parallelization.
|
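The Riccati recursion at the core of such solvers batches naturally over environments; a scalar-state sketch (plain NumPy standing in for PyTorch's batched linear algebra, with made-up dynamics) is:

```python
import numpy as np

def batched_scalar_riccati(a, b, q, r, T=50):
    """Backward Riccati recursion for a batch of scalar LQR problems
    x' = a*x + b*u with stage cost q*x^2 + r*u^2. All arguments are
    arrays over the batch; every operation below is elementwise, which
    is what makes the recursion trivially parallel on a GPU."""
    p = q.copy()
    for _ in range(T):
        k = (b * p * a) / (r + b * p * b)   # feedback gain
        p = q + a * p * a - a * p * b * k   # cost-to-go update
    return k, p

a = np.array([1.0, 1.2])
b = np.array([1.0, 0.5])
q = np.array([1.0, 1.0])
r = np.array([1.0, 1.0])
k, p = batched_scalar_riccati(a, b, q, r)
```

For the first problem (a = b = q = r = 1) the recursion converges to the known DARE solution p = (1 + sqrt(5)) / 2.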
| |
| 09:00-10:30, Paper TuI1LB.21 | Add to My Program |
| Toward Human Preference Optimization for Vision-Language-Action Models: A Pilot Study on the Limits of Imitation Learning |
|
| Lee, Tae-Won | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
| Kim, DongWook | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Imitation Learning, Reinforcement Learning, Deep Learning in Grasping and Manipulation
Abstract: Vision-Language-Action (VLA) models trained via imitation learning have achieved impressive results on robotic manipulation, yet their performance degrades significantly on complex, multi-step tasks. We evaluate NVIDIA GR00T N1.6, a state-of-the-art cross-embodiment VLA model (~1.09B parameters), on the SimplerEnv Fractal benchmark to systematically identify where imitation learning falls short. We conduct closed-loop evaluation across six manipulation tasks of increasing complexity using a Google Robot embodiment, with 200 episodes per task. Our results reveal a stark performance gap: simple tasks such as picking a can achieve 90.0% success rate, while complex sequential tasks such as placing an object in a closed drawer achieve only 4.5%. Average episode time further confirms this — simple tasks complete in under 3 seconds, while complex tasks approach the maximum step timeout, indicating the policy fails to make meaningful progress. We identify three failure modes driving this degradation: absence of recovery behaviors, compounding distribution shift over long horizons, and inability to optimize trajectory quality beyond mimicking demonstrations. Based on these findings, we propose Human Preference Optimization (HPO) as a post-training strategy — leveraging human trajectory rankings and reinforcement learning to refine VLA policies beyond what demonstration data alone can teach.
|
| |
| 09:00-10:30, Paper TuI1LB.22 | Add to My Program |
| Enhancing VLA Precision in Robotic Manipulation Via FiLM-Based Force/Torque-Vision Integration |
|
| Nam, Gunhee | Chonnam National University |
| Hong, Ayoung | Chonnam National University |
Keywords: Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: We propose a multimodal integration framework to enhance the precision of Vision-Language-Action (VLA) models in contact-rich robotic tasks. Although visual perception is essential for task grounding, it often lacks the force awareness required for high-precision alignment and insertion. To address this limitation, we leverage Feature-wise Linear Modulation (FiLM) to condition intermediate visual representations on 6-axis Force/Torque (F/T) data. This lightweight fusion strategy allows the model to modulate its action predictions based on real-time physical resistance without incurring significant computational overhead. Experimental results on a UR5e manipulator demonstrate that the proposed F/T-Vision integration enhances contact stability and precision in demanding manipulation tasks compared with vision-only baselines.
|
| |
| 09:00-10:30, Paper TuI1LB.23 | Add to My Program |
| Trajectory Design Trade-Offs in a 1-DoF Transformable Wheel for Obstacle Climbing |
|
| Lee, Jaebaek | Pusan National University |
| Kim, Youngsoo | Pusan National University |
Keywords: Climbing Robots, Performance Evaluation and Benchmarking, Wheeled Robots
Abstract: As mobile service robots expand into human living environments, their ability to negotiate structured obstacles—such as thresholds, curbs, and stairs—has become increasingly important. Transformable wheels offer a compelling alternative to high-DoF locomotion by preserving the efficiency and maneuverability of conventional wheels on flat ground while selectively reconfiguring their geometry only when obstacle negotiation is required. Among such systems, the 1-DoF RPRP transformable wheel achieves step climbing with minimal actuation by mechanically coupling radial transformation and spoke tilting through an internal linkage. This reduced-actuation architecture, however, also creates a distinctive design challenge: because the mechanism lacks kinematic redundancy, a very small number of trajectory parameters exert a disproportionately large influence on climbing behavior. As a result, performance is governed less by control flexibility and more by how transformation timing and posture are coordinated throughout the climbing cycle. Despite this, prior studies on transformable wheels have largely focused on mechanism design and kinematic feasibility, leaving insufficient understanding of how trajectory design shapes the trade-offs among motion smoothness, actuator load, and power demand [1]–[3]. To address this gap, this study presents a trajectory-level design-space exploration framework for a 1-DoF transformable wheel, in which the obstacle-climbing motion is parameterized us
|
| |
| TuAT1 Award Session, Hall A2 |
Add to My Program |
| Award Finalists 1 |
|
| |
| Chair: Kober, Jens | University of Stuttgart |
| |
| 11:00-11:10, Paper TuAT1.1 | Add to My Program |
| GRITS: A Spillage-Aware Guided Diffusion Policy for Robot Food Scooping Tasks |
|
| Tai, Yen-Ling | National Yang Ming Chiao Tung University |
| Yang, Yi-Ru | National Yang Ming Chiao Tung University |
| Yu, Kuan-Ting | XYZ Robotics |
| Chao, Yu-Wei | NVIDIA |
| Chen, Yi-Ting | National Yang Ming Chiao Tung University |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning Methods
Abstract: Robotic food scooping is a critical manipulation skill for food preparation and service robots. However, existing robot learning algorithms, especially learn-from-demonstration methods, still struggle to handle diverse and dynamic food states, which often results in spillage and reduced reliability. In this work, we introduce GRITS: A Spillage-Aware Guided Diffusion Policy for Robot Food Scooping Tasks. This framework leverages guided diffusion policy to minimize food spillage during scooping and to ensure reliable transfer of food items from the initial to the target location. Specifically, we design a spillage predictor that estimates the probability of spillage given current observation and action rollout. The predictor is trained on a large-scale simulated dataset with food spillage scenarios, constructed from four primitive shapes (spheres, cubes, cones, and cylinders) with varied physical properties such as mass, friction, and particle size. At inference time, the predictor serves as a differentiable guidance signal, steering the diffusion sampling process toward safer trajectories while preserving task success. We validate GRITS on a real-world robotic food scooping platform. GRITS is trained on six food categories and evaluated on ten unseen categories with different shapes and quantities. GRITS achieves an 82% task success rate and a 4% spillage rate, reducing spillage by over 40% compared to baselines without guidance, thereby demonstrating its effectiveness. More details are available on our project website: https://hcis-lab.github.io/GRITS/.
|
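The guidance mechanism, steering diffusion samples down the gradient of a spillage predictor, can be sketched with a toy analytic predictor standing in for the learned network (every name and value here is illustrative):

```python
import numpy as np

DANGER = np.array([2.0, 0.0])  # toy "high-spillage" action region

def spillage_prob(action):
    """Toy stand-in for the learned spillage predictor: probability
    rises as the action approaches the dangerous region."""
    return np.exp(-np.sum((action - DANGER) ** 2))

def spillage_grad(action):
    """Analytic gradient of the toy predictor w.r.t. the action; the
    real system would get this via autograd through the predictor."""
    return -2.0 * (action - DANGER) * spillage_prob(action)

def guided_step(action, noise_scale=0.0, guidance_scale=0.5):
    """One guided sampling step: apply the (omitted) denoiser update,
    then nudge the sample down the spillage-probability gradient."""
    return action - guidance_scale * spillage_grad(action) \
           + noise_scale * np.random.randn(*action.shape)

a = np.array([1.5, 0.0])        # initial sample near the danger zone
for _ in range(100):
    a = guided_step(a)
```

Repeated guided steps drive the sampled action away from the high-spillage region while the (here omitted) denoiser keeps it on the data manifold.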
| |
| 11:10-11:20, Paper TuAT1.2 | Add to My Program |
| Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-Language Models |
|
| Li, Mingen | University of Minnesota Twin Cities |
| Yu, Houjian | University of Minnesota, Twin Cities |
| Huang, Yixuan | Princeton University |
| Hong, Youngjin | University of Minnesota |
| Ye, Hantao | University of Minnesota Twin Cities |
| Choi, Changhyun | University of Minnesota, Twin Cities |
Keywords: Deep Learning in Grasping and Manipulation, Reinforcement Learning
Abstract: Long-horizon routing tasks of deformable linear objects (DLOs), such as cables and ropes, are common in industrial assembly lines and everyday life. These tasks are particularly challenging because they require robots to manipulate DLOs with long-horizon planning and reliable skill execution. Successfully completing such tasks demands adapting to their nonlinear dynamics, decomposing abstract routing goals, and generating multi-step plans composed of multiple skills, all of which require accurate high-level reasoning during execution. In this paper, we propose a fully autonomous hierarchical framework for solving challenging DLO routing tasks. Given an implicit or explicit routing goal expressed in language, our framework leverages vision-language models (VLMs) for in-context high-level reasoning to synthesize feasible plans, which are then executed by low-level skills trained via reinforcement learning. To improve robustness over long horizons, we further introduce a failure recovery mechanism that reorients the DLO into insertion-feasible states. Our approach generalizes to diverse scenes involving object attributes, spatial descriptions, implicit language commands, and extended 5-clip settings. It achieves an overall success rate of 92% across long-horizon routing scenarios. Please refer to our project page: https://icra2026-dloroute.github.io/DLORoute/
|
| |
| 11:20-11:30, Paper TuAT1.3 | Add to My Program |
| FP3: A 3D Foundation Policy for Robotic Manipulation |
|
| Yang, Rujia | Tsinghua University |
| Chen, Geng | UC San Diego |
| Wen, Chuan | Shanghai Jiao Tong University |
| Gao, Yang | Tsinghua University |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Following its success in natural language processing and computer vision, foundation models that are pre-trained on large-scale multi-task datasets have also shown great potential in robotics. However, most existing robot foundation models rely solely on 2D image observations, ignoring 3D geometric information, which is essential for robots to perceive and reason about the 3D world. In this paper, we introduce FP3, a large-scale 3D foundation policy model for robotic manipulation. FP3 builds on a scalable diffusion transformer architecture and is pre-trained on 60k trajectories with point cloud observations. With the model design and diverse pre-training data, FP3 can be efficiently fine-tuned for downstream tasks while exhibiting strong generalization capabilities. Experiments on real robots demonstrate that with only 80 demonstrations, FP3 is able to learn a new task with a success rate of over 90% in novel environments with unseen objects, significantly surpassing existing robot foundation models.
|
| |
| 11:30-11:40, Paper TuAT1.4 | Add to My Program |
| Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning |
|
| Jiang, Tianchong | Toyota Technological Institute at Chicago |
| Ji, Jingtian | Toyota Technological Institute at Chicago |
| Tan, Xiangshan | Toyota Technological Institute at Chicago |
| Fang, Jiading | Toyota Technological Institute at Chicago |
| Bhattad, Anand | Johns Hopkins University |
| Guizilini, Vitor | Toyota Research Institute |
| Walter, Matthew | Toyota Technological Institute at Chicago |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Deep Learning Methods
Abstract: We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in robosuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes, a shortcut that collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code to facilitate reproducibility and future research.
|
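The Plücker embedding used above gives each pixel ray a representation that depends only on the line itself, not on the sample point chosen along it; a quick sketch:

```python
import numpy as np

def plucker_ray(origin, direction):
    """Plücker coordinates (d, o x d) of a ray through `origin` with
    direction `direction`; d is normalized so the embedding is
    invariant to the point chosen along the ray."""
    d = direction / np.linalg.norm(direction)
    m = np.cross(origin, d)
    return np.concatenate([d, m])

# Two points on the same ray give the same 6-D embedding.
o = np.array([1.0, 0.0, 0.0])
d = np.array([0.0, 0.0, 1.0])
r1 = plucker_ray(o, d)
r2 = plucker_ray(o + 2.0 * d, d)  # origin shifted along the ray
```

Because the embedding encodes camera extrinsics per pixel, the policy can condition on viewpoint without inferring it from background cues.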
| |
| 11:40-11:50, Paper TuAT1.5 | Add to My Program |
| RCM Constraint-Consistent Dynamic Control in Surgical Robots |
|
| Li, Yu | Technical University of Munich |
| Sadeghian, Hamid | Technical University of Munich |
| Yang, Zewen | Technical University of Munich |
| Le Mesle, Valentin | Technical University of Munich |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Medical Robots and Systems, Redundant Robots, Compliance and Impedance Control
Abstract: Robotic-assisted minimally invasive surgery (RAMIS) requires accurate enforcement of the remote center of motion (RCM) constraint to ensure safe tool motion through a trocar. Existing virtual RCM controllers are commonly formulated either at the kinematic level or as task-space objectives, which makes torque-level enforcement under trocar motion and physical interaction difficult to formulate consistently. This paper models the RCM as a rheonomic holonomic constraint and incorporates it into a projection-based inverse-dynamics controller with explicit constrained/free-motion torque decomposition. The resulting formulation unifies kinematic RCM enforcement and task-space tracking at the torque level, while preserving a constraint-consistent structure for residual regulation and null-space compliance. The proposed controller is validated in simulation and on a RAMIS training platform against representative projection-based and constrained-dynamics baselines. Across spiral tracking, varying insertion depth, moving trocar conditions, and human interaction, the method achieves lower RCM residuals and smoother torque profiles while maintaining accurate tool-tip tracking. These results support the use of constraint-consistent torque control for reliable virtual RCM enforcement in surgical robotics. The project page is available at https://rcmpc-cube.github.io.
|
| |
| 11:50-12:00, Paper TuAT1.6 | Add to My Program |
| One-Shot Autofocus Via User-Adaptive Gaze Control for Robot-Assisted Microsurgery |
|
| Luan, Yunfei | Shanghai Jiao Tong University |
| Liu, Yuxuan | Shanghai Jiao Tong University |
| Zhuge, Yuyang | Shanghai Key Laboratory of Flexible Medical Robotics, Tongren Hospital, Institute of Medical Robotics, Shanghai Jiao Tong Univer |
| Luo, Yating | Shanghai Jiao Tong University |
| Guo, Yao | Shanghai Jiao Tong University |
| Yang, Guang-Zhong | Shanghai Jiao Tong University |
|
|
| |
| 12:00-12:10, Paper TuAT1.7 | Add to My Program |
| SurgAM: Surgical Affordance Map Prediction with Multimodal Feature Fusion for Robot Autonomy |
|
| Song, Lei | The Chinese University of Hong Kong |
| Long, Yonghao | The Chinese University of Hong Kong |
| Xu, Mengya | National University of Singapore |
| Geng, Jiayi | Peking University People's Hospital |
| Chen, Xiuyuan | Peking University People's Hospital |
| Dou, Qi | The Chinese University of Hong Kong |
Keywords: Surgical Robotics: Laparoscopy, AI-Enabled Robotics
Abstract: Surgical automation is being increasingly studied, yet bridging visual scene understanding with autonomous action planning remains a fundamental challenge. While much research effort has been made on scene perception (e.g., tool recognition and scene segmentation), understanding and predicting actionable possibilities for surgical automation is still underexplored. In this paper, we introduce surgical affordance prediction, which identifies actionable regions for fundamental surgical actions from visual data. Specifically, a novel adaptive feature fusion framework is proposed that leverages the complementary strengths of a self-supervised vision transformer encoder for its superior semantic understanding and a large-scale generative model encoder for its spatially-aware capability. Furthermore, we introduce a hierarchical prompt learning mechanism to adapt to varying procedural contexts. Finally, a scene-guided attention decoder is proposed to focus on critical surgical areas while suppressing background distractions. To validate the effectiveness, we established a new dataset, derived from publicly available surgical datasets with affordance annotations for three basic surgical actions: aspiration, clipping, and retraction. Extensive experiments demonstrate that our approach achieves state-of-the-art performance. Moreover, we validate our framework's applicability for downstream automation on a realistic lung and prostate phantom, and results show that the predicted affordance maps successfully enable autonomous surgical actions.
|
| |
| 12:10-12:20, Paper TuAT1.8 | Add to My Program |
| Geometry-Aware Visual Odometry for Bronchoscopic Navigation Via High-Gain Observer Fusion |
|
| Kasaei, Mohammadreza | University of Edinburgh |
| Zhang, Francis Xiatian | University of Edinburgh |
| Li, Feng | University of Edinburgh, Centre for Inflammation Research |
| Alambeigi, Farshid | University of Texas at Austin |
| Dhaliwal, Kev | University of Edinburgh, Center for Inflammation Research, |
| Khadem, Mohsen | University of Edinburgh |
Keywords: Medical Robots and Systems, Sensor Fusion, Data Sets for Robotic Vision
Abstract: Navigational bronchoscopy is critical for pulmonary interventions, yet current platforms depend heavily on pre-operative CT or external sensors, limiting their use in critical care and resource-constrained settings. Vision-only navigation offers a scalable alternative, but conventional visual odometry (VO) struggles with texture-poor airway images, specularities, and the vanishing-point singularities of tubular anatomy, leading to frequent tracking failures and drift. We present a geometry-aware VO framework that explicitly leverages vanishing-point cues from airway lumens. Detected lumens are back-projected to 3D rays, whose weighted fusion yields a stable forward heading even when parallax cues are absent. This heading, together with looming-based velocity estimates, is fused with noisy VO outputs using a bespoke high-gain observer that enforces airway-following priors and rejects drift. We validate the method on ex-vivo mechanically ventilated human lungs using electromagnetic tracking as ground truth. Compared to state-of-the-art pipelines (ORB-SLAM2, LoFTR-VO, DPVO), our approach reduces absolute trajectory error by more than 50% and achieves the lowest relative pose error across all test sequences.
|
| |
| 12:20-12:30, Paper TuAT1.9 | Add to My Program |
| FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment |
|
| Barbas Laina, Sebastián | TU Munich |
| Boche, Simon | Technical University of Munich |
| Papatheodorou, Sotiris | University of Patras |
| Schaefer, Simon | Technical University of Munich |
| Jung, Jaehyung | Technical University of Munich |
| Oleynikova, Helen | ETH Zurich |
| Leutenegger, Stefan | ETH Zurich |
Keywords: Semantic Scene Understanding, Mapping, Aerial Systems: Perception and Autonomy
Abstract: Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixel-wise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
|
| |
| TuAT2 Regular Session, Hall A3 |
Add to My Program |
| Field and Space Robotics |
|
| |
| Chair: Kyriakopoulos, Kostas | New York University - Abu Dhabi |
| |
| 11:00-11:10, Paper TuAT2.1 | Add to My Program |
| Robotic Classification of Divers’ Swimming States Using Visual Pose Keypoints As IMUs |
|
| Kutzke, Demetrious T. | University of Minnesota |
| Wu, Ying-Kun | University of Minnesota |
| Terveen, Elizabeth | Carnegie Mellon University |
| Sattar, Junaed | University of Minnesota |
Keywords: Marine Robotics, Search and Rescue Robots, Deep Learning for Visual Perception
Abstract: Traditional human activity recognition uses either direct image analysis or data from wearable inertial measurement units (IMUs), but can be ineffective in challenging underwater environments. We introduce a novel hybrid approach that bridges this gap to monitor scuba diver safety. Our method leverages computer vision to generate high-fidelity motion data, effectively creating a “pseudo-IMU” from a stream of 3D human joint keypoints. This technique circumvents the critical problem of wireless signal attenuation in water, which plagues conventional diver-worn sensors communicating with an autonomous underwater vehicle (AUV). We apply this system to the vital task of identifying anomalous scuba diver behavior that signals the onset of a medical emergency such as cardiac arrest—a leading cause of scuba diving fatalities. By integrating our classifier onboard an AUV and conducting experiments with simulated distress scenarios, we demonstrate the utility and effectiveness of our method for advancing robotic monitoring and diver safety.
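The "pseudo-IMU" idea, turning a stream of 3D joint keypoints into IMU-like motion data, can be approximated by finite differencing of the keypoint trajectory. The sketch below is our simplified stand-in, not the paper's pipeline; the function name and sampling setup are assumptions.

```python
import numpy as np

def pseudo_imu_accel(keypoints, dt):
    """Approximate the linear acceleration of one tracked 3D joint by a
    second central difference over its keypoint trajectory:
        a_t ≈ (p[t+1] - 2 p[t] + p[t-1]) / dt^2
    keypoints: (T, 3) array of joint positions; dt: sample period."""
    p = np.asarray(keypoints, dtype=float)
    return (p[2:] - 2.0 * p[1:-1] + p[:-2]) / dt**2

# Sanity check on a constant-acceleration trajectory p_z(t) = t^2
# (i.e. a = 2 along z), sampled at 10 Hz.
t = np.arange(10) * 0.1
traj = np.stack([np.zeros_like(t), np.zeros_like(t), t**2], axis=1)
acc = pseudo_imu_accel(traj, 0.1)
```

A real system would additionally smooth the keypoints before differencing, since differentiation amplifies pose-estimation noise.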
|
| |
| 11:10-11:20, Paper TuAT2.2 | Add to My Program |
| SMARTPOSE: Development of a Sample-Efficient, Model-Agnostic, Robust, Two-Stage POSE Estimator for Unknown Satellites |
|
| Joshi, Yash Kishorbhai | Indian Institute of Science |
| Sundaram, Suresh | Indian Institute of Science |
Keywords: Space Robotics and Automation, Robotics in Under-Resourced Settings, Computer Vision for Automation
Abstract: Accurate relative pose estimation is critical for autonomous close-proximity satellite operations, such as on-orbit servicing and debris removal. However, this task remains highly challenging for unknown, non-cooperative targets due to the unavailability of geometric priors, scarce training data, and the extreme variations in lighting and backgrounds inherent to the space environment. Existing approaches for pose estimation can be categorized into: 1. model-based methods, which achieve higher accuracy but assume access to 3D CAD models of target satellites; and 2. model-free methods, which can be used for unknown satellites but typically suffer from reduced accuracy and robustness. To bridge this gap, this paper introduces a sample-efficient, model-free pose estimation framework that achieves high accuracy and robustness for unknown satellites while demonstrating strong generalization under the uncertain conditions inherent to the space environment. The proposed method utilizes a novel appearance-aware 3D reconstruction to generate a satellite model from images, accounting for different lighting conditions during training. This model is then used to generate a large, diverse dataset to train a pose predictor network (stage 1). The predicted pose is refined using the 3D reconstruction by utilizing the appearance information of the target image along with differentiable rendering (stage 2). Evaluated across SPEED+ and URSO Soyuz datasets, our approach achieves state-of-the-art accuracy and proves highly robust to test-time domain shifts, notably reducing rotation error by 80% on the challenging URSO Soyuz dataset.
|
| |
| 11:20-11:30, Paper TuAT2.3 | Add to My Program |
| Towards Versatile Opti-Acoustic Sensor Fusion and Volumetric Mapping for Safe Underwater Navigation |
|
| Collado-Gonzalez, Ivana | Stevens Institute of Technology |
| McConnell, John | United States Naval Academy |
| Englot, Brendan | Stevens Institute of Technology |
Keywords: Marine Robotics, Sensor Fusion, Mapping
Abstract: Accurate 3D volumetric mapping is critical for autonomous underwater vehicles operating in obstacle-rich environments. Vision-based perception provides high-resolution data but fails in turbid conditions, while sonar is robust to lighting and turbidity but suffers from low resolution and elevation ambiguity. This paper presents a volumetric mapping framework that fuses a stereo sonar pair with a monocular camera to enable safe navigation under varying visibility conditions. Overlapping sonar fields of view resolve elevation ambiguity, producing fully defined 3D point clouds at each time step. The framework identifies regions of interest in camera images, associates them with corresponding sonar returns, and combines sonar range with camera-derived elevation cues to generate additional 3D points. Each 3D point is assigned a confidence value reflecting its reliability. These confidence-weighted points are fused using a Gaussian Process Volumetric Mapping framework that prioritizes the most reliable measurements. Experimental comparisons with other opti-acoustic and sonar-based approaches, along with field tests in a marina environment, demonstrate the method’s effectiveness in capturing complex geometries and preserving critical information for robot navigation in both clear and turbid conditions. Our code will be released as open-source to support community adoption.
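The confidence-weighted fusion of opti-acoustic measurements described above can be illustrated with a minimal sketch: co-located 3D points from different sensors are combined so that higher-confidence returns dominate. The function name, the averaging scheme, and the example confidences are our assumptions, not the paper's formulation (which uses a Gaussian Process volumetric map).

```python
import numpy as np

def fuse_points(points, confidences):
    """Confidence-weighted average of co-located 3D measurements, so that
    more reliable returns (e.g. clear-water camera cues) dominate over
    lower-confidence ones (e.g. ambiguous sonar elevation)."""
    P = np.asarray(points, dtype=float)
    w = np.asarray(confidences, dtype=float)
    return (w[:, None] * P).sum(axis=0) / w.sum()

sonar_pt = [1.0, 0.0, 2.0]    # low-confidence sonar-derived point
camera_pt = [1.0, 0.0, 2.2]   # high-confidence camera-derived point
fused = fuse_points([sonar_pt, camera_pt], [0.2, 0.8])
```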
|
| |
| 11:30-11:40, Paper TuAT2.4 | Add to My Program |
| Design, Modeling and Direction Control of a Wire-Driven Robotic Fish Based on a 2-DoF Crank–Slider Mechanism |
|
| Wang, Yita | Univ. of Tokyo |
| Chen, Chen | Univ. of Tokyo |
| Chen, Yicheng | Univ. of Tokyo |
| Li, Jinjie | Univ. of Tokyo |
| Motegi, Yuichi | Univ. of Tokyo |
| Ohkuma, Kenji | Univ. of Tokyo |
| Maki, Toshihiro | Univ. of Tokyo |
| Zhao, Moju | Univ. of Tokyo |
Keywords: Marine Robotics, Mechanism Design, Biologically-Inspired Robots
Abstract: Robotic fish have attracted growing attention in recent years owing to their biomimetic design and potential applications in environmental monitoring and biological surveys. Among robotic fish employing the Body–Caudal Fin (BCF) locomotion pattern, motor-driven actuation is widely adopted. Some approaches utilize multiple servo motors to achieve precise body curvature control, while others employ a brushless motor to drive the tail via wire or rod, enabling higher oscillation and swimming speeds. However, the former approaches typically result in limited swimming speed, whereas the latter suffer from poor maneuverability, with few capable of smooth turning. To address this trade-off, we develop a wire-driven robotic fish equipped with a 2-degree-of-freedom (DoF) crank–slider mechanism that decouples propulsion from steering, enabling both high swimming speed and agile maneuvering. In this paper, we first present the design of the robotic fish, including the elastic skeleton, waterproof structure, and the actuation mechanism that realizes the decoupling. We then establish the actuation modeling and body dynamics to analyze the locomotion behavior. Furthermore, we propose a combined feedforward–feedback control strategy to achieve independent regulation of propulsion and steering. Finally, we validate the feasibility of the design, modeling, and control through a series of prototype experiments, demonstrating swimming, turning, and directional control.
|
| |
| 11:40-11:50, Paper TuAT2.5 | Add to My Program |
| RAVEN: Resilient Aerial Navigation Via Open-Set Semantic Memory and Behavior Adaptation |
|
| Kim, Seungchan | Carnegie Mellon University |
| Alama, Omar | Carnegie Mellon University |
| Kurdydyk, Dmytro | Davidson College |
| Keller, John | Carnegie Mellon University |
| Keetha, Nikhil Varma | Carnegie Mellon University |
| Wang, Wenshan | Carnegie Mellon University |
| Bisk, Yonatan | Carnegie Mellon University |
| Scherer, Sebastian | Carnegie Mellon University |
Keywords: Aerial Systems: Perception and Autonomy, Semantic Scene Understanding, Vision-Based Navigation
Abstract: Aerial outdoor semantic navigation requires robots to explore large, unstructured environments to locate target objects. Recent advances in semantic navigation have demonstrated open-set object-goal navigation in indoor settings, but these methods remain limited by constrained spatial ranges and structured layouts, making them unsuitable for long-range outdoor search. While outdoor semantic navigation approaches exist, they either rely on reactive policies based on current observations, which tend to produce short-sighted behaviors, or precompute scene graphs offline for navigation, limiting adaptability to online deployment. We present RAVEN, a 3D memory-based, behavior tree framework for aerial semantic navigation in unstructured outdoor environments. It (1) uses a spatially consistent semantic voxel-ray map as persistent memory, enabling long-horizon planning and avoiding purely reactive behaviors, (2) combines short-range voxel search and long-range ray search to scale to large environments, (3) leverages a large vision-language model to suggest auxiliary cues, mitigating sparsity of outdoor targets. These components are coordinated by a behavior tree, which adaptively switches behaviors for robust operation. We evaluate RAVEN in 10 photorealistic outdoor simulation environments over 100 semantic tasks, encompassing single-object search, multi-class, multi-instance navigation and sequential task changes. Results show RAVEN outperforms baselines by 85.25% in simulation and demonstrate its real-world applicability through deployment on an aerial robot in outdoor field tests.
|
| |
| 11:50-12:00, Paper TuAT2.6 | Add to My Program |
| A Coverage Motion Planning Approach for UVMS-Based Propeller Cleaning in Obstacle-Occluded Environments |
|
| Kopo, Raksi | New York University Abu Dhabi |
| Tarantos, Spyridon | New York University Abu Dhabi |
| Panetsos, Fotis | New York University Abu Dhabi |
| Kyriakopoulos, Kostas | New York University - Abu Dhabi |
Keywords: Marine Robotics, Motion and Path Planning, Constrained Motion Planning
Abstract: This work addresses the problem of underwater propeller cleaning in environments containing obstacles using an Underwater Vehicle Manipulator System (UVMS). Prior propeller-cleaning approaches plan coverage tool paths without explicitly considering the connectivity of the associated Surface-Constrained Configuration Space (SCCS), leading to unnecessary lift-offs in obstacle-occluded settings. In contrast, we formulate the coverage problem in the disconnected SCCS as a Generalized Traveling Salesman Problem (GTSP) within a hierarchical framework, accounting for obstacles and attempting to minimize the number of tool lift-offs. We consider explicitly the tool lift-off paths in the GTSP cost formulation, utilizing the hierarchical framework to guide the search without exhaustively evaluating all possible paths. To achieve smoother tool paths with fewer turns, we introduce a cost that promotes alignment with desired coverage curves. Finally, we time-parameterize the coverage path into a whole-body UVMS trajectory by minimizing the duration of the cleaning task, while respecting the robot hardware limitations. The effectiveness of the proposed method is demonstrated in a realistic simulation scenario.
|
| |
| 12:00-12:10, Paper TuAT2.7 | Add to My Program |
| A Photorealistic Dataset and Vision-Based Algorithm for Anomaly Detection During Proximity Operations in Lunar Orbit |
|
| Leveugle, Selina | University of Toronto |
| Lee, Chang Won | University of Toronto |
| Stolpner, Svetlana | MDA Space |
| Langley, Chris | MDA Space |
| Grouchy, Paul | MDA Space |
| Waslander, Steven Lake | University of Toronto |
| Kelly, Jonathan | University of Toronto |
Keywords: Space Robotics and Automation, Data Sets for Robotic Vision, Computer Vision for Automation
Abstract: NASA's forthcoming Lunar Gateway space station, which will be uncrewed most of the time, will need to operate with an unprecedented level of autonomy. One key challenge is enabling the Canadarm3, the Gateway's external robotic system, to detect hazards in its environment using its onboard inspection cameras. This task is complicated by the extreme and variable lighting conditions in space. In this paper, we introduce the visual anomaly detection and localization task for the space domain and establish a benchmark based on a synthetic dataset called ALLO (Anomaly Localization in Lunar Orbit). We show that state-of-the-art visual anomaly detection methods often fail in the space domain, motivating the need for new approaches. To address this, we propose MRAD (Model Reference Anomaly Detection), a statistical algorithm that leverages the known pose of the Canadarm3 and a CAD model of the Gateway to generate reference images of the expected scene appearance. Anomalies are then identified as deviations from this model-generated reference. On the ALLO dataset, MRAD surpasses state-of-the-art anomaly detection algorithms, achieving an AP score of 62.9% at the pixel level and an AUROC score of 75.0% at the image level. Given the low tolerance for risk in space operations and the lack of domain-specific data, we emphasize the need for novel, robust, and accurate anomaly detection methods to handle the challenging visual conditions found in lunar orbit and beyond.
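The core of the reference-based detection idea above, flagging anomalies as statistical deviations from a model-rendered reference image, can be sketched as a per-pixel test. This is a deliberately simplified stand-in for MRAD: the function, the noise model, and the threshold are our assumptions.

```python
import numpy as np

def anomaly_map(observed, reference, sigma, z_thresh=3.0):
    """Flag pixels whose deviation from the rendered reference exceeds a
    z-score threshold. observed, reference: (H, W) grayscale images;
    sigma: expected per-pixel noise std (scalar or (H, W) array)."""
    z = np.abs(observed - reference) / sigma
    return z > z_thresh

# Toy example: one bright pixel that the reference render does not predict.
ref = np.zeros((4, 4))
obs = ref.copy()
obs[1, 2] = 1.0
mask = anomaly_map(obs, ref, sigma=0.1)
```

In practice the reference would come from rendering the Gateway CAD model at the known Canadarm3 camera pose, and the statistics would account for lighting variation.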
|
| |
| 12:10-12:20, Paper TuAT2.8 | Add to My Program |
| Learning to Anchor Visual Odometry: KAN-Based Pose Regression for Planetary Landing |
|
| Luo, Xubo | University of Chinese Academy of Sciences |
| Li, Zhaojin | Chinese Academy of Sciences |
| Wan, Xue | Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences |
| Zhang, Wei | Chinese Academy of Sciences |
| Shu, Leizheng | Chinese Academy of Sciences |
|
|
| |
| 12:20-12:30, Paper TuAT2.9 | Add to My Program |
| Design of Wheel Grouser Geometry with Reduced Sinkage for LEV-1 Lunar Rover |
|
| Otsuki, Masatsugu | Japan Aerospace Exploration Agency |
| Yoshikawa, Kent | Japan Aerospace Exploration Agency |
| Maeda, Takao | Tokyo University of Agriculture and Technology |
| Usami, Naoto | Japan Aerospace Exploration Agency |
| Yoshimitsu, Tetsuo | Japan Aerospace Exploration Agency |
Keywords: Space Robotics and Automation, Wheeled Robots, Dynamics
Abstract: Surface-mobile platforms have explored the moon and the red planet for nearly half a century, providing a wealth of scientific data. However, surface mobility on planetary bodies remains a challenging task. In this paper, the formulation of the reaction force generated by a grouser with a generalized geometry for a planetary rover wheel is presented, along with its verification through comparisons with results obtained using the conventional geometry. In a simulation study, the resistive force theory is applied to a general grouser geometry model. The study determines the impact of several parameters, particularly the grouser inclination, on draw-bar pull. The results obtained from the study suggest the formulation of a grouser design that is nearly optimal in its capacity to maximize the draw-bar pull per sinkage. We also apply the proposed geometry to the wheel on LEV-1, demonstrating that it works well in actual lunar operations.
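The resistive-force-theory style of analysis described above can be caricatured in a few lines: discretize the grouser plate into elements, assign each a stress that grows with burial depth, and sum the horizontal (thrust) components. The depth-linear stress model, the gains, and the geometry are our toy assumptions, not the paper's calibrated RFT stress functions.

```python
import numpy as np

def drawbar_pull(length, incline_deg, depth, n_elems=50, k=1.0e5):
    """Toy RFT-style drawbar-pull estimate for a straight grouser plate of
    given length, inclination from vertical, and tip burial depth.
    Each element's stress is proportional to its burial depth (a crude
    stand-in for RFT stress tables); thrust is the horizontal component."""
    phi = np.radians(incline_deg)
    s = (np.arange(n_elems) + 0.5) / n_elems * length  # element positions
    z = np.clip(depth - s * np.cos(phi), 0.0, None)    # element burial depths
    dA = length / n_elems                              # per-unit-width area
    sigma = k * z                                      # depth-linear stress
    return float(np.sum(sigma * dA) * np.sin(phi))

pull_45 = drawbar_pull(0.05, 45.0, 0.1)
pull_10 = drawbar_pull(0.05, 10.0, 0.1)
```

Even this toy reproduces the qualitative effect the paper studies: grouser inclination changes the drawbar pull obtained for a given sinkage.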
|
| |
| TuAT3 Regular Session, Lehar 1-4 |
Add to My Program |
| Grasping and Manipulation |
|
| |
| Co-Chair: Ichnowski, Jeffrey | Carnegie Mellon University |
| |
| 11:00-11:10, Paper TuAT3.1 | Add to My Program |
| Spatially-Anchored Tactile Awareness for Robust Dexterous Manipulation |
|
| Huang, Jialei | Tsinghua University |
| Ye, Yang | Wuhan University |
| Gong, Yuanqing | Sharpa |
| Zhu, Xuezhou | Sharpa |
| Gao, Yang | Tsinghua University |
| Zhang, Kaifeng | Shanghai Qi Zhi Institute |
Keywords: Dexterous Manipulation, Imitation Learning, Multifingered Hands
Abstract: Dexterous manipulation requires precise geometric reasoning, yet existing visuo-tactile learning methods struggle with sub-millimeter precision tasks that are routine for traditional model-based approaches. We identify a key limitation: while tactile sensors provide rich contact information, current learning frameworks fail to effectively leverage both the perceptual richness of tactile signals and their spatial relationship with hand kinematics. We believe an ideal tactile representation should explicitly ground contact measurements in a stable reference frame while preserving detailed sensory information—enabling policies to not only detect contact occurrence but also precisely infer object geometry in the hand’s coordinate system. We introduce SaTA (Spatially-anchored Tactile Awareness for dexterous manipulation), an end-to-end policy framework that explicitly anchors tactile features to the hand’s kinematic frame through forward kinematics, enabling accurate geometric reasoning without requiring object models or explicit pose estimation. Our key insight is that spatially-grounded tactile representations allow policies to not only detect contact occurrence but also precisely infer object geometry in the hand’s coordinate system. We validate SaTA on challenging dexterous manipulation tasks, including bimanual USB-C mating in free space—a task demanding sub-millimeter alignment precision—as well as light bulb installation requiring precise thread engagement and rotational control, and card sliding that demands delicate force modulation and angular precision. These tasks represent significant challenges for learning-based methods due to their stringent precision requirements. Across multiple benchmarks, SaTA significantly outperforms strong visuo-tactile baselines, improving success rates by up to 30% while reducing task completion times by 27%.
|
| |
| 11:10-11:20, Paper TuAT3.2 | Add to My Program |
| DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation |
|
| Peng, Xiongfeng | Samsung R&D Institute China-Beijing |
| Yu, Jiaqian | Samsung RnD Institute China - Beijing |
| Li, Dingzhe | Beihang University |
| Jin, Yixiang | Samsung Research China – Beijing (SRC-B) |
| Xu, Lu | Samsung |
| Yamin, Mao | Samsung |
| Zhang, Chao | Samsung Advanced Institute of Technology |
| Li, Weiming | Samsung Advanced Institute of Technology (SAIT) |
| Jang, Sujin | AI Center, Samsung Electronics Co. LTD |
| Lee, Dongwook | AI Center, Samsung Electronics Co. LTD |
| Ji, Daehyun | Samsung Advanced Institute of Technology |
Keywords: Deep Learning in Grasping and Manipulation, Manipulation Planning, Grasping
Abstract: In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.
|
| |
| 11:20-11:30, Paper TuAT3.3 | Add to My Program |
| Dynamic Robotic Cloth Folding with Efficient Koopman Operator-Based Model Predictive Control |
|
| Caldarelli, Edoardo | Istituto Italiano di Tecnologia |
| Coltraro, Franco | Institut de Robòtica i Informàtica industrial (UPC-CSIC) |
| Colomé, Adrià | Institut de Robòtica i Informàtica Industrial (CSIC-UPC) |
| Rosasco, Lorenzo | Istituto Italiano di Tecnologia & Massachusetts Institute of Technology |
| Torras, Carme | CSIC - UPC |
|
|
| |
| 11:30-11:40, Paper TuAT3.4 | Add to My Program |
| DYMO-Hair: Generalizable Volumetric Dynamics Modeling for Robot Hair Manipulation |
|
| Zhao, Chengyang | Carnegie Mellon University |
| Yoo, Uksang | Carnegie Mellon University |
| Chaudhury, Arkadeep Narayan | Carnegie Mellon University |
| Nam, Giljoo | Meta |
| Francis, Jonathan | Bosch Center for Artificial Intelligence |
| Ichnowski, Jeffrey | Carnegie Mellon University |
| Oh, Jean | Carnegie Mellon University |
Keywords: Deep Learning in Grasping and Manipulation, Representation Learning, Manipulation Planning
Abstract: Hair care is an essential daily activity, yet it remains inaccessible to individuals with limited mobility and challenging for autonomous robot systems due to the fine-grained physical structure and complex dynamics of hair. In this work, we present DYMO-Hair, a model-based robot hair care system. We introduce a novel dynamics learning paradigm that is suited for volumetric quantities such as hair, relying on an action-conditioned latent state editing mechanism, coupled with a compact 3D latent space of diverse hairstyles to improve generalizability. This latent space is pre-trained at scale using a novel hair physics simulator, enabling generalization across previously unseen hairstyles. Using the dynamics model with a Model Predictive Path Integral (MPPI) planner, DYMO-Hair is able to perform visual goal-conditioned hair styling. Experiments in simulation demonstrate that DYMO-Hair's dynamics model outperforms baselines on capturing local deformation for diverse, unseen hairstyles. DYMO-Hair further outperforms baselines in closed-loop hair styling tasks on unseen hairstyles, with an average of 22% lower final geometric error and 42% higher success rate than the state-of-the-art system. Real-world experiments exhibit zero-shot transferability of our system to wigs, achieving consistent success on challenging unseen hairstyles where the state-of-the-art system fails. Together, these results introduce a foundation for model-based robot hair care, advancing toward more generalizable, flexible, and accessible robot hair styling in unconstrained physical environments.
|
| |
| 11:40-11:50, Paper TuAT3.5 | Add to My Program |
| SceneComplete: Open-World 3D Scene Completion in Cluttered Real World Environments for Robot Manipulation |
|
| Agarwal, Aditya | Massachusetts Institute of Technology |
| Singh, Gaurav | Brown University |
| Sen, Bipasha | Massachusetts Institute of Technology |
| Lozano-Perez, Tomas | MIT |
| Kaelbling, Leslie | MIT |
Keywords: Perception for Grasping and Manipulation, RGB-D Perception, Manipulation Planning
Abstract: Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented, 3D model of a scene from a single view. SceneComplete is a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, visual-descriptors and pose-estimation) to obtain highly accurate results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand. We release the code and additional results on our website - https://scenecomplete.github.io/.
|
| |
| 11:50-12:00, Paper TuAT3.6 | Add to My Program |
| Robust Bayesian Scene Reconstruction with Retrieval-Augmented Priors for Precise Grasping and Planning |
|
| Wright, Herbert | University of Pennsylvania |
| Zhi, Weiming | The University of Sydney; Vanderbilt University |
| Matak, Martin | University of Utah |
| Johnson-Roberson, Matthew | Carnegie Mellon University |
| Hermans, Tucker | University of Utah |
Keywords: Perception for Grasping and Manipulation, Probabilistic Inference
Abstract: Constructing 3D representations of object geometry is critical for many robotics tasks, particularly manipulation problems. These representations must be built from potentially noisy partial observations. In this work, we focus on the problem of reconstructing a multi-object scene from a single RGBD image using a fixed camera. Traditional scene representation methods generally cannot infer the geometry of unobserved regions of the objects in the image. Attempts have been made to leverage deep learning to train on a dataset of known objects and representations, and then generalize to new observations. However, such approaches can be brittle to noisy real-world observations and to objects not contained in the dataset, and they do not provide well-calibrated reconstruction confidences. We propose BRRP, a reconstruction method that leverages preexisting mesh datasets to build an informative prior during robust probabilistic reconstruction. We introduce the concept of a retrieval-augmented prior, where we retrieve relevant components of our prior distribution from a database of objects during inference. The resulting prior enables estimation of the geometry of occluded portions of the in-scene objects. Our method produces a distribution over object shape that can be used for reconstruction and measuring uncertainty. We evaluate our method in both simulated scenes and in the real world. We demonstrate the robustness of our method against deep learning-only approaches while being more accurate than a method without an informative prior. Through real-world experiments, we particularly highlight the capability of BRRP to enable successful dexterous manipulation in clutter.
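The retrieval step of a retrieval-augmented prior, fetching the prior components most relevant to the current observation from an object database, can be sketched as nearest-neighbour search in an embedding space. The embedding dimensionality, similarity measure, and function names below are illustrative assumptions, not BRRP's actual machinery.

```python
import numpy as np

def retrieve_prior(query_emb, db_embs, k=2):
    """Return the indices and cosine similarities of the k database objects
    most similar to the observation embedding; their shapes would seed the
    prior mixture used during probabilistic reconstruction."""
    q = query_emb / np.linalg.norm(query_emb)
    D = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = D @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Tiny 2D toy "database" of object embeddings.
db = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.7, 0.7]])
idx, sims = retrieve_prior(np.array([1.0, 0.1]), db, k=2)
```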
|
| |
| 12:00-12:10, Paper TuAT3.7 | Add to My Program |
| Irrotational Contact Fields |
|
| Castro, Alejandro | Toyota Research Institute |
| Han, Xuchen | Toyota Research Institute |
| Masterjohn, Joseph | Toyota Research Institute |
Keywords: Contact Modeling, Simulation and Animation, Dexterous Manipulation, Dynamics
Abstract: We present a framework for generating convex approximations of complex contact models, incorporating experimentally validated models like Hunt & Crossley coupled with Coulomb's law of friction alongside the principle of maximum dissipation. Our approach is robust across a wide range of stiffness values, making it suitable for both compliant surfaces and rigid approximations. We evaluate these approximations across a wide variety of test cases, detailing properties and limitations. We implement a fully differentiable solution in the open-source robotics toolkit, Drake. Our novel hybrid approach enables computation of gradients for complex geometric models while reusing factorizations from contact resolution. We demonstrate robust simulation of robotic tasks at interactive rates, with accurately resolved stiction and contact transitions, supporting effective sim-to-real transfer.
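For reference, the Hunt & Crossley compliant-contact model mentioned above has the textbook form f = k δⁿ (1 + d δ̇), an elastic term scaled by a velocity-dependent dissipation factor. The sketch below is the classical law with arbitrary gains, not Drake's convex approximation of it.

```python
def hunt_crossley(delta, delta_dot, k=1.0e4, d=0.1, n=1.5):
    """Hunt–Crossley normal contact force for penetration delta (m) and
    penetration rate delta_dot (m/s), clamped so contact never pulls.
    k: stiffness, d: dissipation, n: Hertz-like exponent (illustrative
    values, not calibrated)."""
    if delta <= 0.0:
        return 0.0  # no penetration, no contact force
    return max(0.0, k * delta**n * (1.0 + d * delta_dot))

f_static = hunt_crossley(0.01, 0.0)   # pure elastic response
f_load = hunt_crossley(0.01, 1.0)     # loading adds dissipative force
f_sep = hunt_crossley(-0.01, 0.0)     # separated bodies exert no force
```

The clamp to zero reflects the physical constraint the paper's convex formulation must also respect: compliant contact can push but never pull.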
|
| |
| 12:10-12:20, Paper TuAT3.8 | Add to My Program |
| Leveraging Embodied Mechanical Intelligence for Learning Decluttering Tasks |
|
| Turco, Enrico | Istituto Italiano Di Tecnologia |
| Bo, Valerio | Universitat Politècnica de Catalunya |
| Castellani, Chiara | Istituto Italiano Di Tecnologia |
| Salvietti, Gionata | University of Siena |
| Malvezzi, Monica | University of Siena |
| Prattichizzo, Domenico | University of Siena |
| Pozzi, Maria | University of Siena |
Keywords: Grasping, Grippers and Other End-Effectors, Modeling, Control, and Learning for Soft Robots
Abstract: In this work, we investigate how a state-of-the-art grasp planner based on deep reinforcement learning performs when applied to a soft-rigid gripper in a decluttering task. The gripper, called Soft ScoopGripper, is endowed with a rigid scoop-shaped part that facilitates the interaction with the environment and with objects. We hypothesize that the clever design of such a gripper can facilitate the learning process, reducing the number of required training steps and eliminating the need for learning non-prehensile actions, such as pushing. To validate our hypothesis, we conducted experiments in both simulated and real-world environments, comparing the selected gripper with a rigid parallel-jaw gripper and a four-fingered soft gripper. Results show that the Soft ScoopGripper learns to effectively declutter scenes using a single action (grasping) instead of two (pushing and grasping). This is due to the fact that the scoop-shaped add-on allows the gripper to perform non-prehensile motions during the grasp action.
|
| |
| 12:20-12:30, Paper TuAT3.9 | Add to My Program |
| Language-Guided Dexterous Functional Grasping by LLM Generated Grasp Functionality and Synergy for Humanoid Manipulation (I) |
|
| Li, Zhuo | The Chinese University of Hong Kong |
| Liu, Junjia | The Chinese University of Hong Kong |
| Li, Zhihao | The Chinese University of Hong Kong |
| Dong, Zhipeng | The Chinese University of Hong Kong |
| Teng, Tao | The Chinese University of Hong Kong & Hong Kong Centre For Logistics Robotics |
| Ou, Yongsheng | Dalian University of Technology |
| Caldwell, Darwin G. | Istituto Italiano di Tecnologia |
| Chen, Fei | T-Stone Robotics Institute, The Chinese University of Hong Kong |
|
|
| |
| TuAT4 Regular Session, Strauss 1-2 |
Add to My Program |
| Soft Robotics |
|
| |
| |
| 11:00-11:10, Paper TuAT4.1 | Add to My Program |
| Contact Detection and Manipulation with a Shape-Memory Alloy Based Soft Gripper |
|
| Plottel, Louis | Carnegie Mellon University |
| Desatnik, Richard | Carnegie Mellon University |
| Patel, Dinesh K. | Carnegie Mellon University |
| LeDuc, Philip | Carnegie Mellon University |
| Majidi, Carmel | Carnegie Mellon University |
Keywords: Soft Robot Applications, Dexterous Manipulation, Soft Sensors and Actuators
Abstract: Soft robotics offers the opportunity to create dexterous machines that can safely handle delicate objects. Grippers made from deformable actuators and compliant materials can deform around the objects with which they come in contact. The continuum mechanics of flexible manipulators can be leveraged for safe manipulation tasks such as twisting and grasping during manufacturing. However, to achieve this goal, contact sensing and controls for manipulators in these soft systems still remain a challenge in the field. This letter demonstrates a shape-memory alloy actuated soft gripper, with each finger able to bend about multiple axes. This enables the soft gripper to perform twisting tasks and handle diverse and fragile objects. Using capacitive bend sensors, we also demonstrate that the measured impedance of motion can be used as a proxy for contact, greatly increasing performance in a delicate manipulation task.
|
| |
| 11:10-11:20, Paper TuAT4.2 | Add to My Program |
| Architected Vacuum Driven Origami Structures Via Direct Ink Writing of RTV Silicone |
|
| Wang, Qiyao | Tsinghua University |
| Stalin, Thileepan | Singapore University of Technology and Design |
| Valdivia y Alvarado, Pablo | Singapore University of Technology and Design, MIT |
Keywords: Soft Robot Materials and Design, Additive Manufacturing, Hydraulic/Pneumatic Actuators
Abstract: Recent advances in soft robotics, wearable devices, and deployable systems have sparked tremendous interest in origami structures due to their controllable volume changes and shape-morphing capabilities. Despite significant progress in the design and fabrication of origami using traditional materials such as paper, textiles, thermoplastics, and thick panels, challenges persist in creating soft elastomeric origami designs that allow for precise, programmable deformations. This work proposes an architected approach for designing and 3D printing Room Temperature Vulcanization (RTV) silicone-based origami structures actuated by negative pressure. Central to this approach is a flexible hinge design, which enables controlled bending angles ranging from 45° to 90° upon the application of vacuum actuation. This architected method simplifies the complex folding of origami structures by strategically arranging the flexible hinges. A Python-based tool was developed to generate G-code directly from user-defined design parameters, streamlining the design-to-fabrication pipeline for Direct Ink Writing (DIW) RTV silicone-based origami parts. Initial fabrication experiments were conducted using a three-step print-assemble-bond approach. As an alternative that eliminates the manual processing steps, a monolithic flexible hinge with a cavity was printed within a gel support. This paper introduces a hinge design library and discusses the design-to-fabrication workflow for two origami-inspired active structures.
|
| |
| 11:20-11:30, Paper TuAT4.3 | Add to My Program |
| High-Velocity, Pressure-Driven Eversion for Rapid Vine Robots |
|
| Alvarez, Anna | University of California Santa Barbara |
| Seawright, Anders | University of California Santa Barbara |
| Tripathi, Neel | University of California, Santa Barbara |
| Deng, Selena | University of California Santa Barbara |
| Cruz, Carlos | University of California Santa Barbara |
| Hawkes, Elliot Wright | University of California, Santa Barbara |
Keywords: Soft Robot Materials and Design, Hydraulic/Pneumatic Actuators, Compliant Joints and Mechanisms
Abstract: “Vine robots” are thin-walled, tubular, pneumatic soft robots that lengthen at their tips to navigate constrained and complex environments. Previous studies have already explored the mechanics of vine robot bodies and investigated applications for which the device is well-suited. However, these studies almost exclusively focus on eversion rates in the quasi-static regime, overlooking other potential applications of high-speed vine robots in medical devices, projectile launchers, or for informing biology. To better understand this rapid behavior, we present a dynamic growth model for high-velocity vine robot body extension with a payload mass and verify the model experimentally. To the best of the authors’ knowledge, this is the first instance of vine robots utilized for projectile launching. We find three key results: i) vine robot bodies experience rate-dependent damping that is scale-dependent and monotonically increases with increasing wall thickness; ii) steady-state velocity, or the upper limit of speed in terms of growth velocity, monotonically increases with isometric scaling; and iii) efficiency increases non-linearly with decreasing wall thickness. These findings are used to inform the preliminary design of a large-scale, drug delivery device proof-of-concept, as well as to design the fastest-on-record vine robot, capable of 60 m/s eversion. Our work provides a basic understanding of the dynamic movement of vine robots and opens the door to new areas of application.
|
| |
| 11:30-11:40, Paper TuAT4.4 | Add to My Program |
| Adaptive-Twist Soft Finger Mechanism for Grasping by Wrapping |
|
| Ishikawa, Hiroki | The University of Tokyo |
| Ishibashi, Kyosuke | The University of Tokyo |
| Yamamoto, Ko | University of Tokyo |
Keywords: Soft Robot Materials and Design, Soft Robot Applications, Compliant Joints and Mechanisms
Abstract: This paper presents a soft robot finger capable of adaptive-twist deformation to grasp objects by wrapping them. For a soft hand to grasp and pick up one object from multiple densely packed objects, a soft finger requires the adaptive-twist deformation function in both in-plane and out-of-plane directions. The function allows the finger to be inserted deeply into a limited gap among objects. Once inserted, the soft finger requires appropriate control of the grasping force normal to the contact surface, thereby maintaining the twisted deformation. In this paper, we refer to this type of grasping as grasping by wrapping. To achieve these two functions with a single actuation source, we propose a variable-stiffness mechanism that adaptively increases its stiffness as the pressure rises. We conduct a finite element analysis (FEA) on the proposed mechanism and determine its design parameters based on the FEA results. Using the developed soft finger, we report basic experimental results and demonstrations on grasping various objects.
|
| |
| 11:40-11:50, Paper TuAT4.5 | Add to My Program |
| 3D Printable Crease-Free Origami Vacuum Bending Actuators for Soft Robots |
|
| Wang, Zhanwei | Vrije Universiteit Brussel |
| Huaijin, Chen | VUB |
| Zaidi, Syeda Shadab Zehra | Scuola Superiore Sant'Anna |
| Roels, Ellen | Vrije Universiteit Brussel |
| Cools, Hendrik | Vrije Universiteit Brussel (VUB) |
| Vanderborght, Bram | Vrije Universiteit Brussel |
| Terryn, Seppe | Vrije Universiteit Brussel (VUB) |
Keywords: Hydraulic/Pneumatic Actuators, Soft Sensors and Actuators, Grippers and Other End-Effectors, Soft Robot Materials and Design
Abstract: While vacuum-based bending actuation offers benefits such as safety and compactness in soft robotics, it is often overlooked due to its limited actuation pressure, which restricts both bending angle and force output. This study presents a crease-free, origami-inspired vacuum bending actuator that advances both state-of-the-art vacuum bending actuators and traditional origami deformation principles by introducing orderly self-folding through optimized stiffness distribution. Achieved through the finite element method (FEM), this design provides several advantages: (i) Self-folding allows for high bending angles (up to 138°) in a very compact form. (ii) The crease-free design facilitates 3D printing from a single soft material using a consumer-level fused filament fabrication (FFF) printer, specifically thermoplastic polyurethane (TPU) with a Shore hardness of 60A, offering potentially higher flexibility and durability. (iii) The compact configuration enables modular design, supporting reconfiguration as demonstrated in adaptable locomotion soft robots. (iv) The large bending angles allow the actuator to wrap around objects, offering extensive contact compared to other designs. This capability, c
|
| |
| 11:50-12:00, Paper TuAT4.6 | Add to My Program |
| Soft Omni-Functional Robotic Gripper with a Force-Enhanced Pleated Mechanism for High Force and Multi-DoF Manipulation |
|
| Lee, Sinyoung | Chungang University |
| Kang, Genesung | Yonsei University |
| Kim, Hongmin | Yonsei University |
| Shin, Dongjun | Yonsei University |
Keywords: Soft Robot Applications, Hydraulic/Pneumatic Actuators, Dexterous Manipulation
Abstract: Developing a robotic hand that integrates high fingertip force, rapid response, and multi-degree-of-freedom (DoF) motion, similar to the human hand, remains a challenge in the field of robotic hands. This study presents the Soft-OmniFunctional Robotic Gripper (SOFRo Gripper), designed to achieve all aforementioned characteristics. The finger module of the SOFRo Gripper incorporates synergistically arranged chambers together with a multi-node tendon routing strategy that distributes actuation forces, enabling both flexion and ab/adduction motions while enhancing fingertip force. Furthermore, to maximize fingertip force, the Force-Enhanced Pleated (FEP) mechanism was applied to the chambers, increasing force by 32.41% compared to conventional chamber designs. The proposed SOFRo Gripper achieves a high fingertip force of 68.76 N and dexterous motion capabilities, enabling a maximum lifting force of 400.0 N and in-hand manipulation. To validate its versatility, extensive experiments were conducted, demonstrating the hand's capability to perform a wide range of tasks. As a result, the SOFRo Gripper successfully performed grasping tasks involving various objects, as well as high-force tasks (e.g., lifting heavy objects, closing a valve), delicate tasks (e.g., grasping tofu, inserting a light bulb), and high-speed tasks (e.g., spinning a top, catching a ball). The system demonstrates high force capability and performs a wide range of tasks.
|
| |
| 12:00-12:10, Paper TuAT4.7 | Add to My Program |
| Exploring Interference between Concurrent Skin Stretches (I) |
|
| Cheng, Ching Hei | University of Melbourne |
| Eden, Jonathan | University of Melbourne |
| Oetomo, Denny | The University of Melbourne |
| Tan, Ying | The University of Melbourne |
| |
| 12:10-12:20, Paper TuAT4.8 | Add to My Program |
| Effect of Virtual Mass and Time Delay on the Stability of Haptic Rendering (I) |
|
| Mashayekhi, Ahmad | Vrije Universiteit Brussel |
| Shakeri, Mehdi | University of Regina |
| Khorasani, Amin | Vrije Universiteit Brussel |
| Verstraten, Tom | Vrije Universiteit Brussel |
Keywords: Haptics and Haptic Interfaces, Human Factors and Human-in-the-Loop, Virtual Reality and Interfaces
Abstract: Virtual mass simulation is one of the recent topics in the field of haptic devices (HDs), which can alter the apparent mass of the HD. Simulating negative values of virtual mass leads to a decrease in the apparent effective mass, improving transparency but weakening stability. Positive virtual mass rendering increases the apparent mass, reduces transparency, and enhances stability. This paper analyzes the stability of a haptic device while simulating a virtual environment consisting of a mass, spring, and damper in the presence of a constant time delay. The results are closed-form equations that can predict the stability boundary for small and even large values of virtual damping and time delay. These closed-form equations demonstrate that the maximum renderable virtual mass is twice the physical mass of the HD, and the minimum value equals its negative; both occur in the case of zero time delay. Increasing the time delay reduces both the minimum and maximum values of the renderable virtual mass. The study also shows that using virtual mass can improve the maximum value of a renderable virtual spring. The equations show that, in the absence of delay, properly tuning the virtual mass and virtual damping can enlarge the maximum renderable stiffness by up to 5.8 times in theory. In the experiments under time delay, the stiffness increased by a factor of 3.5, compared to the theoretical prediction of 4.1 times. The results further reveal situations where a nonzero minimum stiffness
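The zero-delay bounds stated in this abstract admit a direct numeric check. The sketch below merely encodes the stated result (renderable virtual mass between minus and plus twice the device's physical mass at zero delay); the function names are ours, and the paper's delay-dependent closed-form equations are not reproduced here.

```python
def virtual_mass_bounds_zero_delay(physical_mass_kg):
    """Return (min, max) renderable virtual mass at zero time delay,
    per the stability bounds stated in the abstract: the maximum is
    twice the device's physical mass and the minimum is its negative."""
    m_max = 2.0 * physical_mass_kg
    return (-m_max, m_max)

def is_renderable(virtual_mass_kg, physical_mass_kg):
    """Check whether a virtual mass lies inside the zero-delay bounds."""
    lo, hi = virtual_mass_bounds_zero_delay(physical_mass_kg)
    return lo <= virtual_mass_kg <= hi

# Example: a 0.5 kg device can stably render between -1.0 and 1.0 kg.
```

Per the abstract, a nonzero time delay only shrinks this interval, so the zero-delay bounds are the widest case.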
|
| |
| 12:20-12:30, Paper TuAT4.9 | Add to My Program |
| Self-Closing Suction Grippers for Industrial Grasping Via Form-Flexible Design (I) |
|
| Wang, Huijiang | University of Cambridge |
| Kunz, Holger | FORMHAND Automation GmbH |
| Adler, Timon | FORMHAND Automation GmbH |
| Iida, Fumiya | University of Cambridge |
| |
| TuI2I Interactive Session, Hall C |
Add to My Program |
| Interactive Session 2 |
|
| |
| |
| 15:00-16:30, Paper TuI2I.1 | Add to My Program |
| Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive Learning with Dense Labeling |
|
| Yashima, Daichi | Keio University |
| Korekata, Ryosuke | Keio University |
| Sugiura, Komei | Keio University |
Keywords: Deep Learning Methods, Learning Categories and Concepts, Deep Learning for Visual Perception
Abstract: Growing labor shortages are increasing the demand for domestic service robots (DSRs) to assist in various settings. In this study, we develop a DSR that transports everyday objects to specified pieces of furniture based on open-vocabulary instructions. Our approach focuses on retrieving images of target objects and receptacles from pre-collected images of indoor environments. For example, given an instruction "Please get the right red towel hanging on the metal towel rack and put it in the white washing machine on the left," the DSR is expected to carry the red towel to the washing machine based on the retrieved images. This is challenging because the correct images should be retrieved from thousands of collected images, which may include many images of similar towels and appliances. To address this, we propose RelaX-Former, which learns diverse and robust representations from among positive, unlabeled positive, and negative samples. We evaluated RelaX-Former on a dataset containing real-world indoor images and human-annotated instructions including complex referring expressions. The experimental results demonstrate that RelaX-Former outperformed existing baseline models across standard image retrieval metrics. Moreover, we performed physical experiments using a DSR to evaluate the performance of our approach in a zero-shot transfer setting. The experiments required the DSR to carry objects to specific receptacles based on open-vocabulary instructions, achieving an overall success rate of 75%.
|
| |
| 15:00-16:30, Paper TuI2I.2 | Add to My Program |
| Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement |
|
| Katsumata, Kei | Keio University |
| Kambara, Motonari | Keio University |
| Yashima, Daichi | Keio University |
| Korekata, Ryosuke | Keio University |
| Sugiura, Komei | Keio University |
Keywords: Deep Learning Methods, Deep Learning for Visual Perception, Mobile Manipulation
Abstract: We consider the problem of generating mobile manipulation instructions based on a target object image and receptacle image. Conventional image captioning models are not able to generate appropriate instructions because their architectures are typically optimized for single-image inputs. In this study, we propose a model that identifies both the target object and receptacle to generate free-form instruction sentences for mobile manipulation tasks. Furthermore, we introduce a novel training method, the human-centric calibration phase, which combines learning-based automatic evaluation metrics with n-gram-based automatic evaluation metrics. This method enables the model to learn the co-occurrence relationships between words and appropriate paraphrases. The results demonstrate that our proposed method outperforms baseline methods, including representative multimodal large language models, on all automatic evaluation metrics. Moreover, physical experiments reveal that using our method to augment data on language instructions improves the performance of an existing multimodal language understanding model for mobile manipulation.
|
| |
| 15:00-16:30, Paper TuI2I.3 | Add to My Program |
| RAMPA: Robotic Augmented Reality for Machine Programming by DemonstrAtion |
|
| Dogangun, Fatih | Bogazici University |
| Bahar, Serdar | Bogazici University |
| Yildirim, Yigit | Bogazici University |
| Temir, Bora Toprak | Bogazici University |
| Ugur, Emre | Bogazici University |
| Dogan, Mustafa Doga | Adobe Research |
Keywords: Virtual Reality and Interfaces, Learning from Demonstration
Abstract: This paper introduces Robotic Augmented Reality for Machine Programming by Demonstration (RAMPA), the first ML-integrated, XR-driven end-to-end robotic system, allowing training and deployment of ML models such as ProMPs on the fly, and utilizing the capabilities of state-of-the-art and commercially available AR headsets, e.g., Meta Quest 3, to facilitate the application of Programming by Demonstration (PbD) approaches on industrial robotic arms, e.g., Universal Robots UR10. Our approach enables in-situ data recording, visualization, and fine-tuning of skill demonstrations directly within the user’s physical environment. RAMPA addresses critical challenges of PbD, such as safety concerns, programming barriers, and the inefficiency of collecting demonstrations on the actual hardware. The performance of our system is evaluated against the traditional method of kinesthetic control in teaching three different robotic manipulation tasks and analyzed with quantitative metrics, measuring task performance and completion time, trajectory smoothness, system usability, user experience, and task load using standardized surveys. Our findings indicate a substantial advancement in how robotic tasks are taught and refined, promising improvements in operational safety, efficiency, and user engagement in robotic programming.
|
| |
| 15:00-16:30, Paper TuI2I.4 | Add to My Program |
| Waliner: Lightweight and Resilient Plugin Mapping Method with Wall Features for Visually Challenging Indoor Environments |
|
| Noh, DongKi | LG Electronics Inc |
| Lee, Byunguk | Kumoh National Institute of Technology |
| Kim, Hanngyoo | Kumoh National Institute of Technology |
| Lee, Seung-Hwan | Kumoh National Institute of Technology |
| Kim, HyunSung | LG Electronics Inc |
| Kim, Juwon | Dept. Electr. and Comput. Eng., Inha University, South Korea |
| Choi, Jeong-Sik | LG Electronics Inc |
| Baek, Seung-Min | LG Electronics Inc |
Keywords: Field Robots, Embedded Systems for Robotic and Automation, Mapping
Abstract: Vision-based indoor navigation systems have been proposed previously for service robots. However, in real-world scenarios, many of these approaches remain vulnerable to visually challenging environments such as white walls. In-home service robots, which are mass-produced, require affordable sensors and processors. Therefore, this paper presents a lightweight and resilient plugin mapping method called Waliner, using an RGB-D sensor and an embedded processor equipped with a neural processing unit (NPU). Waliner can be easily implemented in existing algorithms and enhances the accuracy and robustness of 2D/3D mapping in visually challenging environments with minimal computational overhead by leveraging a) structural building components, such as walls; b) the Manhattan world assumption; and c) an extended Kalman filter-based pose estimation and map management technique to maintain reliable mapping performance under varying lighting and featureless conditions. As verified in various real-world in-home scenes, the proposed method yields over a 5% improvement in mapping consistency as measured by the map similarity index (MSI) while using minimal resources.
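As a minimal illustration of the Manhattan-world assumption the abstract leverages, an estimated wall direction can be regularized by snapping it to the nearest axis multiple of 90°. This is a generic sketch under our own naming (`snap_to_manhattan` is a hypothetical helper, not part of Waliner):

```python
import math

def snap_to_manhattan(theta_rad):
    """Snap a wall direction (radians) to the nearest Manhattan axis,
    i.e. the nearest integer multiple of 90 degrees. Useful for
    regularizing noisy wall-orientation estimates in indoor maps."""
    quarter = math.pi / 2.0
    return round(theta_rad / quarter) * quarter
```

In a wall-feature pipeline, snapping candidate wall segments this way keeps the 2D map axis-aligned even when individual line fits drift under poor lighting.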
|
| |
| 15:00-16:30, Paper TuI2I.5 | Add to My Program |
| Semi-Autonomous Teleoperation Using Differential Flatness of a Crane Robot for Aircraft In-Wing Inspection |
|
| Marquette, Wade | University of Washington |
| Schultz, Kyle | University of Washington |
| Jonnalagadda, Vamsi | University of Washington |
| Wong, Benjamin | University of Washington |
| Garbini, Joseph | U. of Washington |
| Devasia, Santosh | University of Washington |
Keywords: Telerobotics and Teleoperation, Assembly
Abstract: Visual inspection of confined spaces such as aircraft wings is ergonomically challenging for human mechanics. This work presents a novel crane robot that can travel the entire span of the aircraft wing, enabling mechanics to perform inspection from outside of the confined space. However, teleoperation of the crane robot can still be a challenge due to the need to avoid obstacles in the workspace and potential oscillations of the camera payload. The main contribution of this work is to exploit the differential flatness of the crane-robot dynamics for designing reduced-oscillation, collision-free time trajectories of the camera payload for use in teleoperation. Autonomous experiments verify the efficacy of the approach, reducing undesired oscillations by 89%. Furthermore, teleoperation experiments with 12 participants performing an inspection task show that the controller eliminated collisions (from 33% to 0%) when the proposed trajectory selection was used, compared to the case without it. Moreover, even discounting the failures due to collisions, the proposed approach improved task efficiency by 18.7% compared to the case without it.
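Differential flatness of a crane-pendulum system means the payload trajectory alone determines the required trolley motion. The sketch below uses the textbook linearized crane model (a generic illustration under our own assumptions, not the paper's full dynamics or controller):

```python
G = 9.81  # gravitational acceleration, m/s^2

def trolley_from_payload(payload_pos, payload_acc, cable_len):
    """Linearized crane pendulum x_p'' = (g/L) * (x_t - x_p), inverted
    using the payload position as the flat output:
        x_t = x_p + (L / g) * x_p''
    Given a desired payload position and acceleration, this returns the
    trolley position that realizes it (small-angle assumption)."""
    return payload_pos + (cable_len / G) * payload_acc
```

Choosing a smooth payload trajectory with acceleration that vanishes at the endpoints then yields a trolley command that leaves no residual payload swing at rest, which is the mechanism behind the oscillation reduction reported in the abstract.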
|
| |
| 15:00-16:30, Paper TuI2I.6 | Add to My Program |
| EMPERROR: A Flexible Generative Perception Error Model for Probing Self-Driving Planners |
|
| Hanselmann, Niklas | Mercedes-Benz AG R&D, University of Tübingen |
| Doll, Simon | Mercedes-Benz AG, University of Tübingen |
| Cordts, Marius | Mercedes-Benz AG |
| Lensch, Hendrik Peter Asmus | University of Tübingen |
| Geiger, Andreas | University of Tübingen |
Keywords: Deep Learning Methods, Object Detection, Segmentation and Categorization, Autonomous Agents
Abstract: To handle the complexities of real-world traffic, learning planners for self-driving from data is a promising direction. While recent approaches have shown great progress, they typically assume a setting in which the ground-truth world state is available as input. However, when deployed, planning needs to be robust to the long-tail of errors incurred by a noisy perception system, which is often neglected in evaluation. To address this, previous work has proposed drawing adversarial samples from a perception error model (PEM) mimicking the noise characteristics of a target object detector. However, these methods use simple PEMs that fail to accurately capture all failure modes of detection. In this paper, we present EMPERROR, a novel transformer-based generative PEM, apply it to stress-test an imitation learning (IL)-based planner and show that it imitates modern detectors more faithfully than previous work. Furthermore, it is able to produce realistic noisy inputs that increase the planner’s collision rate by up to 85%, demonstrating its utility as a valuable tool for a more complete evaluation of self-driving planners.
|
| |
| 15:00-16:30, Paper TuI2I.7 | Add to My Program |
| 3D Robotic Control of Micro-Scale Optical Swarms at an Interface |
|
| Carlisle, Nicholas | University of Canterbury |
| Nock, Volker | University of Canterbury |
| Williams, Martin | Massey University |
| Whitby, Catherine | Massey University |
| Chen, Jack L Y | AUT |
| Avci, Ebubekir | Massey University |
Keywords: Swarm Robotics, Micro/Nano Robots, Automation at Micro-Nano Scales
Abstract: Optical force-induced assembly is a promising yet scarcely explored approach for developing functional tools and objects at the microscale, with a wide range of potential applications. Our previous work was the first to investigate the manipulation of these assemblies in the XY plane. Here, we expand on these techniques by systematically exploring optical trap manipulation with the addition of Z-axis control. Manipulation of the Z-axis is referred to as axial displacement and is a viable approach for actively manipulating the assembly morphology. Experiments are conducted for the first time to explore and detail the response of the assembly during active 3D trap manipulation, informing the development of an autonomous control algorithm over the 2D area of the assembly during motion. This control presents techniques to increase assembly stability or alter the area of the assembly for tasks such as passing through constrictions. This work aims to develop the control techniques required to create a unique micromanufacturing approach inspired by the Kilobot thousand robot swarm.
|
| |
| 15:00-16:30, Paper TuI2I.8 | Add to My Program |
| Structured Pruning for Efficient Visual Place Recognition |
|
| Grainge, Oliver Edward | University of Southampton |
| Milford, Michael J | Queensland University of Technology |
| Bodala, Indu | University of Southampton |
| Ramchurn, Sarvapali | University of Southampton |
| Ehsan, Shoaib | University of Essex |
Keywords: Deep Learning for Visual Perception, Recognition, Localization
Abstract: Visual Place Recognition (VPR) is fundamental for the global re-localization of robots and devices, enabling them to recognize previously visited locations based on visual inputs. This capability is crucial for maintaining accurate mapping and localization over large areas. Given that VPR methods need to operate in real-time on embedded systems, it is critical to optimize these systems for minimal resource consumption. While the most efficient VPR approaches employ standard convolutional backbones with fixed descriptor dimensions, these often lead to redundancy in the embedding space as well as in the network architecture. Our work introduces a novel structured pruning method, to not only streamline common VPR architectures but also to strategically remove redundancies within the feature embedding space. This dual focus significantly enhances the efficiency of the system, reducing both map and model memory requirements and decreasing feature extraction and retrieval latencies. Our approach has reduced memory usage and latency by 21% and 16%, respectively, across models, while minimally impacting recall@1 accuracy by less than 1%. This significant improvement enhances real-time applications on edge devices with negligible accuracy loss.
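Structured pruning removes whole channels rather than individual weights. The abstract does not specify the saliency criterion used, so the sketch below uses a common baseline (L1-norm ranking of each output channel's weights) purely for illustration; the function names are ours:

```python
def l1_norm(filter_weights):
    """Sum of absolute weights of one output channel's filter."""
    return sum(abs(w) for w in filter_weights)

def select_channels(filters, keep_ratio):
    """Rank output channels by L1 norm and keep the top fraction.
    `filters` is a list of flat weight lists, one per output channel;
    returns the sorted indices of channels to keep. This is a standard
    magnitude-based saliency baseline, not necessarily the criterion
    used in the paper."""
    scores = [(l1_norm(f), i) for i, f in enumerate(filters)]
    scores.sort(reverse=True)
    n_keep = max(1, round(keep_ratio * len(filters)))
    return sorted(i for _, i in scores[:n_keep])
```

Applying the same idea to the final embedding layer is what shrinks the descriptor dimension, which in turn reduces both the map size and the retrieval latency discussed above.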
|
| |
| 15:00-16:30, Paper TuI2I.9 | Add to My Program |
| FT-CPG: Learning Central Pattern Generators for Fault-Tolerant Quadruped Locomotion under Multi-Joint Failures |
|
| Zhang, Pei | Northeastern University (China) |
| Hua, Zhaobo | Northeastern University (China) |
| Qiu, Qiyu | Northeastern University (China) |
| Ding, Jinliang | Northeastern University (China) |
Keywords: Legged Robots, Reinforcement Learning, AI-Based Methods
Abstract: Quadruped robots used for rescue and exploration are susceptible to various leg failures, where unpredictable joint locking or power loss can pose an immediate risk of falling. Traditional controllers lack fault-tolerant control capabilities in the case of multi-joint concurrent faults, and erroneous controller outputs may lead to robot damage. This paper proposes a model-free reinforcement learning framework based on central pattern generators (CPG) for fault-tolerant control (FT-CPG). The framework uses biomimetic gait generation and section-wise training to address various types of multi-joint concurrent faults. FT-CPG adopts a fault-tolerant CPG module to generate safe gaits, while utilizing neural network-based policies to infer failures and coordinate the rhythmic behaviors of the CPG, ensuring the ability to track velocity commands under fault conditions. Experiments show that FT-CPG is robust in unexpected situations, where a single leg experiences failures across any number of joints, with each joint randomly encountering locking or power loss faults. Furthermore, the proposed framework preserves the robot's omnidirectional mobility. Finally, zero-shot sim-to-real transfer was successfully implemented on the real-world Unitree Go1 robot, effectively addressing various multi-joint leg failures.
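A common building block for CPG-based gait generation is the Hopf oscillator, whose stable limit cycle provides the rhythmic signal that a learned policy can modulate. The sketch below is a generic illustration (the paper's exact CPG formulation is not given in the abstract):

```python
import math

def hopf_step(x, y, mu, omega, dt):
    """One explicit-Euler step of a Hopf oscillator:
        dx/dt = (mu - r^2) * x - omega * y
        dy/dt = (mu - r^2) * y + omega * x,  r^2 = x^2 + y^2
    The state converges to a limit cycle of radius sqrt(mu) rotating at
    angular frequency omega, giving a smooth rhythmic joint signal."""
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y
    dy = (mu - r2) * y + omega * x
    return x + dt * dx, y + dt * dy
```

Coupling one such oscillator per leg with fixed phase offsets produces a gait pattern; in a framework like the one described above, a policy network would adjust parameters such as `mu` and `omega` online to compensate for locked or unpowered joints.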
|
| |
| 15:00-16:30, Paper TuI2I.10 | Add to My Program |
| EKF-Based Radar-Inertial Odometry with Online Temporal Calibration |
|
| Kim, Changseung | Ulsan National Institute of Science and Technology |
| Bae, Geunsik | Ulsan National Institute of Science and Technology |
| Shin, Woojae | Korea Advanced Institute of Science and Technology |
| Wang, Sen | Imperial College London |
| Oh, Hyondong | KAIST |
Keywords: Sensor Fusion, Localization
Abstract: Accurate time synchronization between heterogeneous sensors is crucial for ensuring robust state estimation in multi-sensor fusion systems. Sensor delays often cause discrepancies between the actual time when the event was captured and the time of sensor measurement, leading to temporal misalignment (time offset) between sensor measurement streams. In this paper, we propose an extended Kalman filter (EKF)-based radar-inertial odometry (RIO) framework that estimates the time offset online. The radar ego-velocity measurement model, derived from a single radar scan, is formulated to incorporate the time offset into the update. By leveraging temporal calibration, the proposed RIO enables accurate propagation and measurement updates based on a common time stream. Experiments on both simulated and real-world datasets demonstrate the accurate time offset estimation of the proposed method and its impact on RIO performance, validating the importance of sensor time synchronization. Our implementation of the EKF-RIO with online temporal calibration is available at https://github.com/spearwin/EKF-RIO-TC.
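The key modeling idea, folding the unknown time offset into the radar ego-velocity update, can be sketched with a first-order correction. This linearization is a common choice for online temporal calibration and is our simplification, not the exact model in the paper or its repository:

```python
def radar_velocity_residual(z_radar, v_est, a_est, t_offset):
    """EKF measurement residual with an online time offset: a radar
    ego-velocity measurement stamped at time t is modeled as the body
    velocity at t + t_offset, expanded to first order as
        v(t + t_offset) ~= v(t) + a(t) * t_offset.
    The Jacobian of the prediction w.r.t. t_offset is a(t), which is
    what lets the filter estimate the offset as an extra state."""
    predicted = [v + a * t_offset for v, a in zip(v_est, a_est)]
    return [z - p for z, p in zip(z_radar, predicted)]
```

When the estimated offset converges, propagation and update operate on a common time stream, which is the effect the abstract credits for the improved odometry accuracy.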
|
| |
| 15:00-16:30, Paper TuI2I.11 | Add to My Program |
| Nonlinear Model Predictive Control for Robotic Pushing of Planar Objects with Generic Shape |
|
| Federico, Sara | Università Degli Studi Della Campania Luigi Vanvitelli |
| Costanzo, Marco | Università Degli Studi Della Campania "Luigi Vanvitelli" |
| De Simone, Marco | Università Degli Studi Della Campania Luigi Vanvitelli |
| Natale, Ciro | Università Degli Studi Della Campania "Luigi Vanvitelli" |
Keywords: Contact Modeling, Dexterous Manipulation, Optimization and Optimal Control
Abstract: Robotic manipulation of objects in cluttered dynamic scenes is challenging for a twofold reason: object detection and localization are complex due to partial occlusions and high variability in the object classes, and manipulation in tight spaces is difficult due to potential collisions. The present letter focuses on the low-level control of the non-prehensile pushing action aimed at moving planar objects of generic shape along a given path with an assigned time law. Based on the continuous and nonlinear dynamics of the system, we propose a nonlinear model predictive controller (NMPC), which avoids the need for linearization and, thus, the hybrid dynamics arising from it. An extensive comparison with a state-of-the-art linear MPC demonstrates that the NMPC can successfully react to more general disturbances, outperforming the linear one. Experimental results confirm the effectiveness of the method in a task where a robot is required to grasp fruits in a container with other obstructing objects (shown in the attached video).
|
| |
| 15:00-16:30, Paper TuI2I.12 | Add to My Program |
| Robust and Error-Tolerant Peg-In-Hole Assembly Using Simple Control |
|
| Ueda, Masanori | Kanazawa University |
| Tsuji, Tokuo | Kanazawa University |
| Ishikawa, Shota | DENSO CORPORATION |
| Hiramitsu, Tatsuhiro | Kanazawa University |
| Seki, Hiroaki | Kanazawa University |
| Suzuki, Yosuke | Kanazawa University |
| Nishimura, Toshihiro | Kanazawa University |
| Watanabe, Tetsuyou | Kanazawa University |
Keywords: Compliant Joints and Mechanisms, Manipulation Planning, Industrial Robots
Abstract: We developed a simple peg-in-hole strategy that uses flexible joints and peg rotations. Even when the circular peg in the peg-in-hole assembly contains position and orientation errors, it can be inserted in a passive and robust manner. Additionally, using force-torque sensors to estimate the contact position allows the correction of the orientation of the peg and its insertion into the hole if the initial attempt fails. We conducted horizontal and vertical peg-in-hole experiments with random position and orientation errors to demonstrate the effectiveness of the developed method. This method does not rely on high-frequency sensors or servos, which enables quick and low-cost peg-in-hole assembly with tolerance to errors in position, orientation, and direction.
|
| |
| 15:00-16:30, Paper TuI2I.13 | Add to My Program |
| Automatic Lighthouse Calibration Using Conics for Indoor Robot Localization |
|
| Alvarado-Marin, Said | INRIA |
| Abadie, Alexandre | Inria |
| Balbi, Martina | INRIA |
| Watteyne, Thomas | Inria |
| Maksimovic, Filip | INRIA |
Keywords: Localization, Wheeled Robots, Computer Vision for Automation
Abstract: In this letter, we propose a technique for calibrating Lighthouse localization systems using a single view of two or more coplanar circles traced by a moving robot. The calibration method leverages conic algebra to compute the homography between the Lighthouse view and the world plane, up to similarity. This approach requires minimal user intervention and is particularly suited for automatically calibrating large-scale deployments involving hundreds of mobile robots. We validate our method using a centimeter-scale differential-drive robot, utilizing 5 cm circles to calibrate a 2x2 m^2 area. The proposed technique achieved a mean positional accuracy of 7.77 mm, compared to the 5.37 mm accuracy of a previous calibration method based on manual measurements and known correspondences. We demonstrate that the conics traced by the robot are accurate enough for reliable homography estimation, even under varying conditions of tire material and surface type. A camera-based motion capture system served as the ground truth for all experiments. This work represents a step toward scalable and decentralized Lighthouse calibration, enabling efficient 2D localization in large-scale robotic systems.
|
| |
| 15:00-16:30, Paper TuI2I.14 | Add to My Program |
| GazeScope: A Framework of Gaze Attention-Based Automatic Field-Of-View Adjustment for Laparoscopic Robots |
|
| Zhang, Jing | Shenzhen Campus of Sun Yat-Sen University |
| Wang, Baichuan | Shenzhen Campus of Sun Yat-sen University |
| Pan, Zhijie | Shenzhen Campus of Sun Yat-sen University |
| Li, Mengtang | Shenzhen Campus of Sun Yat-sen University |
| |
| 15:00-16:30, Paper TuI2I.15 | Add to My Program |
| H-MaP: An Iterative and Hybrid Sequential Manipulation Planner |
|
| Cicek, Berk | Bilkent University |
| Yenicesu, Arda Sarp | Bilkent University |
| Tuncer, Cankut Bora | Bilkent University |
| Demiray, Kutay | Bilkent University |
| Oguz, Ozgur S. | Bilkent University |
Keywords: Constrained Motion Planning, Manipulation Planning, Motion and Path Planning
Abstract: This paper introduces H-MaP, a hybrid sequential manipulation planner that addresses complex tasks requiring both sequential actions and dynamic contact mode switches. Our approach reduces configuration space dimensionality by decoupling object trajectory planning from manipulation planning through object-based waypoint generation, informed contact sampling, and optimization-based motion planning. This architecture enables handling of challenging scenarios involving tool use, auxiliary object manipulation, and bimanual coordination. Experimental results across seven diverse tasks demonstrate H-MaP's superior performance compared to existing methods, particularly in highly constrained environments where traditional approaches fail due to local minima or scalability issues. The planner's effectiveness is validated through both simulation and real-robot experiments. https://sites.google.com/view/h-map/
|
| |
| 15:00-16:30, Paper TuI2I.16 | Add to My Program |
| Hydrodynamics Regularization in Reinforcement Learning for Navigating Crowded Scenarios |
|
| Pingrui, Lai | Shanghai Jiaotong University |
| Renjie, Pan | Shanghai Jiao Tong University |
| Jiaqi, Yu | Shanghai Jiaotong University |
| Hua, Yang | Shanghai Jiaotong University |
|
|
| |
| 15:00-16:30, Paper TuI2I.17 | Add to My Program |
| The Empirical Turn in Robot Ethics: Reconciling Theoretical Thought Experiments with Practical Reality |
|
| Weng, Yueh-Hsuan | Kyushu University |
| Torabi, David | Kyushu University |
| Torresen, Jim | University of Oslo |
| Dong, Zonghao | Tohoku University |
| Hirata, Yasuhisa | Tohoku University |
Keywords: Social HRI, Ethics and Philosophy, AI-Enabled Robotics
Abstract: The inclusion of robots in daily life presents significant ethical, legal, and social implications (ELSI) that stem from their interactions with humans. Social robots are able to operate in environments that are rich in cultural norms, emotions, and social cues, leading to critical questions about privacy, trust, and safety. We explore the ways in which the interdisciplinary field of robot ethics can tackle these challenges using a hybrid methodological approach that incorporates thought experiments and empirical research. Ethical dilemmas can be systematically analyzed using thought experiments, and empirical methods can provide real-world insights to validate and refine these theoretical frameworks. The paper emphasizes the use of living labs as dynamic environments for testing and integrating ethical design principles into robot design, ensuring that robots comply with ethical expectations and legal standards.
|
| |
| 15:00-16:30, Paper TuI2I.18 | Add to My Program |
| Jamming Metal Sheets Using Electropermanent Magnets for Stiffness Modulation |
|
| Gaeta, Leah T. | Boston University |
| Vo, Vi | Boston University |
| Lee, Sang-Yoep | Boston University |
| Raste, Srushti | Wesleyan University |
| Venkatesam, Megha | Boston University |
| Rogatinsky, Jacob | Boston University |
| Albayrak, Deniz | Boston University |
| Ranzani, Tommaso | Boston University |
Keywords: Soft Robot Materials and Design, Soft Robot Applications
Abstract: Soft robots exhibit natural compliance which is desirable in many applications, but often require stiffness modulation techniques when more rigidity is needed. However, many existing stiffening techniques lack portability or fast response times, hindering the ubiquitous adoption of soft robots. Here we introduce a new instantaneous stiffness modulation method based on magnetism that exhibits portability due to electronic control. This technique jams together thin layers of inherently magnetic metal sheets with a magnetic field generated by electropermanent magnets (EPMs), producing rapid stiffness changes. Quasi-static and dynamic mechanical characterizations for samples with varied layer numbers are presented, highlighting how the magnetic attraction generated by EPMs can be exploited to create a jamming effect. Stiffness increases of up to 68% and energy absorptions of up to 113 mJ were found during quasi-static and dynamic characterizations, respectively. Finally, we demonstrate how this jamming technique can be used in a haptic feedback application and to play a miniaturized version of the game of Skee-Ball.
|
| |
| 15:00-16:30, Paper TuI2I.19 | Add to My Program |
| Chance-Constrained Convex MPC for Robust Quadruped Locomotion under Parametric and Additive Uncertainties |
|
| Trivedi, Ananya | Northeastern University |
| Prajapati, Sarvesh | Northeastern University |
| Zolotas, Mark | Toyota Research Institute |
| Everett, Michael | Northeastern University |
| Padir, Taskin | Northeastern University |
Keywords: Robust/Adaptive Control, Planning under Uncertainty, Legged Robots
Abstract: Recent advances in quadrupedal locomotion have focused on improving stability and performance across diverse environments. However, existing methods often lack adequate safety analysis and struggle to adapt to varying payloads and complex terrains, typically requiring extensive tuning. To overcome these challenges, we propose a Chance-Constrained Model Predictive Control (CCMPC) framework that explicitly models payload and terrain variability as distributions of parametric and additive disturbances within the single rigid body dynamics model. Our approach ensures safe and consistent performance under uncertain dynamics by expressing the model’s friction cone constraints, which define the feasible set of ground reaction forces, as chance constraints. Moreover, we solve the resulting stochastic control problem using a computationally efficient quadratic programming formulation. Extensive Monte Carlo simulations of quadrupedal locomotion across varying payloads and complex terrains demonstrate that CCMPC significantly outperforms two competitive benchmarks, Linear MPC and MPC with hand-tuned safety margins, in maintaining stability, reducing foot slippage, and tracking the center of mass. Hardware experiments on the Unitree Go1 robot show successful locomotion across various indoor and outdoor terrains with unknown loads exceeding 50% of the robot’s body weight, despite no additional parameter tuning.
|
| |
| 15:00-16:30, Paper TuI2I.20 | Add to My Program |
| AirIO: Learning Inertial Odometry with Enhanced IMU Feature Observability |
|
| Qiu, Yuheng | Carnegie Mellon University |
| Xu, Can | University of Toronto |
| Chen, Yutian | Carnegie Mellon University |
| Zhao, Shibo | Carnegie Mellon University |
| Geng, Junyi | Pennsylvania State University |
| Scherer, Sebastian | Carnegie Mellon University |
Keywords: Aerial Systems: Perception and Autonomy, Deep Learning Methods, Localization
Abstract: Inertial odometry (IO) using only Inertial Measurement Units (IMUs) offers a lightweight and cost-effective solution for Unmanned Aerial Vehicle (UAV) applications, yet existing learning-based IO models often fail to generalize to UAVs due to the highly dynamic and non-linear flight patterns that differ from pedestrian motion. In this work, we identify that the conventional practice of transforming raw IMU data to global coordinates undermines the observability of critical kinematic information in UAVs. By preserving the body-frame representation, our method achieves substantial performance improvements, with a 66.7% average increase in accuracy across three datasets. Furthermore, explicitly encoding attitude information into the motion network results in an additional 23.8% improvement over prior results. Combined with a data-driven IMU correction model (AirIMU) and an uncertainty-aware Extended Kalman Filter (EKF), our approach ensures robust state estimation under aggressive UAV maneuvers without relying on external sensors or control inputs. Notably, our method also demonstrates strong generalizability to unseen data not included in the training set, underscoring its potential for real-world UAV applications.
|
| |
| 15:00-16:30, Paper TuI2I.21 | Add to My Program |
| Enhancing Dual-Loop Pressure Control in Pneumatic Soft Robotics with a Comparison of Evolutionary Algorithms for PID & FOPID Controller Tuning |
|
| Libby, Jacqueline | Stevens Institute of Technology |
| Massoud, Mostafa Mo. | Stevens Institute of Technology |
| Alves, Paulo Henrique Teixeira Franca | Stevens Institute of Technology |
Keywords: Hydraulic/Pneumatic Actuators, Modeling, Control, and Learning for Soft Robots, Hardware-Software Integration in Robotics
Abstract: The control of pneumatic soft robotics is challenging due to nonlinearities arising from many factors, including pneumatic system components and material properties of the soft actuator. Manual methods for PID controller tuning are inadequate for the nonlinear and time-variant dynamics present in soft robotics. Affordable pneumatic components such as on/off valves cause discontinuities in flow rate, introducing nonlinearities and oscillatory fluctuations into the system. This study proposes a dual-loop control system: one for PID and Fractional-Order PID (FOPID) control of a solenoid valve that feeds air into the actuator, and another for PID control of the pump upstream of the valve. The PID and FOPID parameters are optimized using evolutionary algorithms: Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Simulated Annealing (SA). Simulations and real-world experiments are conducted to validate the optimized parameters. Our results demonstrate that the dual-loop hardware configuration reduces fluctuations from the valves compared with a single-loop control scheme. The experimental statistical analysis confirms that FOPID achieves the most significant improvements in rise time (PSO) and peak time (GA, PSO), while PID performs better for overshoot (GA, PSO). These findings highlight the importance of selecting an appropriate optimization algorithm based on the specific control objective, as FOPID does not outperform PID in every metric across all methods.
|
| |
| 15:00-16:30, Paper TuI2I.22 | Add to My Program |
| Safety-Critical and Distributed Nonlinear Predictive Controllers for Teams of Quadrupedal Robots |
|
| Imran, Basit Muhammad | Virginia Tech |
| Kim, Jeeseop | The University of Texas at El Paso |
| Chunawala, Taizoon Aliasgar | Virginia Polytechnic Institute and State University |
| Leonessa, Alexander | Virginia Tech |
| Akbari Hamed, Kaveh | Virginia Tech |
Keywords: Legged Robots, Motion Control, Multi-Contact Whole-Body Motion Planning and Control
Abstract: This paper presents a novel hierarchical, safety-critical control framework that integrates distributed nonlinear model predictive controllers (DNMPCs) with control barrier functions (CBFs) to enable cooperative locomotion of multi-agent quadrupedal robots in complex environments. While NMPC-based methods are widely used to enforce safety constraints and navigate multi-robot systems (MRSs) through complex environments, trajectory optimization frameworks based on invariant sets offer formal safety guarantees for MRSs. CBFs, typically implemented via quadratic programs (QPs) at the planning layer, provide formal safety guarantees. However, their zero-control horizon limits their effectiveness for extended trajectory planning in inherently unstable, underactuated, and nonlinear legged robot models. Furthermore, the integration of CBFs into real-time NMPC for sophisticated MRSs, such as quadrupedal robot teams, remains underexplored. This paper develops computationally efficient, distributed NMPC algorithms that incorporate CBF-based collision safety guarantees within a consensus protocol, enabling longer planning horizons for safe cooperative locomotion under disturbances and rough terrain conditions. The optimal trajectories generated by the DNMPCs are tracked using full-order, nonlinear whole-body controllers at the low level. The proposed approach is validated through extensive numerical simulations with up to four Unitree A1 robots and hardware experiments involving two A1 robots subjected to external pushes, rough terrain, and uncertain obstacle information. Comparative results demonstrate that the proposed CBF-integrated DNMPC achieves a higher success rate than baseline NMPCs employing CBFs at the high- or low-level layers.
|
| |
| 15:00-16:30, Paper TuI2I.23 | Add to My Program |
| OpenIN: Open-Vocabulary Instance-Oriented Navigation in Dynamic Domestic Environments |
|
| Tang, Yujie | Beijing Institute of Technology |
| Wang, Meiling | Beijing Institute of Technology |
| Deng, Yinan | Beijing Institute of Technology |
| Zheng, Zibo | University of Nottingham of Ningbo China |
| Deng, Jingchuan | Beijing Institute of Technology |
| Zuo, Sibo | Beijing Institute of Technology |
| Yue, Yufeng | Beijing Institute of Technology |
Keywords: Service Robotics, Domestic Robotics, Semantic Scene Understanding
Abstract: In daily domestic settings, frequently used objects like cups often have unfixed positions and multiple instances within the same category, and their carriers frequently change as well. As a result, it becomes challenging for a robot to efficiently navigate to a specific instance. To tackle this challenge, the robot must capture and update scene changes and plans continuously. However, current object navigation approaches primarily focus on the semantic level and lack the ability to dynamically update scene representation. In contrast, this paper captures the relationships between frequently used objects and their static carriers. It constructs an open-vocabulary Carrier-Relationship Scene Graph (CRSG) and updates the carrying status during robot navigation to reflect the dynamic changes of the scene. Based on the CRSG, we further propose an instance navigation strategy that models the navigation process as a Markov Decision Process. At each step, decisions are informed by the Large Language Model's commonsense knowledge and visual-language feature similarity. We designed a series of long-sequence navigation tasks for frequently used everyday items in the Habitat simulator. The results demonstrate that by updating the CRSG, the robot can efficiently navigate to moved targets. Additionally, we conducted extensive experiments on a real robot, demonstrating the effectiveness of our method and exploring its limitations. The project page can be found here: https://OpenIN-nav.github.io.
|
| |
| 15:00-16:30, Paper TuI2I.24 | Add to My Program |
| Plantar Compensation Via Dynamic Control of Pneumatic Insoles for Flatfoot Deformity |
|
| Zhang, Bin | ZheJiang University |
| Guo, Yan | Zhejiang University |
| Zhang, Yijun | The First Affiliated Hospital Zhejiang University School of Medicine, Zhejiang University |
| Yi, Jingang | Rutgers University |
| Wang, Binrui | China Jiliang University |
| Liu, Tao | Zhejiang University |
| Li, Wenyang | University of Electro-Communications |
| He, Long | Zhiyuan Research Institute |
Keywords: Prosthetics and Exoskeletons, Rehabilitation Robotics, Medical Robots and Systems
Abstract: Human feet are crucial for supporting body weight and adapting to complex terrains. Adult-acquired flatfoot deformity (AAFD) arises from congenital or acquired causes, impairing the foot's ability to transition between flexible and rigid states, known as the lock-unlock mechanism during the stance and swing phases. In this study, we propose a plantar dynamic support system that utilizes pneumatic airbags, regulated through a model predictive control (MPC) strategy to minimize tracking errors. Experiments were conducted to measure kinetic parameters and electromyography signals, validating the system's efficacy. The results showed improvements in the normalized navicular height truncated (NNHt) index and reductions in muscle activity of the fibularis longus (FL), soleus (SOL), and gastrocnemius (GAST) by 4.42%, 16.65%, and 23.84%, respectively, during the stance phase.
|
| |
| 15:00-16:30, Paper TuI2I.25 | Add to My Program |
| Customize Harmonic Potential Fields Via Hybrid Optimization Over Homotopic Paths |
|
| Wang, Shuaikang | Peking University |
| Guo, Tiecheng | Peking University |
| Guo, Meng | Peking University |
Keywords: Motion and Path Planning, Autonomous Vehicle Navigation, Autonomous Agents
Abstract: Safe navigation within a workspace is a fundamental skill for autonomous robots to accomplish more complex tasks. Harmonic potentials are artificial potential fields that are analytical, globally convergent and provably free of local minima. Thus, they have been widely used for generating safe and reliable robot navigation control policies. However, most existing methods do not allow customization of the harmonic potential fields nor the resulting paths, particularly regarding their topological properties. In this paper, we propose a novel method that automatically finds homotopy classes of paths that can be generated by valid harmonic potential fields. The considered complex workspaces can be as general as forest worlds consisting of numerous overlapping star-obstacles. The method is based on a hybrid optimization algorithm that searches over homotopy classes, selects the structure of each tree-of-stars within the forest, and optimizes over the continuous weight parameters for each purged tree via the projected gradient descent. The key insight is to transform the forest world to the unbounded point world via proper diffeomorphic transformations. It not only facilitates a simpler design of the multi-directional D-signature between non-homotopic paths, but also retains the safety and convergence properties. Extensive simulations and hardware experiments are conducted for non-trivial scenarios, where the navigation potentials are customized for desired homotopic properties.
|
| |
| 15:00-16:30, Paper TuI2I.26 | Add to My Program |
| SEAL: Towards Safe Autonomous Driving Via Skill-Enabled Adversary Learning for Closed-Loop Scenario Generation |
|
| Stoler, Benjamin | Carnegie Mellon University |
| Navarro, Ingrid | Carnegie Mellon University |
| Francis, Jonathan | Bosch Center for Artificial Intelligence |
| Oh, Jean | Carnegie Mellon University |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation, Performance Evaluation and Benchmarking
Abstract: Verification and validation of autonomous driving (AD) systems and components is of increasing importance, as such technology increases in real-world prevalence. Safety-critical scenario generation is a key approach to robustify AD policies through closed-loop training. However, existing approaches for scenario generation rely on simplistic objectives, resulting in overly-aggressive or non-reactive adversarial behaviors. To generate diverse adversarial yet realistic scenarios, we propose SEAL, a scenario perturbation approach which leverages learned objective functions and adversarial, human-like skills. SEAL-perturbed scenarios are more realistic than SOTA baselines, leading to improved ego task success across real-world, in-distribution, and out-of-distribution scenarios, of more than 20%. To facilitate future research, we release our code and tools: https://navars.xyz/seal/
|
| |
| 15:00-16:30, Paper TuI2I.27 | Add to My Program |
| Detection of Texting While Walking in Occluded Environment Using Variational Autoencoder for Safe Mobile Robot Navigation |
|
| Terao, Hayato | The University of Tokyo |
| Wu, Jiaxu | The University of Tokyo |
| An, Qi | The University of Tokyo |
| Yamashita, Atsushi | The University of Tokyo |
Keywords: Robot Safety, Object Detection, Segmentation and Categorization, Autonomous Vehicle Navigation
Abstract: As autonomous mobile robots begin to populate public spaces, it is becoming increasingly important for robots to accurately distinguish pedestrians and navigate safely to avoid collisions. Texting while walking is a common but hazardous behavior among pedestrians that poses significant challenges for robot navigation systems. While several studies have addressed the detection of text walkers, many have overlooked the impact of occlusions, a very common phenomenon where parts of pedestrians are obscured from the sensor’s view. This study proposes a machine learning method that distinguishes text walkers from other pedestrians in video data. The proposed method processes each video frame to extract body keypoints, encodes the keypoints into a latent space, and classifies pedestrian activities into three categories: normal walking, texting while walking, and other activities. A variational autoencoder is incorporated to enhance the system’s robustness under various occlusion scenarios. Performance tests in real-world environments identified potential areas for improvement, particularly in distinguishing pedestrian activities with similar body postures. However, ablation studies demonstrated that the proposed system performs reliably across different occlusion scenarios.
|
| |
| 15:00-16:30, Paper TuI2I.28 | Add to My Program |
| Spatial Coordinate Transformation for 3D Neural Implicit Mapping |
|
| Kang, Kyeongsu | Sungkyunkwan University |
| Ha, Seongbo | Sungkyunkwan University |
| Lee, Sibaek | Sungkyunkwan University (SKKU) |
| Yu, Hyeonwoo | SungKyunKwan University |
Keywords: Mapping, SLAM
Abstract: Implicit Neural Representation (INR)-based SLAM has a critical issue where all keyframes must be stored in memory for post-training whenever a remapping is needed, because the neural network's weights themselves represent the map. To address this, previous INR-based SLAM proposed methods to modify INR-based maps without changing the neural network's weights. However, these approaches suffer from low memory efficiency and increased space complexity. In this paper, we introduce a remapping method for INR-based maps that does not require post-training the neural network's weights and incurs a low space cost. The problem of function modification, such as updating a map defined as a neural network function, can be viewed as transforming the function’s domain. Leveraging function domain transformation, we propose a method to update INR-based maps by identifying the transformation function between the post-optimization and pre-optimization domains. Additionally, to prevent cases where the transformation between the post-optimization and pre-optimization domains does not form a one-to-many relationship, we introduce a temporal domain and propose a method to find the spatial coordinate transformation function accordingly. Evaluations on INR-based techniques demonstrate that our proposed method effectively updates maps while requiring significantly less memory compared to existing remapping approaches.
|
| |
| 15:00-16:30, Paper TuI2I.29 | Add to My Program |
| A Benchmarking Study of Vision-Based Robotic Grasping Algorithms |
|
| Ramesh Babu, Bharath Kumar | Worcester Polytechnic Institute |
| Sreenivasarao Balakrishna, Sumukh | Worcester Polytechnic Institute |
| Flynn, Brian | University of Massachusetts Lowell |
| Kapoor, Vinayak | Worcester Polytechnic Institute |
| Norton, Adam | University of Massachusetts Lowell |
| Yanco, Holly | UMass Amherst |
| Calli, Berk | Worcester Polytechnic Institute |
Keywords: Grasping, Performance Evaluation and Benchmarking, Deep Learning in Grasping and Manipulation
Abstract: We present a benchmarking study of vision-based robotic grasping algorithms with distinct approaches, and provide a comparative analysis. In particular, we compare two machine-learning-based and two analytical algorithms using an existing benchmarking protocol from the literature and determine the algorithms’ strengths and weaknesses under different experimental conditions. These conditions include variations in lighting, background textures, cameras with different noise levels, and grippers. We also run analogous experiments in simulations and with real robots and present the discrepancies. Some experiments are also run in two different laboratories using the same protocols to further analyze the repeatability of our results. We believe that this study, comprising 5040 experiments, provides important insights into the role and challenges of systematic experimentation in robotic manipulation, and guides the development of new algorithms by considering the factors that could impact the performance. The experiment recordings and our benchmarking software are publicly available.
|
| |
| 15:00-16:30, Paper TuI2I.30 | Add to My Program |
| Underactuated Legged-Rolling Locomotion of a Minimal Two-Rod Tensegrity Robot with Self-Recovery Capability |
|
| Zheng, Yanqiu | Tokyo University of Science |
| Xiang, Yuxuan | Japan Advanced Institute of Science and Technology |
| He, Yuetong | Japan Advanced Institute of Science and Technology |
| Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
| Tokuda, Isao T. | Ritsumeikan University |
Keywords: Legged Robots, Underactuated Robots
Abstract: A control method is proposed for a tensegrity robot to generate legged-rolling locomotion (i.e., rolling movement produced by a legged system). The robot has a minimal structure composed only of two rods and four elastic cables. The difficulty of the control arises from the minimalistic structure that makes the system underactuated. Our control strategy is divided into two phases: 1) overcoming the gravitational potential energy and 2) adjusting the robot's posture to prepare for the landing. Numerical simulations demonstrated that the system was capable of traversing complex terrains with two types of gaits, i.e., quasi-static and dynamic. The proposed structure also enabled the robot to autonomously recover from arbitrary stationary states and initiate legged-rolling locomotion. Physical experiments validated the applicability of the tensegrity robot to various terrains such as uphill and stair climbing and showed its capability of overcoming discrete steps up to 20% of the robot's frame length.
|
| |
| 15:00-16:30, Paper TuI2I.31 | Add to My Program |
| An Intention-Aware Robust Safety Framework for Robot Teleoperation: Unifying Object Interaction and Obstacle Avoidance |
|
| Gao, Zhitao | Huazhong University of Science and Technology |
| Peng, Fangyu | Huazhong University of Science and Technology |
| Chen, Chen | Wuhan University of Science and Technology |
| Zhang, Yukui | Huazhong University of Science and Technology |
| Zhou, Wenke | Huazhong University of Science and Technology |
| Jiang, ChengAo | Huazhong University of Science and Technology |
| Yan, Rong | Huazhong University of Science and Technology |
| Tang, Xiaowei | Huazhong University of Science and Technology |
| Wang, Yu | Huazhong University of Science and Technology |
Keywords: Telerobotics and Teleoperation, Intention Recognition, Robot Safety
Abstract: Control barrier functions (CBFs) have proven to be effective for obstacle avoidance in robot teleoperation systems. However, for the classical CBF, model uncertainties and external disturbances can significantly degrade the robustness of safety control. Moreover, the fixed safety boundary lacks adaptability to dynamic switching of operational intentions. To address these limitations, this paper presents a hierarchical safety teleoperation framework that separates the safety layer from the leader-follower teleoperation layers. On this basis, a virtual proxy is introduced to construct a robust control-affine system decoupled from physical robot uncertainties and external disturbances. Building upon this, we propose an intention-aware adaptive control barrier function (IA-ACBF), which consists of two modules: intention detection and intention quantification. The intention detection module determines the operator's transient intention, which is either object interaction or obstacle avoidance. The intention quantification module then maps this to the adaptation of safety boundaries. Finally, the performance of the proposed method is validated through simulations and experiments with the physical robot.
|
| |
| 15:00-16:30, Paper TuI2I.32 | Add to My Program |
| Adaptive Non-Linear Centroidal MPC with Stability Guarantees for Robust Locomotion of Legged Robots |
|
| Elobaid, Mohamed | KAUST (King Abdullah University of Science and Technology) |
| Turrisi, Giulio | Istituto Italiano Di Tecnologia |
| Rapetti, Lorenzo | IIT |
| Romualdi, Giulio | Istituto Italiano Di Tecnologia |
| Dafarra, Stefano | Istituto Italiano Di Tecnologia |
| Kawakami, Tomohiro | Honda R&D Co., Ltd |
| Chaki, Tomohiro | Honda R&D Co., Ltd |
| Yoshiike, Takahide | Honda R&D Co. Ltd., |
| Semini, Claudio | Istituto Italiano Di Tecnologia |
| Pucci, Daniele | Italian Institute of Technology |
Keywords: Legged Robots, Optimization and Optimal Control, Robust/Adaptive Control
Abstract: Nonlinear model predictive locomotion controllers based on the reduced centroidal dynamics are nowadays ubiquitous in legged robots. These schemes, even though they rely on an inherent simplification of the robot’s dynamics, have been shown to endow robots with a step-adjustment capability in reaction to small pushes and, in the case of uncertain parameters such as unknown payloads, to provide some “practical”, albeit limited, robustness. In this work, we provide rigorous certificates of their closed-loop stability by reformulating the online centroidal MPC controller. This is achieved through a systematic procedure inspired by the machinery of adaptive control, together with ideas coming from Control Lyapunov Functions. In addition, our reformulation provides robustness to a class of unmeasured constant disturbances. To demonstrate the generality of our approach, we validated our formulation on a new generation of humanoid robots, the 56.7 kg ergoCub, as well as on the commercially available 21 kg quadruped robot Aliengo.
|
| |
| 15:00-16:30, Paper TuI2I.33 | Add to My Program |
| PB-NBV: Efficient Projection-Based Next-Best-View Planning Framework for Reconstruction of Unknown Objects |
|
| Jia, Zhizhou | Beijing Institute of Technology |
| Li, Yuetao | Beijing Institute of Technology |
| Hao, Qun | Beijing Institute of Technology |
| Zhang, Shaohui | Beijing Institute of Technology |
| |
| 15:00-16:30, Paper TuI2I.34 | Add to My Program |
| The Folding Hand: Anthropomorphic Robotic Hands with a Compact Reconfigurable Humanoid Palm Design |
|
| Lu, Qiujie | Fudan University |
| Zou, Jiehan | Fudan University |
| Gan, Zhongxue | Fudan University |
Keywords: Multifingered Hands, In-Hand Manipulation, Grasping
Abstract: The human palm is a remarkable and highly functional part of the hand that significantly contributes to dexterity, grasp versatility, and overall manipulation capability. The metacarpophalangeal (MCP) joints of the palm enable finger flexion, extension, abduction, adduction, and limited circumduction, functions that are challenging to replicate in robotic hands with a simple design. In this paper, we propose a singly actuated folding mechanism serving as the metacarpals, with passive rotating MCP joints that perform the abduction, adduction, and circumduction of the human hand. The proposed anthropomorphic hand, called the Folding Hand, has a reconfigurable palm and five underactuated tendon-driven fingers. The hand is compact and low-cost, with all six actuators and five sensors costing less than 180. Additionally, a methodology has been developed to comprehensively analyze the grasping capacity by combining the grasping quality and the grasping workspace. The experimental results show that the folding palm mechanism and the compliant rotating finger base can replicate human hand capabilities and perform precision and in-hand manipulation tasks.
|
| |
| 15:00-16:30, Paper TuI2I.35 | Add to My Program |
| Onion-LO: Why Does LiDAR Odometry Fail across Different LiDAR Types and Scenarios? |
|
| Cheng, Xiaolong | Southeast University |
| Geng, Keke | Southeast University |
| Liu, Zhichao | Southeast University |
| Ma, Tianxiao | Southeast University |
| Sun, Ye | Southeast University |
Keywords: SLAM, Localization, Field Robots
Abstract: LiDAR odometry is a fundamental technology for autonomous navigation. However, existing LiDAR-based odometry methods typically demand extensive manual parameter tuning and remain prone to instability when deployed across varying LiDAR types and environments. This letter focuses on the essence of point clouds and introduces a fast, highly adaptable, and robust LiDAR odometry framework named Onion-LO. Onion-LO demonstrates strong compatibility with various LiDAR types and reliable operation across diverse scenarios. This is facilitated by an onion-like point cloud processing structure termed Onion Ball. The Onion Ball supports multi-threaded implementation, efficiently executing point cloud distribution analysis, segmentation, and downsampling. In addition, we design an adaptive optimization strategy for local map management and iterative optimization, which effectively enhances the system's robustness and accuracy. Extensive experiments on five datasets demonstrate that Onion-LO outperforms existing state-of-the-art methods regarding localization accuracy and robustness. Additional evaluations across 11 LiDAR sensors and 8 diverse scenarios further confirm its strong generalization capability. Our method is designed for practical deployment and supports real-time operation on onboard processors. We open-source the code on https://anonymous.4open.science/r/Onion-LO.
|
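The onion-like partitioning idea behind the method can be illustrated by binning LiDAR points into concentric range shells and downsampling each shell independently. This is a toy sketch only; the actual Onion Ball structure, shell widths, per-shell caps, and multi-threaded pipeline are not reproduced here:

```python
import numpy as np

def onion_shells(points, shell_width=10.0, max_pts_per_shell=2):
    """Group points into concentric radial shells ("onion" layers) around
    the sensor origin and cap the number of points kept per shell, a
    crude range-aware downsampling. All parameters are illustrative."""
    r = np.linalg.norm(points, axis=1)            # range of each point
    shell_idx = (r // shell_width).astype(int)    # which shell it falls in
    kept = []
    for s in np.unique(shell_idx):
        members = points[shell_idx == s]
        kept.append(members[:max_pts_per_shell])  # keep at most k per shell
    return np.vstack(kept), shell_idx
```

Binning by range this way keeps distribution analysis and downsampling local to each shell, which is what makes a per-shell, multi-threaded implementation natural.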
| |
| 15:00-16:30, Paper TuI2I.36 | Add to My Program |
| Geo-LSTM: A Geometry and Temporal Feature Fusion Algorithm for Multi-Sensor 3D Localization |
|
| Li, Kai | Hanyang University |
| Bao, Le | Jiangsu University of Technology |
| Kim, Wansoo | Hanyang University ERICA |
Keywords: Localization, Sensor Fusion, Human-Robot Collaboration
Abstract: Accurate three-dimensional (3D) localization is critical for robust human-robot collaboration (HRC) in dynamic indoor environments. However, realizing high-precision localization in complex scenarios still faces challenges such as multipath effects, field-of-view occlusion, etc. To address these limitations, we propose Geo-LSTM, a geometry-constrained long short-term memory (LSTM) framework that integrates ultra-wideband (UWB) sensors, an inertial measurement unit (IMU), and barometric pressure (BMP) sensors. First, a Simplified Geometric Localization (SGL) algorithm is proposed, which uses dual BMP sensors and an IMU to obtain precise height information and utilizes the geometric relationships between the UWB tag and anchors to compute an initial location estimate, serving as a priori input for the Geo-LSTM network. The Geo-LSTM algorithm then incorporates multi-source geometric information to extract time-series features from the UWB ranging data and the tag's a priori location, further enhancing 3D localization accuracy. The experimental results from cluttered indoor environments, including real-world HRC tasks with occlusions, show that the Geo-LSTM algorithm achieves an average 3D localization root mean square error (RMSE) of 0.103 m, representing improvements of 38.60% and 31.20% over the weighted least squares (WLS) method and the range-based LSTM algorithm, respectively. These results demonstrate Geo-LSTM's potential for reliable multi-sensor 3D localization in HRC applications.
|
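The geometric step of SGL, using barometer-derived height to reduce UWB trilateration to a 2D linear least-squares problem, can be sketched as follows. The linearization (subtracting the first anchor's range equation) is a standard one; the anchor layout and names are illustrative, not the paper's setup:

```python
import numpy as np

def sgl_estimate(anchors, ranges, z_tag):
    """Height-aided 2D trilateration sketch.

    anchors: (N, 3) UWB anchor positions; ranges: (N,) measured distances;
    z_tag: tag height (e.g. from the barometer pair). With z known, each
    range gives a horizontal circle; subtracting the first circle equation
    yields a linear system in (x, y), solved by least squares."""
    a = np.asarray(anchors, float)
    d = np.asarray(ranges, float)
    rho2 = d**2 - (a[:, 2] - z_tag)**2        # squared horizontal ranges
    A = 2.0 * (a[1:, :2] - a[0, :2])
    b = (rho2[0] - rho2[1:]
         + np.sum(a[1:, :2]**2, axis=1) - np.sum(a[0, :2]**2))
    xy, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.array([xy[0], xy[1], z_tag])
```

The closed-form estimate this produces is exactly the kind of geometric prior the abstract describes feeding into the LSTM stage.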
| |
| 15:00-16:30, Paper TuI2I.37 | Add to My Program |
| Mixed-Type Query Selection for Robotic Scientific Data Collection |
|
| Rankin, Ian C. | Oregon State University |
| Somers, Thane | Oregon State University |
| Eng, Alivia M. | Georgia Institute of Technology |
| Hollinger, Geoffrey A. | Oregon State University |
Keywords: Human-Robot Collaboration, Motion and Path Planning, Field Robots
Abstract: We propose combining preference and rating query types into a mixed-type query selection scheme that learns reward functions for robotic decision making to improve scientific data collection. Mixed-type query selection allows the scientist operating a robot to specify the robot’s tradeoffs and goals in terms of both ratings, which assign a score to a single robot plan, and preferences, which select the preferred of two plans. While previous methods have used active learning to let the user specify tradeoffs between objectives using ratings and preferences individually, our proposed method considers multiple query types together. We assume a user responds to these queries with some noise on their true preferences. Online estimation of error model parameters is difficult; therefore, we show results with both a tuned, known error model and a heuristic mixed-type query selection method. When the error model is known, we show performance increases using our mixed-type query selection versus using only ratings or only preferences. In the more realistic case with an unknown error model, we show our heuristic performs better than the worst-case single query type in all cases we tested.
|
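A single preference query can update a belief over candidate reward hypotheses as sketched below, assuming a logistic (Bradley-Terry-style) response noise model. That model is a common choice for noisy preference responses; the paper's actual error model and query-selection criterion may differ:

```python
import numpy as np

def preference_update(prior, utils_a, utils_b, chose_a, beta=1.0):
    """Bayesian weight update from one preference query.

    prior: belief over candidate reward hypotheses; utils_a / utils_b:
    each hypothesis's utility for plan A and plan B. Under the logistic
    model, P(choose A | hypothesis) = sigmoid(beta * (u_A - u_B))."""
    ua = np.asarray(utils_a, float)
    ub = np.asarray(utils_b, float)
    p_a = 1.0 / (1.0 + np.exp(-beta * (ua - ub)))   # likelihood of "A"
    like = p_a if chose_a else 1.0 - p_a
    post = np.asarray(prior, float) * like
    return post / post.sum()                        # renormalized belief
```

A rating query would plug in a different likelihood (e.g. a noise model over scores) but update the same belief, which is what makes mixing the two query types natural.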
| |
| 15:00-16:30, Paper TuI2I.38 | Add to My Program |
| Playbook: Scalable Discrete Skill Discovery from Unstructured Datasets for Long-Horizon Decision-Making Problems |
|
| Kang, Minjae | Seoul National University (SNU) |
| Hong, Mineui | Carnegie Mellon University |
| Oh, Songhwai | Seoul National University |
Keywords: Learning from Demonstration, Continual Learning, Machine Learning for Robot Control
Abstract: Skill discovery methods enable agents to tackle intricate tasks by acquiring diverse and useful skills from task-agnostic datasets in an unsupervised manner. To apply these methods to more general and everyday tasks, the skill set must be scalable. However, current approaches struggle with this scalability, often facing the challenge of catastrophic forgetting when learning new skills. To address this limitation, we propose a scalable skill discovery algorithm, a playbook, which can accommodate unseen tasks by acquiring new skills while maintaining previously learned ones. The scalable structure of the playbook, consisting of finite and independent plays and primitives, enables expansion by adding new elements to accommodate new tasks. The proposed method is evaluated on complex robotic manipulation benchmarks, and the results show that the playbook outperforms existing state-of-the-art methods.
|
| |
| 15:00-16:30, Paper TuI2I.39 | Add to My Program |
| Latent-RAG: Identity Retrieval-Guided Latent Augmentation for Privacy-Preserving Person Re-Identification |
|
| Back, Seung-hyeok | Chonnam National University |
| Lee, Eungi | Chonnam National University |
| Kim, Hyung-Il | Chonnam National University |
| Yoo, Seok Bong | Chonnam National University |
Keywords: Deep Learning for Visual Perception, Surveillance Robotic Systems, Recognition
Abstract: Person re-identification (re-ID) is crucial for security applications, including autonomous robots that monitor individuals via continuous image acquisition. Such data are transmitted to a database; however, if stored without adequate protection, they can be intercepted, posing privacy risks. In response, existing methods balance privacy and accuracy, but protected images still reveal structural cues, such as silhouettes or edges. These methods rely on randomness to defend against recovery attacks, limiting the guarantee of complete protection. Thus, this work proposes Latent-RAG, an identity retrieval-guided latent augmentation framework based on retrieval-augmented generation (RAG) for privacy-preserving person re-ID that balances re-ID performance with privacy protection. The proposed method generates augmented codes that distort appearance and disrupt mapping to the original input by retrieving identity-similar latent codes and applying inverse self-attention, enhancing its robustness to recovery attacks. Next, this approach employs gradient-based latent code manipulation to preserve identity vectors and maintain re-ID accuracy. The hierarchical latent codes are concurrently adjusted to eliminate structural cues that could threaten privacy. The experimental results demonstrate that Latent-RAG achieves strong visual distortion, reliable re-ID accuracy, and a robust defense against recovery attacks, even without additional training, using a few frozen parameters in a pretrained generator. Our code is available at https://github.com/BACKAI/Latent-RAG.
|
| |
| 15:00-16:30, Paper TuI2I.40 | Add to My Program |
| Point Cloud-Based Grasping for Soft Hand Exoskeleton |
|
| Hu, Chen | King's College London |
| Tricomi, Enrica | Technical University of Munich |
| Rho, Eojin | Korea Advanced Institute of Science and Technology (KAIST) |
| Kim, Daekyum | Korea University |
| Masia, Lorenzo | Technische Universität München (TUM) |
| Luo, Shan | King's College London |
| Gionfrida, Letizia | King's College London |
Keywords: Sensor-based Control, RGB-D Perception, Prosthetics and Exoskeletons
Abstract: Grasping is a fundamental skill for interacting with and manipulating objects in the environment. However, this ability can be challenging for individuals with hand impairments. Soft hand exoskeletons designed to assist grasping can enhance or restore essential hand functions, yet controlling these soft exoskeletons to support users effectively remains difficult due to the complexity of understanding the environment. This study presents a vision-based predictive control framework that leverages contextual awareness from depth perception to predict the grasping target and determine the next control state for activation. Unlike data-driven approaches that require extensive labelled datasets and struggle with generalizability, our method is grounded in geometric modelling, enabling robust adaptation across diverse grasping scenarios. The Grasping Ability Score (GAS) was used to evaluate performance, with our system achieving a state-of-the-art GAS of 91 ± 2% across 15 objects and healthy participants, demonstrating its effectiveness across different object types. The proposed approach maintained reconstruction success for unseen objects, underscoring its enhanced generalizability compared to learning-based models.
|
| |
| 15:00-16:30, Paper TuI2I.41 | Add to My Program |
| Knowledge-Guided Graph Convolutional Network for Multi-Label Image Classification |
|
| Dewi, Christine | Deakin University |
| Thiruvady, Dhananjay | Deakin University |
| Philemon, Stephen Abednego | Satya Wacana Christian University |
| Zaidi, Nayyar | Deakin University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Visual Learning
Abstract: Multi-label image classification is a significant challenge in computer vision due to the presence of multiple interconnected objects in a single image. Traditional convolutional neural networks (CNNs) often fail to capture semantic dependencies between labels, limiting performance in complex scenes. To address this issue, we propose a novel framework that combines a Knowledge-Guided Graph Convolutional Network (KGGCN) with a Darknet53 backbone to improve label dependency modeling. Our method fuses external semantic information from ConceptNet5, which allows the model to learn contextual relationships between labels. We evaluate this approach on two benchmark datasets, VOC 2007 and COCO, and obtain state-of-the-art results. KGGCN achieves a mean average precision (mAP) of 96.24% on VOC 2007 and 85.25% on COCO, outperforming existing methods in most categories. Ablation studies further highlight the benefit of external knowledge integration, which contributes to higher mAP scores. Finally, KGGCN demonstrates the effectiveness of combining deep visual features with structured semantic knowledge for multi-label image classification.
|
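The label-dependency modeling at the core of a GCN can be sketched as one normalized graph-convolution layer over label embeddings. This is the standard GCN form only; the KGGCN-specific fusion with the Darknet53 backbone and the construction of edges from ConceptNet5 are omitted:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph convolution over label embeddings.

    H: (L, F) label features; A: (L, L) adjacency, e.g. derived from an
    external knowledge graph; W: (F, F') learnable weights. Computes
    ReLU(D^{-1/2} (A + I) D^{-1/2} H W), the standard GCN propagation."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```

Each layer mixes a label's embedding with those of its knowledge-graph neighbors, which is how co-occurrence and semantic relatedness propagate into the classifier.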
| |
| 15:00-16:30, Paper TuI2I.42 | Add to My Program |
| HHI-Assist: A Dataset and Benchmark of Human-Human Interaction in Physical Assistance Scenario |
|
| Saadatnejad, Saeed | EPFL |
| Hosseininejad, Reyhaneh | EPFL |
| Barreiros, Jose | Toyota Research Institute |
| Tsui, Katherine | Toyota Research Institute |
| Alahi, Alexandre | EPFL |
Keywords: Data Sets for Robot Learning, Physical Human-Robot Interaction, Intention Recognition
Abstract: The increasing labor shortage and aging population underline the need for assistive robots to support human care recipients. To enable safe and responsive assistance, robots require accurate human motion prediction in physical interaction scenarios. However, this remains a challenging task due to the variability of assistive settings and the complexity of coupled dynamics in physical interactions. In this work, we address these challenges through two key contributions: (1) HHI-Assist, a dataset comprising motion capture clips of human-human interactions in assistive tasks; and (2) a conditional Transformer-based denoising diffusion model for predicting the poses of interacting agents. Our model effectively captures the coupled dynamics between caregivers and care receivers, demonstrating improvements over baselines and strong generalization to unseen scenarios. By advancing interaction-aware motion prediction and introducing a new dataset, our work has the potential to significantly enhance robotic assistance policies. The dataset and code are available at https://sites.google.com/view/hhi-assist/home .
|
| |
| 15:00-16:30, Paper TuI2I.43 | Add to My Program |
| Bi-Stable Thin Soft Robot for In-Plane Locomotion in Narrow Space |
|
| Wang, Xi | University of Nottingham |
| Liu, Yihan | Department of Electrical and Electronic Engineering, University of Nottingham, Nottingham, UK |
| Tu, Junhao | University of Michigan |
| Chang, Jung-Che | University of Nottingham |
| Wang, Feiran | University of Nottingham |
| Axinte, Dragos | University of Nottingham |
| Dong, Xin | University of Nottingham |
Keywords: Soft Robot Applications, Mechanism Design, Micro/Nano Robots
Abstract: Dielectric elastomer actuators (DEAs), also known as artificial muscles, have been widely developed for soft locomotion robots. With their compliant skeletons and miniaturized dimensions, they are well suited for narrow-space inspection. In this work, we propose a novel low-profile (1.1 mm) and lightweight (1.8 g) bi-stable in-plane DEA (Bi-DEA) constructed by supporting a dielectric elastomer on a flat bi-stable mechanism. It has an amplified displacement and output force compared with the in-plane DEA (I-DEA) without the bi-stable mechanism. The Bi-DEA is then applied to a thin soft robot, using three electrostatic adhesive pads (EA-Pads) as anchoring elements. This robot is capable of crawling and climbing to access millimetre-scale narrow gaps. Theoretical models of the bi-stable mechanism and the DEA are presented. The enhanced performance of the Bi-DEA induced by the mechanism is experimentally validated. The EA-Pads provide adhesion between the actuator and the locomotion substrate, allowing crawling and climbing on various surfaces, e.g., paper and acrylic. The thin soft robot has been demonstrated to crawl through a 4 mm narrow gap at a speed of up to 3.3 mm/s (0.07 body lengths per second and 2.78 body thicknesses per second).
|
| |
| 15:00-16:30, Paper TuI2I.44 | Add to My Program |
| BOW: Bayesian Optimization Over Windows for Motion Planning in Complex Environments |
|
| Raxit, Sourav | University of New Orleans |
| Redwan Newaz, Abdullah Al | University of New Orleans |
| Padrao, Paulo | Providence College |
| Fuentes, Jose | Florida International University |
| Bobadilla, Leonardo | Florida International University |
Keywords: Motion and Path Planning, Collision Avoidance, Nonholonomic Motion Planning
Abstract: This paper introduces the BOW Planner, a scalable motion planning algorithm designed to navigate robots through complex environments using constrained Bayesian optimization (CBO). Unlike traditional methods, which often struggle with kinodynamic constraints such as velocity and acceleration limits, the BOW Planner excels by concentrating on a planning window of reachable velocities and employing CBO to sample control inputs efficiently. This approach enables the planner to manage high-dimensional objective functions and stringent safety constraints with minimal sampling, ensuring rapid and secure trajectory generation. Theoretical analysis confirms the algorithm’s asymptotic convergence to near-optimal solutions, while extensive evaluations in cluttered and constrained settings reveal substantial improvements in computation times, trajectory lengths, and solution times compared to existing techniques. Successfully deployed across various real-world robotic systems, the BOW Planner demonstrates its practical significance through exceptional sample efficiency, safety-aware optimization, and rapid planning capabilities, making it a valuable tool for advancing robotic applications. The BOW Planner is released as an open-source package and videos of real-world and simulated experiments are available at https://bow-web.github.io/.
|
| |
| 15:00-16:30, Paper TuI2I.45 | Add to My Program |
| Inference-Stage Adaptation-Projection Strategy Adapts Diffusion Policy to Cross-Manipulators Scenarios |
|
| Yao, Xiangtong | Technical University of Munich |
| Zhou, Yirui | The Hong Kong Polytechnic University |
| Meng, Yuan | Technical University of Munich |
| Liu, Yanwen | Technical University of Munich |
| Dong, Liangyu | Technical University of Munich |
| Zhang, Zitao | Technical University of Munich |
| Bing, Zhenshan | Technical University of Munich |
| Huang, Kai | Sun Yat-Sen University |
| Sun, Fuchun | Tsinghua University |
| Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Imitation Learning, Transfer Learning, Deep Learning in Grasping and Manipulation
Abstract: Diffusion policies are powerful visuomotor models for robotic manipulation, yet they often fail to generalize to manipulators or end-effectors unseen during training and struggle to accommodate new task requirements at inference time. Addressing this typically requires costly data recollection and policy retraining for each new hardware or task configuration. To overcome this, we introduce an adaptation-projection strategy that enables a diffusion policy to perform cost-effective adaptation to novel manipulators and dynamic task settings, entirely at inference time and without retraining or fine-tuning the policy. Our method first trains a diffusion policy in SE(3) space using demonstrations from a base manipulator. During online deployment, it projects the policy's generated trajectories to satisfy the kinematic and task-specific constraints imposed by the new hardware and objectives. Moreover, this projection dynamically adapts to physical differences (e.g., tool-center-point offsets, jaw widths) and task requirements (e.g., obstacle heights), ensuring robust and successful execution. We validate our approach on real-world pick-and-place, pushing, and pouring tasks across multiple manipulators, including the Franka Panda and Kuka iiwa 14, equipped with a diverse array of end-effectors like flexible grippers, Robotiq 2F/3F grippers, and various 3D-printed designs. Our results demonstrate consistently high success rates in these cross-manipulator scenarios, proving the effectiveness and practicality of our adaptation-projection strategy.
|
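The inference-time projection idea, adjusting a generated trajectory for a new end-effector and task constraint without retraining the policy, can be illustrated in a toy form. The TCP-offset shift and height clamp below are illustrative stand-ins for the paper's kinematic and task-specific projections, not its actual operators:

```python
import numpy as np

def project_trajectory(traj, tcp_offset, min_z=0.0):
    """Project end-effector waypoints generated for a base manipulator
    onto a new hardware/task configuration (toy sketch).

    traj: (T, 3) waypoint positions; tcp_offset: (3,) shift accounting
    for the new end-effector's tool-center-point; min_z: a task-imposed
    height floor (e.g. an obstacle height)."""
    out = np.asarray(traj, float) + np.asarray(tcp_offset, float)
    out[:, 2] = np.maximum(out[:, 2], min_z)   # enforce the height bound
    return out
```

Because the projection is applied to the policy's outputs at deployment, the same trained diffusion policy can serve grippers with different TCP offsets or tasks with different clearance requirements.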
| |
| 15:00-16:30, Paper TuI2I.46 | Add to My Program |
| You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-Level 9D Multi-Object Pose Estimation |
|
| Lee, Hakjin | PIT in Corp |
| Seo, Junghoon | PIT in Corp |
| Sim, Jaehoon | PIT in Corp |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Visual Learning
Abstract: Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can be with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% IoU50 and 54.1% under the 10°10 cm metric, surpassing all prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project page: https://mikigom.github.io/YOPO-project-page/.
|
| |
| 15:00-16:30, Paper TuI2I.47 | Add to My Program |
| Constructing Contact Estimation Models for Barometric Tactile Sensors |
|
| Shah, Sharmi | Massachusetts Institute of Technology |
| Chun, Ethan | Massachusetts Institute of Technology |
| Kim, Hongmin | Massachusetts Institute of Technology |
| SaLoutos, Andrew | Massachusetts Institute of Technology |
| Nguyen, David | Massachusetts Institute of Technology |
| Seo, TaeWon | Hanyang University |
| Kim, Sangbae | Massachusetts Institute of Technology |
Keywords: Force and Tactile Sensing, Grippers and Other End-Effectors, Contact Modeling
Abstract: Barometric tactile sensors present a cheap and customizable method for adding tactile sensing to robotic platforms. These sensors consist of commercially available MEMS barometers embedded in an elastomer. However, as the sensing surface and elastomer volume increase in complexity, time-dependent material dynamics reduce sensing accuracy. We present a collection of inference and usage recommendations towards mitigating these dynamics and improving sensor force and localization resolution. Using two custom, curved, barometric tactile sensors as case studies, we demonstrate that a new data collection regime alone can improve normal force predictions by 30.4% compared to prior work. We further introduce a Binned-RNN inference architecture and demonstrate its efficacy through select ablations. Small enough to run on the sensor’s integrated microcontroller at 100 Hz, we find our model achieves a minimum spatial resolution of 0.86 mm on an ellipsoid tactile sensor. Finally, we demonstrate the robustness of these sensing capabilities through freeform contact and controlled object rolling.
|
| |
| 15:00-16:30, Paper TuI2I.48 | Add to My Program |
| PACER: Preference-Conditioned All-Terrain Costmap Generation |
|
| Mao, Luisa | University of Texas Austin |
| Warnell, Garrett | U.S. Army Research Laboratory |
| Stone, Peter | The University of Texas at Austin |
| Biswas, Joydeep | The University of Texas at Austin |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception
Abstract: In autonomous robot navigation, terrain cost assignment is typically performed using a semantics-based paradigm in which terrain is first labeled using a pre-trained semantic classifier and costs are then assigned according to a user-defined mapping between label and cost. While this approach is rapidly adaptable to changing user preferences, only preferences over the types of terrain that are already known by the semantic classifier can be expressed. In this paper, we hypothesize that a machine-learning-based alternative to the semantics-based paradigm above will allow for rapid cost assignment adaptation to preferences expressed over new terrains at deployment time without the need for additional training. To investigate this hypothesis, we introduce and study PACER, a novel approach to costmap generation that accepts as input a single birds-eye view (BEV) image of the surrounding area along with a user-specified preference context and generates a corresponding BEV costmap that aligns with the preference context. Using both real and synthetic data along with a combination of proposed training tasks, we find that PACER is able to adapt quickly to new user preferences while also exhibiting better generalization to novel terrains compared to both semantics-based and representation-learning approaches.
|
| |
| 15:00-16:30, Paper TuI2I.49 | Add to My Program |
| Multi-Modal Locomotion Mode Recognition in the Real World for Robotic Hip Complex Exoskeletons |
|
| Shin, Hyesoo | Korea Institute of Science and Technology |
| Kim, Sangdo | Korea Institute of Science and Technology |
| Kim, Sunwoo | Korea Institute of Science and Technology |
| Lee, Jongwon | Korea Institute of Science and Technology |
| Kim, Jinkyu | Korea University |
| Kim, KangGeon | Korea Institute of Science and Technology |
Keywords: Wearable Robotics, Sensor Fusion, Embedded Systems for Robotic and Automation
Abstract: Lower limb exoskeletons assist users by supporting joint movements. Since joint motion patterns vary depending on how the user moves, accurately recognizing the type of movement (locomotion mode) is crucial for controlling the exoskeleton and ensuring user safety. Inspired by how humans use multiple types of sensory information to control movement, we developed a multi-modal locomotion mode recognition (LMR) system that uses both mechanical and visual sensor data to identify locomotion modes. Our approach utilizes two fusion methods: intermediate fusion, which combines the data in the form of features, and late fusion, which integrates the sensor data by averaging the recognition results from each sensor. By fusing these two different modalities, the prediction accuracy improved by an average of 11.7% with the test data. Through comparisons with uni-modal LMR systems that rely on a single type of sensor data for locomotion mode recognition, we found that the improved performance of the multi-modal LMR system is due to the visual information's ability to generalize different gait patterns across users and the mechanical sensor data's consistency within the same classes.
|
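The two fusion schemes the abstract describes can be sketched directly: late fusion averages the per-sensor recognition scores, while intermediate fusion concatenates per-sensor features before a shared classifier. Class counts and feature sizes below are arbitrary, and the real networks are omitted:

```python
import numpy as np

def late_fusion(p_mech, p_vis):
    """Late fusion: average the class probabilities produced by the
    mechanical and visual branches, then pick the locomotion mode with
    the highest fused score."""
    fused = 0.5 * (np.asarray(p_mech, float) + np.asarray(p_vis, float))
    return fused, int(np.argmax(fused))

def intermediate_fusion(f_mech, f_vis):
    """Intermediate fusion: concatenate per-sensor feature vectors so a
    single downstream classifier head (omitted here) sees both."""
    return np.concatenate([np.ravel(f_mech), np.ravel(f_vis)])
```

Averaging lets a confident branch outvote a confused one, which matches the abstract's observation that the visual branch generalizes across users while the mechanical branch is consistent within a class.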
| |
| 15:00-16:30, Paper TuI2I.50 | Add to My Program |
| Future-Oriented Navigation: Dynamic Obstacle Avoidance with One-Shot Energy-Based Multimodal Motion Prediction |
|
| Zhang, Ze | Chalmers University of Technology |
| Hess, Georg | Chalmers University of Technology |
| Hu, Junjie | Chalmers University of Technology |
| Dean, Emmanuel | Chalmers University of Technology |
| Svensson, Lennart | Chalmers University of Technology |
| Akesson, Knut | Chalmers University of Technology |
Keywords: Human-Aware Motion Planning, Collision Avoidance, Deep Learning Methods
Abstract: This paper proposes an integrated approach for the safe and efficient control of mobile robots in dynamic and uncertain environments. The approach consists of two key steps: one-shot multimodal motion prediction to anticipate motions of dynamic obstacles and model predictive control to incorporate these predictions into the motion planning process. Motion prediction is driven by an energy-based neural network that generates high-resolution, multi-step predictions in a single operation. The prediction outcomes are further utilized to create geometric shapes formulated as mathematical constraints. Instead of treating each dynamic obstacle individually, predicted obstacles are grouped by proximity in an unsupervised way to improve performance and efficiency. The overall collision-free navigation is handled by model predictive control with a specific design for proactive dynamic obstacle avoidance. The proposed approach allows mobile robots to navigate effectively in dynamic environments. Its performance is assessed across various scenarios that represent typical warehouse settings. The results demonstrate that the proposed approach outperforms other existing dynamic obstacle avoidance methods.
|
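The unsupervised grouping of predicted obstacles by proximity can be sketched as a single-linkage flood fill: obstacles within a radius of each other land in the same group, so the MPC can impose one constraint per group instead of one per obstacle. This is an illustrative stand-in; the paper's actual grouping procedure and radius are not specified here:

```python
import numpy as np

def group_by_proximity(points, radius=1.0):
    """Assign a group label to each predicted obstacle position, merging
    any obstacles whose mutual distance is at most `radius` (transitive,
    i.e. single-linkage) via an iterative flood fill."""
    pts = np.asarray(points, float)
    n = len(pts)
    labels = -np.ones(n, dtype=int)       # -1 means "not yet grouped"
    group = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack = [i]
        labels[i] = group
        while stack:                      # expand the group outward
            j = stack.pop()
            near = np.linalg.norm(pts - pts[j], axis=1) <= radius
            for k in np.nonzero(near)[0]:
                if labels[k] < 0:
                    labels[k] = group
                    stack.append(k)
        group += 1
    return labels
```

Each group can then be wrapped in a single geometric shape (e.g. a bounding circle) and handed to the MPC as one collision constraint.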
| |
| 15:00-16:30, Paper TuI2I.51 | Add to My Program |
| Multi-Dimensional Perturbation Strategies for Adversarial Attacks in Multi-Agent Deep Reinforcement Learning |
|
| Chen, Runwen | Zhengzhou University |
| Feng, Shuo | Zhengzhou University |
| Qi, Tianzhe | China Academy of Space Technology |
| Shi, Yucheng | Zhengzhou University |
| Hu, Xiaorong | Zhengzhou University |
| Zhao, Yang | China Academy of Space Technology |
| Sun, Bo | China Academy of Space Technology |
| Jin, Zhao | Zhengzhou University |
| Luo, Yizhe | Zhengzhou University |
| Xu, Mingliang | Zhengzhou University |
Keywords: Collision Avoidance, Reinforcement Learning, Motion and Path Planning
Abstract: Research indicates that single-agent reinforcement learning is vulnerable to adversarial attacks, which can lead to decision-making errors. Similarly, multi-agent deep reinforcement learning (MADRL) systems face analogous adversarial threats. However, existing attack methods require substantial investment in agent design and computational resources, limiting the feasibility of such attacks. To address this issue, we reformulate adversarial attacks as an optimization problem and propose the MREFDW-GA algorithm, which combines dimension-weighted perturbations with a multi-stage robustness evaluation function, thereby enhancing the efficiency of evolutionary algorithms while dynamically adjusting search strategies to escape local optima. Experimental results demonstrate that this method can effectively execute black-box attacks by iteratively generating adversarial perturbations, significantly degrading the performance of MADRL systems and opening new research avenues for efficient black-box attacks.
|
| |
| 15:00-16:30, Paper TuI2I.52 | Add to My Program |
| Acoustic Drone Package Delivery Detection |
|
| Marcoux, François | Université De Sherbrooke |
| Grondin, Francois | Université De Sherbrooke |
Keywords: Surveillance Robotic Systems, Aerial Systems: Applications, Aerial Systems: Perception and Autonomy
Abstract: In recent years, the illicit use of unmanned aerial vehicles (UAVs) for deliveries in restricted areas such as prisons has become a significant security challenge. While numerous studies have focused on UAV detection or localization, little attention has been given to identifying delivery events. This study presents the first acoustic package delivery detection algorithm using a ground-based microphone array. The proposed method estimates both the drone's propeller speed and the delivery event using solely acoustic features. A deep neural network detects the presence of a drone and estimates the propeller's rotation speed or blade passing frequency (BPF) from a mel spectrogram. The algorithm analyzes the BPFs to identify probable delivery moments based on sudden changes before and after a specific time. Results demonstrate a mean absolute error of 16 Hz for the blade passing frequency estimator when the drone is less than 150 meters from the microphone array. The drone presence detector has an accuracy of 97%. The delivery detection algorithm correctly identifies 96% of events with a false positive rate of 8%. This study shows that deliveries can be identified using acoustic signals up to a range of 100 meters.
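The "sudden change before and after a specific time" test on the BPF sequence can be sketched as a simple change-point score; the window size, the mean-difference statistic, and the synthetic trace are illustrative assumptions (the paper's actual detector is more involved):

```python
def delivery_moment(bpf, w=3):
    """Return the index with the largest jump between the mean BPF over
    w samples before it and w samples after it; a minimal stand-in for
    the paper's sudden-change test."""
    best_t, best_score = None, -1.0
    for t in range(w, len(bpf) - w):
        before = sum(bpf[t - w:t]) / w
        after = sum(bpf[t:t + w]) / w
        score = abs(before - after)
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Synthetic BPF trace (Hz): steady hover, then a drop once the payload
# is released and less thrust is needed.
trace = [180, 181, 180, 182, 150, 149, 151, 150]
t_release = delivery_moment(trace)
```

In a real pipeline, `bpf` would come from the neural BPF estimator, and candidate moments would only be scored while the drone-presence detector is active.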
|
| |
| 15:00-16:30, Paper TuI2I.53 | Add to My Program |
| Flow-Enabled Generalization to Human Demonstrations in Few-Shot Imitation Learning |
|
| Tang, Runze | Australian National University |
| Sweetser, Penny | Australian National University |
Keywords: Imitation Learning, Learning from Demonstration, Machine Learning for Robot Control
Abstract: Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires large numbers of demonstrations, creating significant collection costs. Prior work has investigated using flow as an intermediate representation to enable the use of human videos as a substitute, thereby reducing the number of required robot demonstrations. However, most prior work has focused on flow defined either on the object or on specific points of the robot or hand, which cannot describe the motion of interaction. Meanwhile, relying on flow to achieve generalization to scenarios observed only in human videos remains limited, as flow alone cannot capture precise motion details. Furthermore, conditioning on scene observation to produce precise actions may cause the flow-conditioned policy to overfit to training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which includes a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts trajectories for arbitrary points. FCrP follows the general flow motion and adjusts the action based on observations for precision tasks. Our method outperforms SOTA baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.
|
| |
| 15:00-16:30, Paper TuI2I.54 | Add to My Program |
| RESPLE: Recursive Spline Estimation for LiDAR-Based Odometry |
|
| Cao, Ziyu | Linköping University |
| Talbot, William | ETH Zurich |
| Li, Kailai | University of Groningen |
Keywords: Sensor Fusion, Range Sensing
Abstract: We present a novel recursive Bayesian estimation framework using B-splines for continuous-time 6-DoF dynamic motion estimation. The state vector consists of a recurrent set of position control points and orientation control point increments, enabling efficient estimation via a modified iterated extended Kalman filter without involving error-state formulations. The resulting recursive spline estimator (RESPLE) is further leveraged to develop a versatile suite of direct LiDAR-based odometry solutions, supporting the integration of one or multiple LiDARs and an IMU. We conduct extensive real-world evaluations using public datasets and our own experiments, covering diverse sensor setups, platforms, and environments. Compared to existing systems, RESPLE achieves comparable or superior estimation accuracy and robustness, while attaining real-time efficiency. Our results and analysis demonstrate RESPLE's strength in handling highly dynamic motions and complex scenes within a lightweight and flexible design, showing strong potential as a universal framework for multi-sensor motion estimation. We release the source code and experimental datasets at https://github.com/ASIG-X/RESPLE.
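The continuous-time trajectory representation RESPLE builds on can be illustrated with a uniform cubic B-spline segment evaluated from four control points. This sketch shows only the spline interpolation itself (scalar case; apply per axis for 3-D positions); the recursive filter over control points, the orientation increments, and the LiDAR/IMU fusion are beyond this snippet:

```python
def cubic_bspline_point(p0, p1, p2, p3, u):
    """Evaluate a uniform cubic B-spline segment at u in [0, 1) from four
    consecutive control points. The basis functions sum to one, so the
    curve stays in the convex hull of the control points."""
    b0 = (1 - u) ** 3 / 6.0
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6.0
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0
    b3 = u**3 / 6.0
    return b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3

# With identical control points the spline reproduces that point exactly,
# a quick sanity check on the basis normalization.
x = cubic_bspline_point(2.0, 2.0, 2.0, 2.0, 0.3)
```

Because each query only touches four local control points, a recursive estimator can update a small, recent window of the state vector per LiDAR measurement.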
|
| |
| 15:00-16:30, Paper TuI2I.55 | Add to My Program |
| VIRAA-SLAM: Flexible Robust Visual-Inertial-Range-AOA Tightly-Coupled Localization |
|
| Ma, Xingyu | Beijing University of Posts and Telecommunications |
| Guo, Ningyan | Beijing University of Posts and Telecommunication |
| Xin, Rui | Beijing University of Posts and Telecommunications |
| Cen, Zhigang | Beijing University of Posts and Telecommunications |
| Feng, Zhiyong | Beijing University of Posts and Telecommunications |
| |
| 15:00-16:30, Paper TuI2I.56 | Add to My Program |
| A Nonlinear MPC for Physical Human-Aerial Robot Interaction in Collaborative Transportation Tasks |
|
| Gonzalez-Morgado, Antonio | Universidad De Sevilla |
| Soueidan, Jonas | LAAS-CNRS |
| Heredia, Guillermo | University of Seville |
| Ollero, Anibal | AICIA. G41099946 |
| Fraisse, Philippe | LIRMM |
| Tognon, Marco | Inria Rennes |
| Cognetti, Marco | LAAS-CNRS and Université De Toulouse |
Keywords: Aerial Systems: Mechanics and Control, Physical Human-Robot Interaction, Optimization and Optimal Control
Abstract: Aerial robots are transitioning from traditional surveillance and monitoring roles to more advanced tasks involving physical interaction. Despite this progress, physical Human-Aerial Robot Interaction remains largely underexplored due to the complexity and stability-related issues of such platforms. This paper introduces a novel control framework that enables an aerial platform to cooperatively transport an object with a human operator. The control approach is built on nonlinear model predictive control (NMPC), integrating the dynamic models of the human, the aerial robot, and the transported object. To ensure safe and robust physical interaction, the NMPC is combined with a compliant controller. Additionally, our controller prioritizes forward motion over lateral movements to accommodate the human's natural direction of motion. We validate this framework through indoor flight experiments, demonstrating how a human operator and a fully actuated hexarotor can effectively collaborate to transport a bar. The results highlight the aerial robot's ability to assist the human during physical transportation tasks, enhancing efficiency and comfort.
|
| |
| 15:00-16:30, Paper TuI2I.57 | Add to My Program |
| A Unilateral Active Knee Exoskeleton to Assist Individuals with Hemiparesis - a Pilot Study |
|
| Pergolini, Andrea | Scuola Superiore Sant'Anna of Pisa |
| Sanz-Morère, Clara Beatriz | Spanish National Research Council |
| Livolsi, Chiara | IUVO S.r.l, Scuola Superiore Sant'Anna of Pisa |
| Fantozzi, Matteo | Scuola Superiore Sant'Anna |
| Dell'Agnello, Filippo | Scuola Superiore Sant'Anna |
| Ciapetti, Tommaso | IRCSS Fondazione Don Carlo Gnocchi |
| Maselli, Alessandro | Dipartimento delle Professioni Tecnico Sanitarie, della Riabilitazione e della Prevenzione Azienda USL Toscana Sudest |
| Baldoni, Andrea | ISTITUTO DI BIOROBOTICA |
| Trigili, Emilio | Scuola Superiore Sant'Anna |
| Crea, Simona | Scuola Superiore Sant'Anna, The BioRobotics Institute |
| Vitiello, Nicola | Scuola Superiore Sant Anna |
|
|
| |
| 15:00-16:30, Paper TuI2I.59 | Add to My Program |
| A Digital Twin for Robotic Post Mortem Tissue Sampling Using Virtual Reality |
|
| Neidhardt, Maximilian | Hamburg University of Technology |
| Bosse, Ludwig | Technical University Hamburg |
| Raudonis, Vidas | Kaunas University of Technology |
| Allgoewer, Kristina | University Medical Center Hamburg-Eppendorf |
| Heinemann, Axel | University Medical Center Hamburg-Eppendorf |
| Ondruschka, Benjamin | University Medical Center Hamburg-Eppendorf |
| Schlaefer, Alexander | Hamburg University of Technology |
Keywords: Medical Robots and Systems, Motion and Path Planning
Abstract: Studying tissue samples obtained during autopsies is the gold standard when diagnosing the cause of death and for understanding disease pathophysiology. Recently, interest in post mortem minimally invasive biopsies has grown, as they are less destructive than an open autopsy and reduce the risk of infection. While manual biopsies under ultrasound guidance are more widely performed, robotic post mortem biopsies have been recently proposed. This approach can further reduce the risk of infection for physicians. However, planning of the procedure and control of the robot need to be efficient and usable. We explore a virtual reality setup with a digital twin to realize fully remote planning and control of robotic post mortem biopsies. The setup is evaluated with forensic pathologists in a usability study covering three interaction methods. Furthermore, we assess clinical feasibility by evaluating the system on three human cadavers. Overall, 132 needle insertions were performed with an off-axis needle placement error of 5.30±3.25 mm. Tissue samples were successfully biopsied and histopathologically verified. Users reported a very intuitive needle placement approach, indicating that the system is a promising, precise, and low-risk alternative to conventional approaches.
|
| |
| 15:00-16:30, Paper TuI2I.60 | Add to My Program |
| A Human Finger-Inspired Rigid-Soft Hybrid Gripper for Damage-Free and Fast Grasping |
|
| Zhou, Pengyu | Fudan University |
| Gao, Zeyang | Fudan University |
| Zhang, Xiaoxu | Fudan University |
| Yin, Xiaowen | Fudan Univerisity |
| Fang, Hongbin | Fudan University |
| Xu, Jian | Fudan University |
Keywords: Grippers and Other End-Effectors, Hydraulic/Pneumatic Actuators, Soft Robot Applications
Abstract: Rigid-soft hybrid grippers offer good protection and high payload capacity for fragile and heavy objects. However, because of inadequate actuation speed, it is still challenging for hybrid grippers to grasp moving objects in unstructured environments. To overcome this limitation, this article presents a rigid-soft hybrid gripper with four grasping modes that can not only grasp deformable and heavy objects, such as tofu and a dumbbell, but also capture moving objects with a short response time. Inspired by the structure of human fingers, a rigid-soft hybrid finger with a soft outer body and a rigid inner skeleton is designed. The finger consists of a soft pneumatic actuator (SPA), an endoskeleton linkage, a self-locking mechanism, a fast-responding mechanism, a pneumatic artificial muscle actuator (PAMA), a power transition bolt, and two split pins. The fast response speed of the PAMA and the amplification of the endoskeleton linkage enable the gripper to capture moving objects. A kinematic model is established to verify the endoskeleton linkage's angular velocity amplification ability and describe its bending angle. Experiments demonstrate that the rigid-soft finger can bend to 145.14° within 71 ms. Finally, the gripper is mounted on a robotic arm to demonstrate that it can grasp fragile and deformable objects, hold heavy objects, and capture moving objects. The gripper's grasping strategies and structure provide a new approach to designing high-performance rigid-soft hybrid grippers.
|
| |
| 15:00-16:30, Paper TuI2I.61 | Add to My Program |
| A Path Planning Strategy for Robotic Bronchoscopic Multi-Sample Biopsy |
|
| Pan, Qiqi | Fudan University |
| Luo, Jingjing | Fudan University |
| Feng, Yongfei | Fudan University |
| Duan, Wenke | Fudan University |
| Guo, Shijie | Hebei University of Technology |
| Hongbo, Wang | Fudan University |
Keywords: Surgical Robotics: Planning, Task and Motion Planning, Constrained Motion Planning
Abstract: Lung cancer is the leading cause of cancer death globally, and early diagnosis via transbronchial biopsy (TBB) improves outcomes. However, conventional bronchoscopy for multiple pulmonary nodules suffers from inefficiency and dependence on operator skill. This paper proposes a path planning strategy for robotic bronchoscopic multi-sample TBB. It abstracts the bronchial tree as a circuit, with lesions as constant-resistance bulbs and bronchial branches as resistors whose equivalent resistance is derived from their morphology. Multi-target path planning is thus transformed into minimizing the total circuit resistance, optimizing trajectories to reduce redundant movements of the robotic manipulator. Compared to traditional methods, evaluations show over a 60% reduction in movement distance and 76% less operation time; experiments achieve over a 40% efficiency improvement, enhancing multi-sample TBB efficiency and safety.
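The circuit analogy can be made concrete with the standard series/parallel resistance rules. This is a loose illustration of the idea only: how morphology maps to resistance values, and how routes map to circuit topologies, is the paper's contribution and is not reproduced here; the toy numbers and the two candidate routes are pure assumptions:

```python
def series(*rs):
    """Equivalent resistance of resistors in series."""
    return sum(rs)

def parallel(*rs):
    """Equivalent resistance of resistors in parallel."""
    return 1.0 / sum(1.0 / r for r in rs)

# Toy comparison of two ways to reach two lesions from a shared airway
# (resistances are arbitrary units standing in for branch morphology):
# reaching both via a common parent branch vs. traversing each branch in turn.
shared = series(1.0, parallel(2.0, 2.0))  # common trunk, then parallel subtrees
split = series(1.0, 2.0, 2.0)             # trunk plus both branches in series
preferred = min(shared, split)
```

Under the paper's formulation, the planner would select the trajectory corresponding to the lower total equivalent resistance, here the shared-trunk route.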
|
| |
| 15:00-16:30, Paper TuI2I.62 | Add to My Program |
| GeoPF: Infusing Geometry into Potential Fields for Reactive Planning in Non-Trivial Environments |
|
| Gong, Yuhe | University of Nottingham |
| Laha, Riddhiman | Technical University of Munich |
| Figueredo, Luis | University of Nottingham (UoN) |
Keywords: Reactive and Sensor-Based Planning, Integrated Planning and Control, Collision Avoidance
Abstract: Reactive intelligence remains one of the cornerstones of versatile robotics operating in cluttered, dynamic, and human-centred environments. Among reactive approaches, potential fields (PF) continue to be widely adopted due to their simplicity and real-time applicability. However, existing PF methods typically oversimplify environmental representations by relying on isotropic, point- or sphere-based obstacle approximations. In human-centred settings, this simplification results in overly conservative paths, cumbersome tuning, and computational overhead that can even break real-time requirements. In response, we propose the Geometric Potential Field (GeoPF), a reactive motion-planning framework that explicitly infuses geometric primitives (points, lines, planes, cubes, and cylinders), using their structure and spatial relationships to modulate the real-time repulsive response. Extensive quantitative analyses consistently show GeoPF's higher success rates, reduced tuning complexity (a single parameter set across experiments), and substantially lower computational costs (up to two orders of magnitude) compared to traditional PF methods. Real-world experiments further validate GeoPF's reliability, robustness, and practical ease of deployment. GeoPF provides a fresh perspective on reactive planning, driving geometry-aware temporal motion generation and enabling flexible, low-latency motion planning suitable for modern robotic applications.
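The gain from geometric primitives can be sketched by pairing an exact point-to-segment distance with a classic repulsive potential: one line primitive replaces the many spheres a traditional PF would need to cover a wall. The distance routine and the Khatib-style repulsion below are textbook constructions, not GeoPF's actual formulation; gains and radii are illustrative:

```python
import math

def dist_point_segment(p, a, b):
    """Euclidean distance from 2-D point p to segment ab."""
    ax, ay = a
    bx, by = b
    px, py = p
    abx, aby = bx - ax, by - ay
    # Project p onto the segment's supporting line, clamp to the segment.
    t = ((px - ax) * abx + (py - ay) * aby) / (abx**2 + aby**2)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))

def repulsive(d, d0=1.0, k=1.0):
    """Classic repulsive potential: zero beyond influence radius d0,
    growing without bound as d approaches zero."""
    return 0.0 if d >= d0 else 0.5 * k * (1.0 / d - 1.0 / d0) ** 2

# A wall modeled as one line segment instead of a row of spheres:
d = dist_point_segment((0.5, 0.4), (0.0, 0.0), (1.0, 0.0))
U = repulsive(d)
```

The negative gradient of `U` with respect to the robot position would supply the repulsive force; GeoPF additionally exploits primitive structure and spatial relationships when modulating that response.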
|
| |
| 15:00-16:30, Paper TuI2I.63 | Add to My Program |
| SCOOP'D: Learning Mixed-Liquid-Solid Scooping Via Sim2Real Generative Policy |
|
| Wang, Kuanning | Fudan University |
| Gu, Yongchong | Fudan University |
| Fu, Yuqian | Fudan University |
| Shangguan, Zeyu | University of Southern California |
| He, Sicheng | University of Southern California |
| Xue, Xiangyang | Fudan University |
| Fu, Yanwei | Fudan University |
| Seita, Daniel | University of Southern California |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Data Sets for Robot Learning
Abstract: Scooping items with tools such as spoons and ladles is common in daily life, ranging from assistive feeding to retrieving items from environmental disaster sites. However, developing a general and autonomous robotic scooping policy is challenging since it requires reasoning about complex tool-object interactions. Furthermore, scooping often involves manipulating deformable objects, such as granular media or liquids, which is challenging due to their infinite-dimensional configuration spaces and complex dynamics. We propose a method, SCOOP'D, which uses simulation from OmniGibson (built on NVIDIA Omniverse) to collect scooping demonstrations using algorithmic procedures that rely on privileged state information. Then, we use generative policies via diffusion to imitate demonstrations from observational input. We directly apply the learned policy in diverse real-world scenarios, testing its performance on various item quantities, item characteristics, and container types. In zero-shot deployment, our method demonstrates promising results across 465 trials in diverse scenarios, including objects of different difficulty levels that we categorize as "Level 1" and "Level 2." SCOOP'D outperforms all baselines and ablations, suggesting that this is a promising approach to acquiring robotic scooping skills. Project page: https://scoopdiff.github.io/
|
| |
| 15:00-16:30, Paper TuI2I.64 | Add to My Program |
| UP-SLAM: Adaptively Structured Gaussian SLAM with Uncertainty Prediction in Dynamic Environments |
|
| Zheng, Wancai | Zhejiang University of Technology |
| Ou, Linlin | Zhejiang University of Technology |
| Jiajie, He | Zhejiang University of Technology |
| Zhou, Libo | Zhejiang University of Technology |
| Wei, Yan | Zhejiang University of Technology |
| Yu, Xinyi | Zhejiang University of Technology |
Keywords: SLAM, Mapping, RGB-D Perception
Abstract: Recent 3D Gaussian Splatting (3DGS) techniques for visual Simultaneous Localization and Mapping (SLAM) have significantly progressed in tracking and high-fidelity mapping. However, their sequential optimization framework and sensitivity to dynamic objects limit real-time performance and robustness in real-world scenarios. We present UP-SLAM, a real-time RGB-D SLAM system for dynamic environments that decouples tracking and mapping through a parallelized framework. A probabilistic anchor is employed to manage Gaussian primitives adaptively, enabling efficient initialization and pruning without hand-crafted thresholds. To robustly filter dynamic regions during tracking, we propose a training-free uncertainty estimator that fuses multi-modal residuals to estimate per-pixel motion uncertainty, achieving open-set dynamic object handling without reliance on semantic labels. Furthermore, a temporal encoder is designed to enhance rendering quality, while a shallow multilayer perceptron transforms low-dimensional features into DINO features, enriching the Gaussian field and enhancing uncertainty prediction robustness. Extensive experiments on multiple challenging datasets suggest that UP-SLAM outperforms state-of-the-art methods in both localization accuracy (by 59.8%) and rendering quality (by 4.72 dB PSNR), while maintaining real-time performance and producing reusable, artifact-free static maps in dynamic environments. The Project Page: https://aczheng-cai.github.io/up_slam.github.io/
|
| |
| 15:00-16:30, Paper TuI2I.65 | Add to My Program |
| Multi-State Consistency Visual Language Model Combine Wavelet Transform for Weakly Supervised Robot Visual Segmentation |
|
| Xiao, Feng | Norwegian University of Science and Technology |
| Han, Peihua | Norwegian University of Science and Technology |
| Li, Guoyuan | Norwegian University of Science and Technology |
| Zhang, Houxiang | Norwegian University of Science and Technology |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Deep Learning Methods
Abstract: Robotic visual segmentation is essential for enabling robots to operate in complex environments. Although supervised methods have achieved remarkable progress, their dependence on dense annotations hinders scalability. Weakly supervised semantic segmentation (WSSS) alleviates this issue but suffers from sparse supervision, leading to noisy pseudo-labels and boundary errors. Large visual models (LVMs), pretrained on diverse data, provide rich semantic priors that can strengthen weak supervision and address these limitations. To this end, we designed a dual-branch architecture, introducing two large pre-trained models with complementary characteristics. We align the feature spaces of the two branches through consistency learning to alleviate the representation differences and weakly supervised noise problems caused by cross-domain migration, thereby obtaining more robust and fine-grained semantic features. Furthermore, to effectively restore spatial details and improve the quality of segmentation boundaries, we introduce a wavelet transform in the decoder. Wavelet decomposition can simultaneously capture low-frequency global information and high-frequency local details at multiple scales, allowing the model to enhance spatial restoration capabilities while maintaining semantic consistency. Experimental results show that our method improves the performance by 7.7% compared with the state-of-the-art methods in WSSS.
|
| |
| 15:00-16:30, Paper TuI2I.66 | Add to My Program |
| Autonomous Block Assembly for Boom Cranes with Passive Joint Dynamics: Integrated Vision MPC Control |
|
| Ebmer, Gerald | TU Wien |
| Vu, Minh Nhat | TU Wien, Austria |
| Glück, Tobias | AIT Austrian Institute of Technology GmbH |
| Kemmetmueller, Wolfgang | TU Wien |
Keywords: Robotics and Automation in Construction, Integrated Planning and Control, Building Automation
Abstract: This paper presents an autonomous control framework for articulated boom cranes performing prefabricated block assembly in construction environments. The key challenge addressed is precise placement control under passive joint dynamics that cause pendulum-like sway, complicating the accurate positioning of building components. Our integrated approach combines real-time vision-based pose estimation of building blocks, collision-aware B-spline path planning, and nonlinear model predictive control (NMPC) to achieve autonomous pickup, placement, and obstacle-avoidance assembly operations. The framework is validated on a laboratory-scale testbed that emulates crane kinematics and passive dynamics while enabling rapid experimentation. The collision-aware planner generates feasible B-spline references in real-time on CPU hardware with anytime performance, while the NMPC controller actively suppresses passive joint sway and tracks the planned trajectory under continuous vision feedback. Experimental results demonstrate autonomous block stacking and obstacle-avoidance assembly, with sway damping reducing settling times by more than an order of magnitude compared to uncontrolled passive dynamics, confirming the real-time feasibility of the integrated approach for construction automation.
|
| |
| 15:00-16:30, Paper TuI2I.67 | Add to My Program |
| Learning Geometry-Aware Nonprehensile Pushing and Pulling with Dexterous Hands |
|
| Li, Yunshuang | University of Southern California |
| Ling, Yiyang | University of Southern California |
| Sukhatme, Gaurav | University of Southern California |
| Seita, Daniel | University of Southern California |
Keywords: Multifingered Hands, Data Sets for Robot Learning
Abstract: Nonprehensile manipulation, such as pushing and pulling, enables robots to move, align, or reposition objects that may be difficult to grasp due to their geometry, size, or relationship to the robot or the environment. Much of the existing work in nonprehensile manipulation relies on parallel-jaw grippers or tools such as rods and spatulas. In contrast, multi-fingered dexterous hands offer richer contact modes and the versatility to provide stable support over diverse objects, which compensates for the difficulty of modeling the dynamics of nonprehensile manipulation. Therefore, we propose Geometry-aware Dexterous Pushing and Pulling (GD2P) for nonprehensile manipulation with dexterous robotic hands. We study pushing and pulling by framing the problem as synthesizing and learning pre-contact dexterous hand poses that lead to effective manipulation. We generate diverse hand poses via contact-guided sampling, filter them using physics simulation, and train a diffusion model conditioned on object geometry to predict viable poses. At test time, we sample hand poses and use standard motion planners to select and execute pushing and pulling actions. We perform extensive real-world experiments with an Allegro Hand and a LEAP Hand, demonstrating that GD2P offers a scalable route to generating dexterous nonprehensile manipulation motions across different hand morphologies. Our project website is available at: https://geodex2p.github.io/.
|
| |
| 15:00-16:30, Paper TuI2I.68 | Add to My Program |
| DTP-Attack: A Decision-Based Black-Box Adversarial Attack on Trajectory Prediction |
|
| Li, Jiaxiang | Tongji University |
| Yan, Jun | Tongji University |
| Watzenig, Daniel | TU Graz |
| Yin, Huilin | Tongji University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Intelligent Transportation Systems
Abstract: Trajectory prediction systems are critical for autonomous vehicle safety, yet remain vulnerable to adversarial attacks that can cause catastrophic traffic behavior misinterpretations. Existing attack methods require white-box access with gradient information and rely on rigid physical constraints, limiting real-world applicability. We propose DTP-Attack, a decision-based black-box adversarial attack framework tailored for trajectory prediction systems. Our method operates exclusively on binary decision outputs without requiring model internals or gradients, making it practical for real-world scenarios. DTP-Attack employs a novel boundary walking algorithm that navigates adversarial regions without fixed constraints, naturally maintaining trajectory realism through proximity preservation. Unlike existing approaches, our method supports both intention misclassification attacks and prediction accuracy degradation. Extensive evaluation on nuScenes and Apolloscape datasets across state-of-the-art models including Trajectron++ and Grip++ demonstrates superior performance. DTP-Attack achieves 41-81% attack success rates for intention misclassification attacks that manipulate perceived driving maneuvers with perturbations below 0.45 m, and increases prediction errors by 1.9-4.2× for accuracy degradation. Our method consistently outperforms existing black-box approaches while maintaining high controllability and reliability across diverse scenarios. These results reveal fundamental vulnerabilities in current trajectory prediction systems, highlighting urgent needs for robust defenses in safety-critical autonomous driving applications.
|
| |
| 15:00-16:30, Paper TuI2I.69 | Add to My Program |
| All-UWB SLAM Using UWB Radar and UWB AOA |
|
| Hanchapola Appuhamilage, Gihan Charith Premachandra | Singapore University of Technology and Design |
| Athukorala, Achala | Singapore University of Technology and Design |
| Tan, U-Xuan | Singapore University of Techonlogy and Design |
Keywords: SLAM, Range Sensing, Search and Rescue Robots
Abstract: There has been a growing interest in autonomous systems designed to operate in adverse conditions (e.g. smoke, dust), where the visible light spectrum fails. In this context, Ultra-wideband (UWB) radar is capable of penetrating through such challenging environmental conditions due to the lower frequency components within its broad bandwidth. Therefore, UWB radar has emerged as a potential sensing technology for Simultaneous Localization and Mapping (SLAM) in vision-denied environments where optical sensors (e.g. LiDAR, Camera) are prone to failure. Existing approaches involving UWB radar as the primary exteroceptive sensor generally extract features in the environment, which are later initialized as landmarks in a map. However, these methods are constrained by the number of distinguishable features in the environment. Hence, this paper proposes a novel method incorporating UWB Angle of Arrival (AOA) measurements into UWB radar-based SLAM systems to improve the accuracy and scalability of SLAM in feature-deficient environments. The AOA measurements are obtained using UWB anchor-tag units which are dynamically deployed by the robot in featureless areas during mapping of the environment. This paper thoroughly discusses prevailing constraints associated with UWB AOA measurement units and presents solutions to overcome them. Our experimental results show that integrating UWB AOA units with UWB radar enables SLAM in vision-denied feature-deficient environments.
|
| |
| 15:00-16:30, Paper TuI2I.70 | Add to My Program |
| Backdrivable Redundantly Actuated Parallel Robot for Sensorless Physical Human-Robot Interaction |
|
| Yigit, Arda | LS2N |
| Foucault, Simon | Université Laval |
| Gosselin, Clement | Université Laval |
Keywords: Physical Human-Robot Interaction, Parallel Robots, Redundant Robots
Abstract: This paper introduces a new redundantly actuated parallel robot capable of sensorless physical human-robot interaction. Three 3-DoF legs are attached to an end-effector platform through spherical joints. This architecture alleviates most parallel singularities, thus enabling a very large workspace. The use of quasi-direct-drive actuators yields a backdrivable robot with very low impedance, since all actuators are fixed to the base. Furthermore, since the actuators are force/torque controlled, internal antagonistic forces can be controlled without additional sensing devices. Experiments are carried out on a physical prototype and validate the large workspace and physical interaction capabilities of the robot.
|
| |
| 15:00-16:30, Paper TuI2I.71 | Add to My Program |
| RIPNEON: Memory-Lite and Computation-Efficient Occupancy Mapping Via Block Read-Write and Key Grids Expansion |
|
| Dong, Qianli | Nankai University |
| Zhang, Xuebo | Nankai University, |
| Zhang, Shiyong | Nankai University |
| Xi, Haobo | Nankai University |
| Wang, Ziyu | Nankai University |
| Ma, Zhe | Nankai University |
| Zhang, Zhiyong | Nankai University |
Keywords: Mapping, Aerial Systems: Perception and Autonomy, Motion and Path Planning
Abstract: Mobile robot motion planning heavily relies on grid-based occupancy maps, yet existing approaches require high memory usage and incur expensive update overhead. In this work, we propose a memory-lite grid-block data structure and an efficient map updating algorithm for LiDAR-based online exploration-oriented planning. To accelerate query operations and reduce memory usage, we adopt the grid-block-based map as the basic data structure and propose to dynamically read and write blocks around the sensor. For each block, the occupied grids and frontier grids are maintained in two separate lists, serving as key grids for the map update. Instead of updating free grids by ray casting, we propose a key grids expansion algorithm that avoids repetitively querying grids on casted beams. The proposed algorithm not only speeds up the occupancy map update but also detects frontier grids, which are crucial for exploration tasks, without extra computation. We compare the proposed method with state-of-the-art mapping methods on the KITTI dataset and a self-collected dataset. The proposed method outperforms other methods in terms of memory usage and map update computation. It is also deployed on a UAV for a real-world exploration test. The source code is released at: https://github.com/NKU-MobFly-Robotics/RipNeon.
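The memory-saving idea behind a grid-block structure, allocating storage lazily per block instead of over the whole volume, can be sketched in 2-D. The block size, unknown-cell prior, and class layout are illustrative assumptions; RIPNEON's actual structure additionally keeps per-block occupied and frontier key-grid lists:

```python
BLOCK = 8  # grid cells per block side (illustrative choice)

class BlockMap:
    """Minimal grid-block occupancy store: a block is allocated only when
    a cell inside it is first written, so memory is spent only near
    observed space rather than over a fixed dense volume."""

    def __init__(self):
        self.blocks = {}  # (block_x, block_y) -> {(local_x, local_y): occupancy}

    def _key(self, ix, iy):
        return (ix // BLOCK, iy // BLOCK)

    def set_occ(self, ix, iy, value):
        blk = self.blocks.setdefault(self._key(ix, iy), {})
        blk[(ix % BLOCK, iy % BLOCK)] = value

    def get_occ(self, ix, iy):
        blk = self.blocks.get(self._key(ix, iy))
        if blk is None:
            return 0.5  # unknown prior for never-observed space
        return blk.get((ix % BLOCK, iy % BLOCK), 0.5)

m = BlockMap()
m.set_occ(3, 5, 1.0)       # a single hit allocates exactly one block
allocated = len(m.blocks)  # far-away space costs nothing
```

Dynamically reading and writing only the blocks around the sensor keeps both queries and updates bounded regardless of total map extent.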
|
| |
| 15:00-16:30, Paper TuI2I.72 | Add to My Program |
| Perfect Prediction or Plenty of Proposals? What Matters Most in Planning for Autonomous Driving |
|
| Distelzweig, Aron | Albert-Ludwigs-Universität Freiburg |
| Janjoš, Faris | Robert Bosch GmbH |
| Scheel, Oliver | Bosch |
| Varra, Sirish Reddy | RWTH Aachen University |
| Rajan, Raghu | University of Freiburg |
| Boedecker, Joschka | University of Freiburg |
Keywords: Intelligent Transportation Systems, Imitation Learning, AI-Based Methods
Abstract: Traditionally, prediction and planning in autonomous driving (AD) have been treated as separate, sequential modules. Recently, there has been a growing shift towards tighter integration of these components, known as Integrated Prediction and Planning (IPP), with the aim of enabling more informed and adaptive decision-making. However, it remains unclear to what extent this integration actually improves planning performance. In this work, we investigate the role of prediction in IPP approaches, drawing on the widely adopted Val14 benchmark, which encompasses more common driving scenarios with relatively low interaction complexity, and the interPlan benchmark, which includes highly interactive and out-of-distribution driving situations. Our analysis reveals that even access to perfect future predictions does not lead to better planning outcomes, indicating that current IPP methods often fail to fully exploit future behavior information. This suggests that planning may not benefit as much from accurate prediction. Instead, we focus on high-quality proposal generation, while using predictions primarily for collision checks. We find that many imitation learning-based planners struggle to generate realistic and plausible proposals, performing worse than PDM—a simple lane-following approach. Motivated by this observation, we build on PDM with an enhanced proposal generation method, shifting the emphasis towards producing diverse but realistic and high-quality proposals. This proposal-centric approach significantly outperforms existing methods, especially in out-of-distribution and highly interactive settings, where it sets new state-of-the-art results.
|
| |
| 15:00-16:30, Paper TuI2I.73 | Add to My Program |
| DD-MDN: Human Trajectory Forecasting with Diffusion-Based Dual Mixture Density Networks and Uncertainty Self-Calibration |
|
| Hetzel, Manuel | University of Applied Sciences Aschaffenburg |
| Turacan, Kerim | Aschaffenburg University of Applied Sciences |
| Reichert, Hannes | University of Applied Sciences Aschaffenburg |
| Doll, Konrad | University of Applied Sciences Aschaffenburg |
| Sick, Bernhard | University of Kassel |
Keywords: Planning under Uncertainty, Probabilistic Inference, Deep Learning Methods
Abstract: Human Trajectory Forecasting (HTF) predicts future human movements from past trajectories and environmental context, with applications in Autonomous Driving, Smart Surveillance, and Human-Robot Interaction. While prior work has focused on accuracy, social interaction modeling, and diversity, little attention has been paid to uncertainty modeling, calibration, and forecasts from short observation periods, which are crucial for downstream tasks such as path planning and collision avoidance. We propose DD-MDN, an end-to-end probabilistic HTF model that combines high positional accuracy, calibrated uncertainty, and robustness to short observations. Using a few-shot denoising diffusion backbone and a dual mixture density network, our method learns self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived, without predefined anchors or endpoints. Experiments on the ETH/UCY, SDD, inD, and IMPTC datasets demonstrate state-of-the-art accuracy, robustness at short observation intervals, and reliable uncertainty modeling. The code is available at: https://github.com/kav-institute/ddmdn.
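The mixture-density idea at the heart of an MDN head can be sketched independently of the paper's architecture. A minimal 1-D example (illustrative only; the paper's networks operate on trajectories): the network outputs mixture weights, means, and scales, and is trained by minimizing the negative log-likelihood of the observed value under that mixture:

```python
import math

def mixture_nll(x, weights, means, sigmas):
    """Negative log-likelihood of a scalar x under a 1-D Gaussian mixture --
    the core training loss behind a mixture density network head."""
    density = 0.0
    for w, mu, s in zip(weights, means, sigmas):
        density += w * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return -math.log(density)

# One standard-normal component evaluated at its mean:
nll = mixture_nll(0.0, [1.0], [0.0], [1.0])
```

Because the output is a full distribution rather than a point, residence areas and ranked hypotheses can be read off the mixture directly.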
|
| |
| 15:00-16:30, Paper TuI2I.74 | Add to My Program |
| Gust Estimation and Rejection with a Disturbance Observer for Proprioceptive Underwater Soft Morphing Wings |
|
| Cook, Tobias | University of Edinburgh |
| Micklem, Leo | Portland State University |
| Dong, Huazhi | The University of Edinburgh |
| Yang, Yunjie | The University of Edinburgh |
| Mistry, Michael | University of Edinburgh |
| Giorgio-Serchi, Francesco | University of Edinburgh |
Keywords: Modeling, Control, and Learning for Soft Robots, Marine Robotics, Soft Sensors and Actuators
Abstract: Unmanned underwater vehicles are increasingly employed for maintenance and surveying tasks at sea, but their operation in shallow waters is often hindered by hydrodynamic disturbances such as waves, currents, and turbulence. These unsteady flows can induce rapid changes in direction and speed, compromising vehicle stability and manoeuvrability. Marine organisms contend with such conditions by combining proprioceptive feedback with flexible fins and tails to reject disturbances. Inspired by this strategy, we propose soft morphing wings endowed with proprioceptive sensing to mitigate environmental perturbations. The wing’s continuous deformation provides a natural means to infer dynamic disturbances: sudden changes in camber directly reflect variations in the oncoming flow. By interpreting this proprioceptive signal, a disturbance observer can reconstruct flow parameters in real time. To enable this, we develop and experimentally validate a dynamic model of a hydraulically actuated soft wing with controllable camber. We then show that curvature-based sensing allows accurate estimation of disturbances in the angle of attack. Finally, we demonstrate that a controller leveraging these proprioceptive estimates can reject disturbances in the lift response of the soft wing. By combining proprioceptive sensing with a disturbance observer, this technique mirrors biological strategies and provides a pathway for soft underwater vehicles to maintain stability in hazardous environments.
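The disturbance-observer principle can be sketched with a deliberately simplified scalar lift model (the gain, bandwidth, and linear lift relation below are assumptions for illustration, not the paper's identified model): the proprioceptive measurement is inverted through the plant model and low-pass filtered to track the angle-of-attack disturbance.

```python
K_LIFT = 2.0   # assumed linear lift gain (illustrative)
LAM = 50.0     # observer bandwidth in 1/s (illustrative)
DT = 1e-3      # integration step

u = 0.1        # commanded angle of attack
d_true = 0.3   # constant angle-of-attack disturbance to be estimated
d_hat = 0.0    # observer state

for _ in range(2000):
    y = K_LIFT * (u + d_true)                 # lift inferred from curvature sensing
    # First-order disturbance observer: filter the model-inversion residual.
    d_hat += LAM * DT * (y / K_LIFT - u - d_hat)
```

After a few observer time constants the estimate converges to the true disturbance, which a controller can then subtract from the command.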
|
| |
| 15:00-16:30, Paper TuI2I.75 | Add to My Program |
| Reward Evolution with Graph-Of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning |
|
| Yao, Changwei | Carnegie Mellon University |
| Liu, Xinzi | The University of Tokyo |
| Li, Chen | Carnegie Mellon University |
| Savvides, Marios | Carnegie Mellon University |
Keywords: AI-Based Methods, Reinforcement Learning, AI-Enabled Robotics
Abstract: Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.
|
| |
| 15:00-16:30, Paper TuI2I.76 | Add to My Program |
| SEG-Parking: Towards Safe, Efficient, and Generalizable Autonomous Parking Via End-To-End Offline Reinforcement Learning |
|
| Yang, Zewei | The Hong Kong University of Science and Technology (Guangzhou) |
| Peng, Zengqi | The Hong Kong University of Science and Technology (Guangzhou) |
| Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Autonomous Vehicle Navigation, Integrated Planning and Control, Reinforcement Learning
Abstract: Autonomous parking is a critical component for achieving safe and efficient urban autonomous driving. However, unstructured environments and dynamic interactions pose significant challenges to autonomous parking tasks. To address this problem, we propose SEG-Parking, a novel end-to-end offline reinforcement learning (RL) framework to achieve interaction-aware autonomous parking. Notably, a specialized parking dataset is constructed for parking scenarios, which include those without interference from the opposite vehicle (OV) and complex ones involving interactions with the OV. Based on this dataset, a goal-conditioned state encoder is pretrained to map the fused perception information into the latent space. Then, an offline RL policy is optimized with a conservative regularizer that penalizes out-of-distribution actions. Extensive closed-loop experiments are conducted in the high-fidelity CARLA simulator. Comparative results demonstrate the superior performance of our framework with the highest success rate and robust generalization to out-of-distribution parking scenarios. The related dataset and source code are available at https://github.com/Yeulerzzz/SEG-Parking.
|
| |
| 15:00-16:30, Paper TuI2I.77 | Add to My Program |
| Sketch2CAD: Generative Adversarial Network for Automated Conversion of Hand-Drawn Sketches to Parametric CAD Models |
|
| Wang, Xiaogang | Southwest University |
| YunCong, Liu | Southwest University |
| Zhang, Yu | Southwest University |
Keywords: Computer Vision for Manufacturing
Abstract: This paper addresses the labor-intensive process of converting imprecise hand-drawn sketches into precise, parametric CAD sketches. We present Sketch2CAD, a novel deep learning framework that leverages generative adversarial networks (GANs) to automate this conversion. Our approach consists of two main stages: first, a sketch correction module transforms freehand sketches into clean, standardized CAD-like sketches; second, a semantic segmentation module parses the generated sketches to identify and classify geometric primitives (lines, circles, arcs, points). We further introduce an optimized post-processing algorithm that extracts parametric primitives and infers geometric constraints from the segmentation results, enabling direct integration with commercial CAD software. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in both primitive accuracy (94.56%) and constraint recognition. This work provides a robust solution that reduces manual effort in CAD drafting while maintaining engineering precision, particularly suitable for robotics applications requiring rapid prototyping and design iteration.
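One closed form that a post-processing step of this kind can use (a generic geometric identity, not the paper's specific algorithm) is the exact circle through three points, turning segmented arc pixels into a parametric primitive:

```python
def circle_from_points(p1, p2, p3):
    """Exact circle (center, radius) through three non-collinear points --
    the kind of closed form used to lift segmented arc/circle pixels into
    parametric CAD primitives."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    ux = ((x1**2 + y1**2) * (y2 - y3) + (x2**2 + y2**2) * (y3 - y1)
          + (x3**2 + y3**2) * (y1 - y2)) / d
    uy = ((x1**2 + y1**2) * (x3 - x2) + (x2**2 + y2**2) * (x1 - x3)
          + (x3**2 + y3**2) * (x2 - x1)) / d
    r = ((x1 - ux) ** 2 + (y1 - uy) ** 2) ** 0.5
    return (ux, uy), r

center, radius = circle_from_points((1, 0), (0, 1), (-1, 0))
```

In practice one would fit over all segmented pixels (least squares) rather than three samples, but the parametric output is the same.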
|
| |
| 15:00-16:30, Paper TuI2I.78 | Add to My Program |
| Large-Scale Autonomous Vehicle Fleet Management |
|
| Mulumba, Timothy | New York University |
Keywords: Autonomous Vehicle Navigation, Multi-Robot Systems, Optimization and Optimal Control
Abstract: We present a hierarchical framework for city-scale autonomous ride-hailing that integrates vehicle prepositioning, request matching, charging, and facility ingress. A fine-grained mixed-integer program (MIP) coordinates prepositioning and matching on short horizons, while a coarse-grained “Deployment+Summoning” decomposition enforces charger/parking capacities at scale. On ride-hail traces, the method increases coverage and reduces wait relative to greedy and decoupled baselines, while keeping charger overuse near zero under rolling-horizon execution. We detail boundary-condition handling for 24/7 operations and specify a concrete RL training/validation protocol for a constraint-aware hybrid in which learned policies act tactically under a MIP-based safety shield.
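The matching step of the fine-grained layer can be illustrated on a toy instance (exhaustive search stands in for the MIP solver here; the cost matrix is invented for the example): each vehicle is assigned one request so that total cost, e.g. pickup travel time, is minimized.

```python
from itertools import permutations

def match_requests(cost):
    """Exhaustive min-cost vehicle-to-request matching: the tiny-instance
    analogue of the fine-grained MIP's matching decision. cost[v][r] is the
    cost of sending vehicle v to request r."""
    n = len(cost)
    best = None
    for perm in permutations(range(n)):  # perm[v] = request assigned to v
        c = sum(cost[v][perm[v]] for v in range(n))
        if best is None or c < best[0]:
            best = (c, perm)
    return best

cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
total, assignment = match_requests(cost)
```

At city scale this brute force is replaced by the MIP (or the Deployment+Summoning decomposition), but the objective being optimized is the same.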
|
| |
| 15:00-16:30, Paper TuI2I.79 | Add to My Program |
| Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference |
|
| Nguyen, Huy Dung | Capgemini Engineering |
| Bairouk, Anass | Capgemini |
| Maras, Mirjana | Capgemini |
| Xiao, Wei | MIT |
| Wang, Tsun-Hsuan | Massachusetts Institute of Technology |
| Chareyre, Patrick | Capgemini |
| Hasani, Ramin | Massachusetts Institute of Technology (MIT) |
| Blanchon, Marc | Capgemini Engineering |
| Rus, Daniela | MIT |
Keywords: Deep Learning for Visual Perception, Semantic Scene Understanding, Vision-Based Navigation
Abstract: Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we present a unified encoder trained across a diverse set of computer vision tasks essential for urban driving, including depth estimation, pose estimation, 3D scene flow estimation, and semantic, instance, panoptic, and motion segmentation. This single-encoder approach not only integrates these complementary visual cues, inspired by the diversity of visual cues used in human driving perception, but also enables a compact and inference-efficient model that embeds a rich, navigation-relevant latent space. Indeed, the unified encoder learns to embed multi-task knowledge into a shared representation, allowing for better downstream task adaptation, particularly for steering estimation. To ensure the efficient learning across tasks within a unified encoder, we propose a multi-scale pose decoder and employ knowledge distillation from a multi-backbone teacher model. Our experiments demonstrate that (1) the unified encoder achieves strong generalization across all visual tasks, comparable to state-of-the-art dedicated models, and (2) its frozen latent representations significantly outperform both fine-tuned models and ImageNet-pretrained baselines for steering estimation. These results underscore how multi-task feature learning, inspired by the diversity of perceptual cues used in human driving, offers an efficient and context-rich foundation for autonomous driving systems.
|
| |
| 15:00-16:30, Paper TuI2I.80 | Add to My Program |
| These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models |
|
| Ewen, Parker | University of Michigan |
| Chen, Hao | University of Michigan |
| Isaacson, Seth | University of Michigan |
| Wilson, Joseph | University of Michigan |
| Skinner, Katherine | University of Michigan |
| Vasudevan, Ram | University of Michigan |
Keywords: Deep Learning for Visual Perception, Visual Learning, View Planning for SLAM
Abstract: Uncertainty quantification is crucial for autonomous systems, enabling safe and robust decision making in tasks ranging from active perception to robotic planning. This paper introduces a novel approach to quantify uncertainty for radiance fields by deriving pixel-wise moment expressions from the rendering equation. While radiance fields offer powerful scene representations, their high dimensionality and complexity have historically made uncertainty quantification computationally prohibitive for real-time applications. This paper demonstrates that the probabilistic nature of the rendering process enables efficient and differentiable computation of higher-order moments for radiance field outputs, including color, depth, and semantic predictions. The proposed method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation without the need for post-processing. Beyond uncertainty quantification, this paper also illustrates the utility of the proposed approach in downstream applications such as next-best-view (NBV) selection and active ray sampling for neural radiance field training. Extensive experiments on both synthetic and real-world scenes demonstrate state-of-the-art performance, confirming that principled uncertainty quantification can be seamlessly integrated into radiance field pipelines without sacrificing efficiency or accuracy.
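The key observation, that volume rendering already defines a per-ray probability distribution, can be sketched directly (a minimal 1-ray version; the paper derives these moments for color and semantics as well): the compositing weights act as probabilities over sample depths, so mean and variance follow from the first two moments.

```python
def render_moments(alphas, depths):
    """Alpha-compositing weights along one ray, then the first two moments
    of depth under the discrete distribution those weights define.
    Assumes the ray is fully opaque (weights sum to 1); otherwise a
    background term must absorb the residual transmittance."""
    T = 1.0
    weights = []
    for a in alphas:
        weights.append(T * a)   # w_i = T_i * alpha_i
        T *= (1.0 - a)
    mean = sum(w * d for w, d in zip(weights, depths))
    second = sum(w * d * d for w, d in zip(weights, depths))
    return mean, second - mean * mean  # E[d], Var[d]

mean_d, var_d = render_moments([0.5, 1.0], [1.0, 3.0])
```

Because every term is a differentiable function of the densities, the variance can be used directly as a training or view-selection signal.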
|
| |
| 15:00-16:30, Paper TuI2I.81 | Add to My Program |
| Dual-Layer PIBT for Lifelong Multi-Agent Pathfinding with Narrow Passages |
|
| Song, Zhenyu | Zhejiang University |
| Zheng, Ronghao | Zhejiang University |
| Liu, Meiqin | Zhejiang University |
| Zhang, Senlin | Zhejiang University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Multi-Robot Systems
Abstract: Lifelong Multi-Agent Pathfinding (Lifelong MAPF) is an extension of the Multi-Agent Pathfinding (MAPF) problem. It has significant applications in scenarios such as warehouse logistics and delivery services. Narrow passages that restrict side-by-side traversal are common in such scenarios, posing a major challenge to the lifelong MAPF problem. To address this issue, this paper proposes dual-layer PIBT, a lifelong MAPF method specifically designed for biconnected environments containing narrow passages. The method leverages loop decomposition of the biconnected graph to establish coordinated unidirectional constraints: all narrow passages belonging to the same loop are assigned consistent traversal directions, enabling rapid conflict-free navigation decisions. The experimental results demonstrate significant reductions in both makespan and task service time compared to the baseline method.
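The unidirectional constraint can be sketched in isolation (a toy version, with the loop given as an ordered node list; the paper derives loops from the biconnected decomposition): every narrow passage lying on a loop inherits the loop's traversal orientation, so two agents can never meet head-on inside a passage.

```python
def orient_passages(loop, narrow_edges):
    """Assign each narrow passage on a loop the loop's traversal direction.
    loop: ordered list of nodes forming a cycle; narrow_edges: set of
    frozenset node pairs marking passages too narrow to pass side by side."""
    directions = {}
    for i, a in enumerate(loop):
        b = loop[(i + 1) % len(loop)]  # successor along the loop orientation
        edge = frozenset((a, b))
        if edge in narrow_edges:
            directions[edge] = (a, b)  # passage may only be traversed a -> b
    return directions

loop = ["A", "B", "C", "D"]
narrow = {frozenset(("B", "C")), frozenset(("D", "A"))}
dirs = orient_passages(loop, narrow)
```

Since all passages on the same loop share one orientation, agents circulating along the loop never need to reserve a passage in both directions.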
|
| |
| 15:00-16:30, Paper TuI2I.82 | Add to My Program |
| VividFace: Real-Time and Realistic Facial Expression Shadowing for Humanoid Robots |
|
| Li, Peizhen | Macquarie University |
| Cao, Longbing | Macquarie University |
| Wu, Xiao-Ming | Nanyang Technological University |
| Zhang, Yang | University of North Texas |
Keywords: Gesture, Posture and Facial Expressions, Emotional Robotics, Human and Humanoid Motion Analysis and Synthesis
Abstract: Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human–robot interaction. Existing progress in humanoid facial expression imitation remains limited, often failing to achieve either real-time performance or realistic expressiveness due to offline video-based inference designs and insufficient ability to capture and transfer subtle facial details. To address these limitations, we present VividFace, a real-time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework X2CNet++ enhances expressiveness by fine-tuning the human-to-humanoid facial motion transfer module and introducing a feature-adaptation training strategy for better alignment across different image sources. Real-time shadowing is further enabled by a video-stream-compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real-world demonstrations validate its practical utility. Videos are available at: https://lipzh5.github.io/VividFace/.
|
| |
| 15:00-16:30, Paper TuI2I.83 | Add to My Program |
| Multi-Robot Learning-Informed Task Planning under Uncertainty |
|
| Khanal, Abhish | George Mason University |
| Paudel, Abhishek | George Mason University |
| Pham, Hung | George Mason University |
| Stein, Gregory | George Mason University |
Keywords: Multi-Robot Systems, Planning under Uncertainty, Task Planning
Abstract: We want a multi-robot team to complete complex tasks in minimum time where the locations of task-relevant objects are not known. Effective task completion requires reasoning over long horizons about the likely locations of task-relevant objects, how individual actions contribute to overall progress, and how to coordinate team efforts. Planning in this setting is extremely challenging: even when task-relevant information is partially known, coordinating which robot performs which action and when is difficult, and uncertainty introduces a multiplicity of possible outcomes for each action, which further complicates long-horizon decision-making and coordination. To address this, we propose a multi-robot planning abstraction that integrates learning to estimate uncertain aspects of the environment with model-based planning for long-horizon coordination. We demonstrate the efficient multi-stage task planning of our approach for 1, 2, and 3 robot teams over competitive baselines in large ProcTHOR household environments. Additionally, we demonstrate the effectiveness of our approach with a team of two LoCoBot mobile robots in real household settings.
|
| |
| 15:00-16:30, Paper TuI2I.84 | Add to My Program |
| Leveraging Cost-Effective Robotics for K-12 STEM Education through Water Quality Monitoring Tasks |
|
| Mukherjee, Rishi | University of Minnesota |
| Ruiz, Andrew | Michigan Technological University |
| Henderson, Travis | CSE, UMN |
| Tejpaul, Resha | CSE, UMN |
| Simonson, Kris | High Tech Kids |
| Mulla, David | University of Minnesota |
| McNeill, Brian | University of Minnesota Extension 4-H Youth Development |
| Papanikolopoulos, Nikos | University of Minnesota |
| Sattar, Junaed | University of Minnesota |
Keywords: Education Robotics, Environment Monitoring and Management, Marine Robotics
Abstract: Engaging K-12 students in authentic scientific research remains a significant challenge, particularly at the intersection of environmental science and robotics. We introduce the Jar Jar ROV, a low-cost, open-source Remotely Operated Vehicle (ROV) platform designed for citizen science-based water quality monitoring by middle school students. This paper presents the design of the platform and the results of a large-scale deployment with over 100 students across a US state who built, programmed, and deployed the ROVs in local lakes. The educational framework yielded high student engagement in hands-on activities, with ROV construction earning a perfect average score from mentors. From a scientific standpoint, the program successfully established a grassroots monitoring network, generating nearly eleven thousand validated measurements of temperature, pH, dissolved oxygen, and turbidity. However, our evaluation identified a critical "engagement gap," with student interest declining sharply during more complex tasks such as electronics assembly and data uploading. This paper contributes both a validated, scalable model for integrating robotics into environmental education and a clear, data-driven roadmap for future improvements. These enhancements focus on lowering technical barriers and creating a more intuitive link between data collection and scientific discovery, addressing a key challenge in empowering the next generation of citizen scientists.
|
| |
| 15:00-16:30, Paper TuI2I.85 | Add to My Program |
| A Dual-Channel Framework for Blind Perceptual Quality Assessment in Bilateral Teleoperation |
|
| Wang, Zican | Technical University of Munich |
| Xu, Xiao | Technical University of Munich |
| Jin, Zhi | Sun Yat-Sen University |
| Yang, Dong | Technical University of Munich |
| Steinbach, Eckehard | Technical University of Munich |
Keywords: Haptics and Haptic Interfaces, Telerobotics and Teleoperation
Abstract: This paper proposes a perceptual no-reference (blind) haptic quality assessment framework for predicting the Quality of Experience (QoE) in teleoperation systems with force feedback. The proposed approach employs a deep neural network that combines semantic and distortion-based channels. The semantic network generates a semantic vector that characterizes the interaction between the robot and its environment. Meanwhile, the distortion network decomposes complex noise introduced by control algorithms and communication artifacts into artificial noise of known types. To train the proposed network, we also construct an augmented dataset for perceptual quality assessment in teleoperation based on the subjective experiments. The dataset augmentation and the model are validated with real-world teleoperation tasks. Our experimental results demonstrate that the performance of our No-Reference (NR) haptic quality assessment model is comparable to or surpasses that of commonly used Full-Reference (FR) methods, achieving Spearman’s Rank-Order Correlation scores above 0.85 for QoE prediction.
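The validation metric is standard and easy to state precisely. A minimal implementation (no tie handling, which suffices for continuous ratings; this is the generic definition, not the authors' evaluation code) ranks both score lists and computes Pearson correlation on the ranks:

```python
def spearman(x, y):
    """Spearman's rank-order correlation: Pearson correlation computed on
    ranks. Ties are broken arbitrarily (illustrative simplification)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

A score above 0.85, as reported, means predicted QoE preserves the ordering of the subjective ratings almost perfectly.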
|
| |
| 15:00-16:30, Paper TuI2I.86 | Add to My Program |
| Pose Retargeting from a Single RGB Camera: Optimization-Based Hand Pose Retargeting and Wrist Pose Estimation |
|
| Chen, Longrui | University of Leeds |
| Chen, Lipeng | Shanghai Jiao Tong University |
| Yao, Kunpeng | University of Leeds |
| Dogar, Mehmet R | University of Leeds |
Keywords: Telerobotics and Teleoperation, Grasping, Visual Tracking
Abstract: Robot teleoperation plays a crucial role in collecting data for large-scale imitation learning. Inferring the operator's hand pose is crucial for vision-based teleoperation, and current solutions rely on either additional neural network training or extra hardware to infer the operator's wrist pose. To our knowledge, there is no open-source, general teleoperation toolkit that can be easily deployed to retarget both hand and wrist poses from a single RGB camera. In this paper, we propose OAT (Optimization-based hAnd pose retargeting and wrisT pose estimation), a streamlined approach to retarget human hand and wrist poses to the robot. We leverage the off-the-shelf MediaPipe framework to estimate the operator's hand pose and employ an optimization-based method to infer the operator's wrist pose within the camera frame by 2D/3D hand joint matching. This integrated pipeline facilitates teleoperation from virtually any location using any device equipped with an RGB camera, offering a highly accessible and easily implementable solution. Furthermore, a hand-based camera calibration optimization is proposed to improve the accuracy of wrist pose estimation. In addition to minimal hardware requirements and deployment convenience, our system also demonstrates superior real-time performance compared to state-of-the-art vision-based teleoperation methods.
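The joint-matching subproblem has a well-known closed-form core. As a 2-D sketch (the paper solves the 3-D camera-frame problem iteratively; points as complex numbers and the example coordinates are illustrative), the rotation and translation best aligning estimated joints to model joints in the least-squares sense are:

```python
def rigid_align_2d(src, dst):
    """Least-squares rotation + translation mapping src onto dst, with 2-D
    points encoded as complex numbers. This is the closed form at the core
    of iterative joint-matching pose optimizers, shown in 2-D for brevity."""
    ca = sum(src) / len(src)                       # source centroid
    cb = sum(dst) / len(dst)                       # target centroid
    rot = sum((b - cb) * (a - ca).conjugate()      # optimal e^{i*theta}
              for a, b in zip(src, dst))
    rot /= abs(rot)                                # normalize to unit rotation
    trans = cb - rot * ca
    return rot, trans

# Three 'joints' rotated by 90 degrees and shifted by (1, 1):
src = [0 + 0j, 1 + 0j, 0 + 1j]
dst = [1 + 1j, 1 + 2j, 0 + 1j]
rot, trans = rigid_align_2d(src, dst)
```

Embedding this solve inside a reprojection loop over the 2D/3D joint correspondences recovers the wrist pose without extra sensors.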
|
| |
| 15:00-16:30, Paper TuI2I.87 | Add to My Program |
| How to Model Your Crazyflie Brushless |
|
| Gräfe, Alexander | RWTH Aachen University |
| Scherer, Christoph | Technical University Berlin |
| Hoenig, Wolfgang | TU Berlin |
| Trimpe, Sebastian | RWTH Aachen University |
Keywords: Aerial Systems: Mechanics and Control, Dynamics, Reinforcement Learning
Abstract: The Crazyflie quadcopter is widely recognized as a leading platform for nano-quadcopter research. In early 2025, the Crazyflie Brushless was introduced, featuring brushless motors that provide around 50% more thrust compared to the brushed motors of its predecessor, the Crazyflie 2.1. This advancement has opened new opportunities for research in agile nano-quadcopter control. To support researchers utilizing this new platform, this work presents a dynamics model of the Crazyflie Brushless and identifies its key parameters. Through simulations and hardware analyses, we assess the accuracy of our model. We furthermore demonstrate its suitability for reinforcement learning applications by training an end-to-end neural network position controller and learning a backflip controller capable of executing two complete rotations with a vertical movement of just 1.8 meters. This showcases the model’s ability to facilitate the learning of controllers and acrobatic maneuvers that successfully transfer from simulation to hardware. Utilizing this application, we investigate the impact of domain randomization on control performance, offering valuable insights into bridging the sim-to-real gap with the presented model. We have open-sourced the entire project, enabling users of the Crazyflie Brushless to swiftly implement and test their own controllers on an accurate simulation platform.
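The backbone of any such quadcopter model is the quadratic rotor thrust law. A minimal vertical-axis sketch (the mass and thrust coefficient below are placeholder values, not the identified Crazyflie Brushless parameters):

```python
import math

MASS = 0.04    # kg, placeholder vehicle mass (illustrative, not identified)
G = 9.81       # m/s^2
K_F = 2.0e-8   # N per (rad/s)^2 per rotor, placeholder thrust coefficient

def hover_speed():
    """Rotor speed at which four identical rotors balance gravity under the
    standard quadratic thrust model f = k_f * omega^2."""
    return math.sqrt(MASS * G / (4 * K_F))

def vertical_accel(omega):
    """Net vertical acceleration for all four rotors spinning at omega."""
    return 4 * K_F * omega ** 2 / MASS - G

omega_hover = hover_speed()
```

System identification fills in the real coefficients, after which the same model supports both simulation and learned-controller training.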
|
| |
| 15:00-16:30, Paper TuI2I.88 | Add to My Program |
| Robustness-Aware Tool Selection and Manipulation Planning with Learned Energy-Informed Guidance |
|
| Dong, Yifei | KTH |
| Zhang, Yan | Idiap Research Institute; EPFL |
| Calinon, Sylvain | Idiap Research Institute |
| Pokorny, Florian T. | KTH Royal Institute of Technology |
Keywords: Manipulation Planning
Abstract: Humans subconsciously choose robust ways of selecting and using tools, for example, choosing a ladle over a flat spatula to serve meatballs. However, robustness under external disturbances remains underexplored in robotic tool-use planning. This paper presents a robustness-aware method that jointly selects tools and plans contact-rich manipulation trajectories, explicitly optimizing for robustness against disturbances. At the core of our method is an energy-based robustness metric that guides the planner toward robust manipulation behaviors. We formulate a hierarchical optimization pipeline that first identifies a tool and configuration that optimizes robustness, and then plans a corresponding manipulation trajectory that maintains robustness throughout execution. We evaluate our method across three representative tool-use tasks. Simulation and real-world results demonstrate that our method consistently selects robust tools and generates disturbance-resilient manipulation plans.
|
| |
| 15:00-16:30, Paper TuI2I.89 | Add to My Program |
| Path Tracking Control for a Transformable Wheel-Legged Robot Using Model Predictive Control |
|
| Sun, Chongping | Dalian Maritime University |
| Zhao, Na | Dalian Maritime University |
| Zhao, Kaijie | Dalian Maritime University |
| Luo, Yudong | Dalian Maritime University |
| Shen, Yantao | University of Nevada, Reno |
Keywords: Discrete Event Dynamic Automation Systems, Foundations of Automation
Abstract: Transformable wheel-legged robots can adjust their configuration according to terrain conditions, enabling effective operation in harsh environments. While existing controllers based on preset commands have successfully demonstrated the feasibility of reconfigurable mechanisms, they still struggle to handle complex autonomous operations. To address this, we develop a comprehensive motion model for such robots, encompassing chassis kinematics, chassis-wheel kinematics, and stability models, along with a hierarchical path tracking method. The upper controller uses model predictive control with an error state-space model to optimize real-time tracking error under input constraints and generate desired commands. The lower controller utilizes feedforward control to convert desired inputs into actual ones, while accommodating physical constraints and geometric coupling associated with variable-radius wheels. Comparative analyses confirm the effectiveness of the proposed approach and demonstrate the robot's performance under different wheel modes.
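The error-state model the upper controller optimizes over can be sketched in its simplest form (a linearized lateral/heading error model at constant speed, driven here by a fixed feedback gain as a stand-in for the MPC solve; speed, step, and gains are illustrative):

```python
V, DT = 1.0, 0.1      # forward speed (m/s) and control period (s), illustrative
K = (0.5, 1.2)        # hand-tuned state feedback, stand-in for the MPC gain

def step(e_y, e_th):
    """One step of the discretized error-state model e+ = A e + B u:
    lateral error grows with heading error, heading error is steered by u."""
    u = -K[0] * e_y - K[1] * e_th      # feedback on the error state
    return e_y + V * DT * e_th, e_th + DT * u

e = (1.0, 0.0)        # start 1 m off the reference path
for _ in range(200):
    e = step(*e)
```

An MPC replaces the fixed gain with a receding-horizon optimization of the same model under input constraints, but the prediction dynamics are identical.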
|
| |
| 15:00-16:30, Paper TuI2I.90 | Add to My Program |
| Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation |
|
| Xu, Shaocong | Xiamen University |
| Wei, Songlin | University of Southern California |
| Wei, Qizhe | Beijing Institute of Technology |
| Geng, Zheng | Beijing Academy of Artificial Intelligence |
| Li, Hong | Beihang University |
| Licheng, Shen | AIR, Tsinghua University |
| Sun, Qianpu | Tsinghua University |
| Han, Shu | Wuhan University |
| Ma, Bin | Shaanxi University of Science and Technology |
| Li, Bohan | South China University of Technology |
| Ye, Chongjie | CUHK(SZ) |
| Zheng, Yuhang | National University of Singapore |
| Wang, Nan | Tongji University |
| Zhang, Saining | Nanyang Technological University |
| Zhao, Hao | Tsinghua University |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Data Sets for Robotic Vision
Abstract: Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences (1.32M frames) rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines (e.g., Depth-Anything-v2, DepthCrafter), and a normal variant (DKT-Normal) sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame (832×480). Integrated into a grasping stack, DKT’s depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: “Diffusion knows transparency.” Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
|
| |
| 15:00-16:30, Paper TuI2I.91 | Add to My Program |
| AI-Driven Adaptive Autonomy: Is AI Really Pervasive? Research Gaps from Bibliometric Assessment |
|
| Casini, Simona | University of Pisa |
| Caiti, Andrea | University of Pisa |
| Ducange, Pietro | University of Pisa |
| Marcelloni, Francesco | University of Pisa |
| Pollini, Lorenzo | University of Pisa |
Keywords: Human-Centered Robotics, Learning Categories and Concepts, Long term Interaction
Abstract: Artificial Intelligence is widely recognised as a driver of adaptive autonomy in robotics. Yet, the extent to which AI techniques truly permeate the functional architecture of autonomous systems is still only partially characterised. Existing bibliometric analyses typically map research themes, keywords or algorithms but provide limited insight into how contributions distribute across the functional logic of autonomous systems. This raises a fundamental question: is AI really pervasive across the functions that enable robots to act adaptively in complex environments? Which areas are mature or under-explored? To answer these questions, the paper adopts a functional, control-loop-oriented perspective that avoids the bias of vertical domains or robot-specific applications. More than 2500 scientific works, published in the last 25 years, were mapped across 13 functional modules using a multi-label neural classification pipeline, and analysed via co-occurrence and structural techniques. This approach highlights not only areas where AI is already known to be central and consistently confirmed, but also those where its impact would be expected to be significant yet remains surprisingly limited. By combining architectural reasoning with bibliometric evidence, the study provides a broader lens for assessing research gaps and for situating current advances within the long-term agenda of adaptive and human-centred autonomy.
|
| |
| 15:00-16:30, Paper TuI2I.92 | Add to My Program |
| Learning Semantic Priorities for Autonomous Target Search |
|
| Lodel, Max | Delft University of Technology |
| Wilde, Nils | Dalhousie University |
| Babuska, Robert | Delft University of Technology |
| Alonso-Mora, Javier | Delft University of Technology |
Keywords: Integrated Planning and Learning, Semantic Scene Understanding, Search and Rescue Robots
Abstract: The use of semantic features can improve the efficiency of target search in unknown environments for robotic search and rescue missions. Current target search methods rely on training with large datasets of similar domains, which limits the adaptability to diverse environments. However, human experts possess high-level knowledge about semantic relationships necessary to effectively guide a robot during target search missions in diverse and previously unseen environments. In this paper, we propose a target search method that leverages expert input to train a model of semantic priorities. By employing the learned priorities in a frontier exploration planner using combinatorial optimization, our approach achieves efficient target search driven by semantic features while ensuring robustness and complete coverage. The proposed semantic priority model is trained with several synthetic datasets of simulated expert guidance for target search. Simulation tests in previously unseen environments show that our method consistently achieves faster target recovery than a coverage-driven exploration planner.
|
| |
| 15:00-16:30, Paper TuI2I.93 | Add to My Program |
| MotionNet-PGA: MotionNet with Polar-Guided Attention for Moving Object Segmentation in Scanning Radar |
|
| Yuan, RenYi | National Yang Ming Chiao Tung University |
| Wang, Chieh-Chih | National Yang Ming Chiao Tung University |
| Lin, Wen-Chieh | National Yang Ming Chiao Tung University |
Keywords: Computer Vision for Transportation, Object Detection, Segmentation and Categorization, Intelligent Transportation Systems
Abstract: Moving object segmentation (MOS) is essential for autonomous driving, enabling robust detection, tracking, and prediction of dynamic agents in complex traffic scenarios. Radar sensors offer notable advantages for long-range sensing, but their lower spatial resolution, measurement noise, and geometric distortions—particularly for distant targets—pose significant challenges for accurate MOS. These limitations are amplified when detecting small objects such as scooters. In this work, we present MotionNet-PGA, a Polar-guided Attention Framework designed specifically for scanning radar-based MOS. Our method builds on the multi-frame motion encoding backbone of MotionNet, and introduces a polar-guided attention module to suppress clutter, enhance motion feature representation, and improve segmentation of small and distant targets. For evaluation, we construct and annotate the ITRI Radar Moving Object Segmentation Dataset. Experimental results demonstrate that our method surpasses the state-of-the-art baseline, MotionNet, by 2.48% in overall IoU and achieves a 4.08% improvement in small-object segmentation. These results highlight the effectiveness of polar-guided attention in addressing scanning radar-specific challenges.
|
| |
| 15:00-16:30, Paper TuI2I.94 | Add to My Program |
| Grounding Language Models with Semantic Digital Twins for Robotic Planning |
|
| Naeem, Mehreen | University of Bremen |
| Melnik, Andrew | University of Bremen |
| Beetz, Michael | University of Bremen |
Keywords: Failure Detection and Recovery, Task Planning
Abstract: We introduce a novel framework that integrates Semantic Digital Twins (SDTs) with Large Language Models (LLMs) to enable adaptive and goal-driven robotic task execution in dynamic environments. The system decomposes natural language instructions into structured action triplets, which are grounded in contextual environmental data provided by the SDT. This semantic grounding allows the robot to interpret object affordances and interaction rules, enabling action planning and real-time adaptability. In case of execution failures, the LLM utilizes error feedback and SDT insights to generate recovery strategies and iteratively revise the action plan. We evaluate our approach using tasks from the ALFRED benchmark, demonstrating robust performance across various household scenarios. The proposed framework effectively combines high-level reasoning with semantic environment understanding, achieving reliable task completion in the face of uncertainty and failure.
|
| |
| 15:00-16:30, Paper TuI2I.95 | Add to My Program |
| Approximated Collision Detection for Contact-Rich Dexterous Manipulation with Nonnegative Least Squares |
|
| Li, Weibing | Sun Yat-Sen University |
| Luo, Jiajun | Sun Yat-Sen University |
| Yang, Lei | Sun Yat-Sen University |
| Li, Yehui | Sun Yat-Sen University |
| Huang, Kai | Sun Yat-Sen University |
Keywords: Dexterous Manipulation, In-Hand Manipulation, Optimization and Optimal Control
Abstract: Collision detection between robotic hands and manipulated objects is crucial to model predictive control (MPC) for contact-rich dexterous manipulation. Based on the Gilbert-Johnson-Keerthi (GJK) algorithm and the expanding polytope algorithm (EPA), the GJK-EPA method has achieved success while requiring iterative optimizations. Recently, a signed distance function (SDF) based collision detection (C-SDF) method has been used to estimate the contact information, which avoids iterations at the cost of matrix derivative operations. Inspired by this, in this paper, a simplified nonnegative least squares (NNLS) based quadratic programming (QP) algorithm is used to construct an approximated solution to the QP formulation of collision detection, for estimating collision points. Then, contact distances and Jacobians are calculated via physics computations and differentiable kinematics. Consequently, a C-NNLS method is proposed, which uses an NNLS formulation to approximate the collision detection routine in the MPC while avoiding iterative optimizations and matrix derivatives. The C-NNLS method is applied to extensive simulation tasks, achieving lower average error while consuming 45.59% less time on average compared with the C-SDF method. Furthermore, the C-NNLS method is deployed on a real Allegro hand for on-palm reorientation. Results show that the C-NNLS method reduces average task time by 30.33% compared with the C-SDF method while maintaining high-quality dexterous manipulation.
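The nonnegative least squares building block named in this abstract is available off the shelf. The sketch below is our own generic illustration (a toy overdetermined system, not the paper's collision-detection QP): it shows how a QP of the form min ||Ax - b||² with x ≥ 0 is solved without iterative tuning by the caller.

```python
import numpy as np
from scipy.optimize import nnls

# Illustrative only: a generic nonnegative least-squares problem
#   min_x 0.5 * ||A x - b||^2   subject to   x >= 0,
# the kind of QP approximation the C-NNLS abstract describes.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
b = np.array([1.0, 2.0, 1.0])

# scipy's nnls returns the nonnegative minimizer and the residual norm.
x, residual = nnls(A, b)
```

Because the nonnegativity constraint is handled inside `nnls`, the caller never manages an iterative optimizer, which is the efficiency argument the abstract makes against GJK-EPA-style loops.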
|
| |
| 15:00-16:30, Paper TuI2I.96 | Add to My Program |
| GeoISF: Instance Semantic Forest Inspired Large-Scale Cross-View Geo-Localization Via Ground LiDAR-To-Satellite Image |
|
| Hu, Di | Nanjing University of Science and Technology |
| Yuan, Xia | Nanjing University of Science and Technology |
| Zhao, Chun-xia | Nanjing University of Science and Technology |
Keywords: Semantic Scene Understanding, Localization, Range Sensing
Abstract: The problem of localization on a large-scale satellite image given a frame of query ground view point clouds remains challenging. Existing LiDAR-to-image cross-view localization methods struggle in large-scale scenarios due to limited semantic alignment and the modality gap between point clouds and satellite images. This paper introduces the large-scale LiDAR-to-image geo-localization pipeline called GeoISF. GeoISF introduces an instance semantic forest constructed using WordNet, which enhances temporal semantic representation and discriminative power by integrating semantic trees from multiple frames. By leveraging environmental semantic representation as a shared medium, GeoISF effectively bridges the modality gap and improves semantic matching accuracy. Extensive experiments demonstrate the superior performance of GeoISF in large-scale cross-view localization, which achieves a 13.22-fold improvement compared to the parallel LiDAR-to-image method in the R@10 metric on the KITTI dataset. The proposed method addresses the existing gap in large-scale LiDAR-to-image cross-view localization, offering a robust solution to the computational and accuracy challenges inherent in such scenarios. We will release the code as an open-source resource available online for the broader research community.
|
| |
| 15:00-16:30, Paper TuI2I.97 | Add to My Program |
| Voice-Driven Assistance and Resistance Modulation in a Soft Hip Exosuit Using a Transformer-Based Speech Recognition Model |
|
| Tricomi, Enrica | Technical University of Munich (TUM) |
| Lindner, Daniel | Technical University of Munich (TUM) |
| Zhang, Xiaohui | Technical University of Munich (TUM) |
| Miskovic, Luka | Technical University of Munich (TUM) |
| Masia, Lorenzo | Technical University of Munich (TUM) |
Keywords: Wearable Robotics, Physically Assistive Devices, Modeling, Control, and Learning for Soft Robots
Abstract: Intuitive human–robot interfaces are essential to increase usability and personalization in wearable robotic assistive technologies. However, most current systems rely on pre-programmed or sensor-driven strategies that offer limited active user control online. To address this limitation, we present a voice-driven control framework for a soft hip exosuit, enabling on-demand modulation of assistance and resistance via short spoken commands. The system combines a fully embedded transformer-based automatic speech recognition model (Whisper) with a gait-phase estimator to synchronize actuation with the user’s motion. Users can switch between assistive and resistive modes and select discrete gain levels (low, medium, high). Experiments with six healthy participants demonstrate high recognition accuracy (95-100%) and low latency (∼9 ms). Metabolic measurements show that assistive commands reduced walking energy cost by 20.9±4.8% (LOW) and 9.7±5.5% (MEDIUM) relative to baseline, while resistive commands increased cost by 13.1±3.5% (MEDIUM) and 14.9±5.1% (HIGH). These results highlight the feasibility of intuitive, voice-driven modulation in wearable robotics.
|
| |
| 15:00-16:30, Paper TuI2I.98 | Add to My Program |
| Tuning Hydrodynamic Coefficients Using a Genetic Algorithm for a Numerical Model of a Bio-Robotic Sea Lion |
|
| Kadapa, Shraman | Drexel University |
| Marcouiller, Nicholas | Drexel University |
| Drago, Anthony | Drexel University |
| Tangorra, James | Drexel University |
| Kwatny, Harry | Drexel University |
Keywords: Biologically-Inspired Robots, Simulation and Animation, Dynamics
Abstract: Bio-inspired swimming vehicles are increasingly being developed to understand the locomotion strategies of aquatic animals and to expand the performance envelope of engineered systems. However, the increasing complexity of these multi-segmented vehicles makes it challenging to understand and optimize their performance. Accurate numerical models of these systems can provide a pathway forward, but their fidelity depends critically on reliable estimation of hydrodynamic coefficients. Traditional approaches to estimating these coefficients, such as tow-tank testing, can be costly and often impractical. In this work, a numerical model of a bio-robotic sea lion was developed and validated, in which hydrodynamic coefficients critical for estimating fluid forces were first obtained through computational fluid dynamics (CFD) simulations and analytical methods such as strip theory. These coefficients were then refined using a genetic algorithm to improve agreement with experimental trials of the robot. This hybrid framework bridges the gap between simulation and reality, enabling accurate force estimation across different body segments. Validation experiments showed a close alignment between the numerical model and the physical robot's performance in position and orientation during various trials. The validated model could enable large-scale parametric studies to evaluate the effectiveness of different control surfaces, optimize gaits, and explore control strategies without extensive prototyping of the bio-robotic platform. Beyond design and analysis, the model can also provide a high-fidelity environment for the application of reinforcement learning, supporting the development of adaptive controllers and advancing bio-inspired robots toward autonomous operation.
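The coefficient-refinement step this abstract describes can be sketched generically. The toy below is our own minimal genetic algorithm (the exponential decay model, population sizes, and mutation scale are all our assumptions, not the paper's): it refines a single drag-like coefficient until a model trajectory matches synthetic "measured" data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for experimental trials: a speed profile
# generated by a known coefficient that the GA must recover.
t = np.linspace(0.0, 2.0, 21)
CD_TRUE = 0.8
measured = np.exp(-CD_TRUE * t)

def fitness(cd):
    # Negative squared error between the model and the measurement;
    # higher is better, maximum at cd == CD_TRUE.
    return -np.sum((np.exp(-cd * t) - measured) ** 2)

# Minimal GA: truncation selection plus Gaussian mutation.
pop = rng.uniform(0.1, 2.0, size=40)
for _ in range(60):
    scores = np.array([fitness(c) for c in pop])
    parents = pop[np.argsort(scores)][-10:]                  # keep the best 10
    children = rng.choice(parents, size=30) + rng.normal(0, 0.05, 30)
    pop = np.concatenate([parents, np.clip(children, 0.01, 3.0)])

best = pop[np.argmax([fitness(c) for c in pop])]
```

A real pipeline would replace `fitness` with the error between the numerical model and tow-tank or free-swimming trial data, and evolve a vector of coefficients per body segment rather than a scalar.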
|
| |
| 15:00-16:30, Paper TuI2I.99 | Add to My Program |
| Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration |
|
| Enan, Sadman Sakib | University of Minnesota, Twin Cities |
| Sattar, Junaed | University of Minnesota |
Keywords: Marine Robotics, Human-Robot Collaboration, Human-Robot Teaming
Abstract: Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically-guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human–robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.
|
| |
| 15:00-16:30, Paper TuI2I.100 | Add to My Program |
| GaussianFormer3D: Multi-Modal Gaussian-Based Semantic Occupancy Prediction with 3D Deformable Attention |
|
| Zhao, Lingjun | Georgia Institute of Technology |
| Wei, Sizhe | Georgia Institute of Technology |
| Hays, James | Georgia Institute of Technology, Argo AI |
| Gan, Lu | Georgia Institute of Technology |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Sensor Fusion
Abstract: 3D semantic occupancy prediction is essential for achieving safe, reliable autonomous driving and robotic navigation. Compared to camera-only perception systems, multi-modal pipelines, especially LiDAR-camera fusion methods, can produce more accurate and fine-grained predictions. Although voxel-based scene representations are widely used for semantic occupancy prediction, 3D Gaussians have emerged as a continuous and significantly more compact alternative. In this work, we propose a multi-modal Gaussian-based semantic occupancy prediction framework utilizing 3D deformable attention, namely GaussianFormer3D. We introduce a voxel-to-Gaussian initialization strategy that provides 3D Gaussians with accurate geometry priors from LiDAR data, and design a LiDAR-guided 3D deformable attention mechanism to refine these Gaussians using LiDAR-camera fusion features in a lifted 3D space. Extensive experiments on real-world on-road and off-road autonomous driving datasets demonstrate that GaussianFormer3D achieves state-of-the-art prediction performance with reduced memory consumption and improved efficiency. Project website: https://lunarlab-gatech.github.io/GaussianFormer3D/.
|
| |
| 15:00-16:30, Paper TuI2I.101 | Add to My Program |
| TERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection |
|
| Yoon, Jiae | Gwangju Institute of Science and Technology |
| Kim, Ue-Hwan | Gwangju Institute of Science and Technology (GIST) |
Keywords: Semantic Scene Understanding, Recognition
Abstract: In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that limit iterative refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder–Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate GRU decoder that performs iterative refinement, and a combined convolution–interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet’s potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is at https://github.com/AutoCompSysLab/TERDNet.
|
| |
| 15:00-16:30, Paper TuI2I.102 | Add to My Program |
| SPREAD: Subspace Representation Distillation for Lifelong Imitation Learning |
|
| Roy, Kaushik | CSIRO |
| D'urso, Giovanni Salvatore | University of Technology Sydney |
| Lawrance, Nicholas | CSIRO Data61 |
| Tidd, Brendan | CSIRO |
| Moghadam, Peyman | CSIRO |
Keywords: Continual Learning, Incremental Learning, Imitation Learning
Abstract: A central challenge in lifelong imitation learning (LIL) is enabling agents to acquire new skills from expert demonstrations while retaining knowledge of previously learned tasks. Achieving this requires preserving the low-dimensional manifolds and geometric structures that underlie task representations across sequential learning. However, existing distillation methods, which rely on L2-norm feature matching in the raw feature space, are highly sensitive to noise and high-dimensional variations, often failing to preserve the intrinsic task manifolds. To overcome these limitations, we introduce SPREAD, a geometry-preserving framework that leverages singular value decomposition (SVD) to align the representations of policies from consecutive tasks within low-rank subspaces. This subspace alignment preserves the intrinsic low-dimensional geometry of multimodal features, thereby facilitating stable knowledge transfer, enhancing robustness, and improving generalization across tasks. In addition, we propose a confidence-guided policy distillation strategy that applies a Kullback–Leibler divergence loss restricted to the top-M most confident action samples, emphasizing reliable action modes and improving optimization stability. Empirical results on the LIBERO benchmark demonstrate that SPREAD significantly improves knowledge transfer across tasks, mitigates catastrophic forgetting, and achieves superior overall performance compared to state-of-the-art LIL methods.
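The subspace-alignment idea in this abstract can be illustrated compactly. The snippet below is our own sketch of the general technique (the loss form, rank choice, and matrix sizes are our assumptions, not SPREAD's actual objective): features from the old and new policies are compared through their top-k singular subspaces rather than by raw L2 feature matching.

```python
import numpy as np

rng = np.random.default_rng(1)

def topk_subspace(F, k):
    # Left singular vectors spanning the top-k subspace of feature matrix F.
    U, _, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :k]

def subspace_distillation_loss(F_old, F_new, k=4):
    # Compare projection matrices onto each top-k subspace; this is
    # invariant to sign/rotation of singular vectors and to noise in
    # the discarded directions, unlike raw L2 feature matching.
    U_old = topk_subspace(F_old, k)
    U_new = topk_subspace(F_new, k)
    P_old = U_old @ U_old.T
    P_new = U_new @ U_new.T
    return np.linalg.norm(P_old - P_new)

F = rng.normal(size=(16, 64))       # 16 feature dims, 64 samples (toy sizes)
loss_same = subspace_distillation_loss(F, F)
```

Because only the span of the leading directions enters the loss, high-dimensional perturbations outside the task manifold do not contribute, which is the robustness argument the abstract makes against raw-feature distillation.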
|
| |
| 15:00-16:30, Paper TuI2I.103 | Add to My Program |
| DSSM-SG: Dynamic 3D Scene Graphs with Spatio-Semantic Memory for Long-Term Indoor Navigation Tasks |
|
| Ruan, Yi | Beijing Institute of Technology |
| Zhang, Yaowen | Beijing Institute of Technology |
| Pan, Miaoxin | Beijing Institute of Technology |
| Yang, Yi | Beijing Institute of Technology |
| Fu, Mengyin | Beijing Institute of Technology |
Keywords: Semantic Scene Understanding, Embodied Cognitive Science, Motion and Path Planning
Abstract: Dynamic indoor environments pose significant challenges for autonomous robots, as objects frequently move and scenes continuously change, requiring robust scene representation and adaptive navigation strategies. In this work, we introduce DSSM-SG, a dynamic open-vocabulary 3D scene graph framework enhanced with spatial-semantic memory, to support complex language instruction parsing and goal navigation in dynamic environments. First, we construct a multi-layered scene graph by combining waypoint topology with semantic object information, and propose a viewpoint-based mechanism to model object dynamics and detect scene changes, enabling more precise semantic-geometric representation. Second, we design an efficient incremental graph update strategy that adapts to object-level dynamics and navigation-observed obstacles, thereby maintaining graph consistency and alleviating mismatch during re-navigation. Finally, we introduce a subgraph generation and matching approach driven by large language models, significantly improving the system's ability to interpret and ground ambiguous goal descriptions. Experimental results demonstrate that DSSM-SG achieves superior performance in scene graph accuracy, update efficiency, and language goal navigation success compared to existing baselines in dynamic indoor environments.
|
| |
| 15:00-16:30, Paper TuI2I.104 | Add to My Program |
| DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning |
|
| Yuan, Tianyuan | Tsinghua University |
| Liu, Yicheng | Tsinghua University |
| Lu, Chenhao | Tsinghua University |
| Chen, Zhuoguang | Tsinghua University |
| Jiang, Tao | Tsinghua |
| Zhao, Hang | Tsinghua University |
Keywords: AI-Enabled Robotics, Learning from Demonstration, Perception for Grasping and Manipulation
Abstract: Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to the limited spatial understanding inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.
|
| |
| 15:00-16:30, Paper TuI2I.105 | Add to My Program |
| MO-SeGMan: Rearrangement Planning Framework for Multi-Objective Sequential and Guided Manipulation in Constrained Environments |
|
| Tuncer, Cankut Bora | Bilkent University |
| Toussaint, Marc | TU Berlin |
| Oguz, Ozgur S. | Bilkent University |
Keywords: Task and Motion Planning, Motion and Path Planning, Manipulation Planning
Abstract: In this work, we introduce MO-SeGMan, a Multi-Objective Sequential and Guided Manipulation planner for highly constrained rearrangement problems. MO-SeGMan generates object placement sequences that minimize both replanning per object and robot travel distance while preserving critical dependency structures with a lazy evaluation method. To address highly cluttered, non-monotone scenarios, we propose a Selective Guided Forward Search (SGFS) that efficiently relocates only critical obstacles to feasible relocation points. Furthermore, we adopt a refinement method for adaptive subgoal selection to eliminate unnecessary pick-and-place actions, thereby improving overall solution quality. Extensive evaluations on nine benchmark rearrangement tasks demonstrate that MO-SeGMan generates feasible motion plans in all cases, consistently achieving faster solution times and superior solution quality compared to the baselines. These results highlight the robustness and scalability of the proposed framework for complex rearrangement planning problems.
|
| |
| 15:00-16:30, Paper TuI2I.106 | Add to My Program |
| X-MOS: A Heterogeneous Cross-LiDAR Generalization Framework for Moving Object Segmentation |
|
| Lee, Minjae | Gyeongsang National University |
| Ha, Ilhwan | Gyeongsang National University |
| Choi, Sang-Min | Gyeongsang National University |
| Kim, Gun-Woo | Gyeongsang National University, South Korea |
| Lee, Suwon | Gyeongsang National University |
Keywords: Computer Vision for Transportation, Sensor Fusion, Object Detection, Segmentation and Categorization
Abstract: Moving object segmentation (MOS) is foundational for autonomous vehicle safety. However, the increasing diversity of LiDAR sensors creates a significant domain shift problem, causing models trained on one sensor to perform poorly when deployed on another. A naive approach of training on combined data from heterogeneous sensors leads to a biased model that favors high-density sensors while failing on sparse, low-resolution sensors. To address this issue, we propose X-MOS, a novel generalization framework based on multi-teacher knowledge distillation. X-MOS generates sensor-specific expert teacher models and employs a sensor-aware knowledge distillation strategy. This strategy uses the sensor type as privileged information to activate the most appropriate teacher at each training step, providing unambiguous learning signals to a single student model. Extensive experiments on the HeLiMOS dataset, which comprises four different LiDAR sensors, demonstrate the effectiveness of our framework. X-MOS mitigates training bias and achieves an overall test mIoU of 0.717, outperforming both naive training and the best individual expert teacher. Notably, it more than doubles the performance on the most challenging low-channel sensor. Furthermore, our model exhibits strong zero-shot generalization to unseen datasets with similar sensor types. This work provides a robust and scalable methodology for achieving cross-sensor generalization, which is foundational for more practical and adaptable perception systems in autonomous driving.
|
| |
| 15:00-16:30, Paper TuI2I.107 | Add to My Program |
| Fully Polar Coordinate Object Detection: A Constraint-Based Polar Bounding Box Approach for LiDAR and Scanning Radar |
|
| Lin, ShuHeng | National Yang Ming Chiao Tung University |
| Wang, Chieh-Chih | National Yang Ming Chiao Tung University |
| Lin, Wen-Chieh | National Yang Ming Chiao Tung University |
Keywords: Computer Vision for Transportation, Deep Learning Methods, Object Detection, Segmentation and Categorization
Abstract: Polar coordinates are widely used in segmentation tasks for range sensors such as LiDAR and radar, owing to their ability to naturally align with point cloud sparsity and distribution. However, their use in detection is limited by feature distortion. Existing polar-based detection works have focused on undistorting features from polar coordinates back to canonical Cartesian representations, but with limited success. In this work, we propose fully polar coordinate object detection, performing training and evaluation entirely in polar coordinates without relying on Cartesian metrics. To achieve this, we design a constraint-based polar bounding box representation that enables the direct conversion of Cartesian bounding boxes via a constrained minimum bounding rectangle (MBR). Using the state-of-the-art polar-based detector as our baseline, we conduct experiments on the Boreas dataset. The results demonstrate that our approach improves the LiDAR detection AP30 metric by 2.88%, and yields a 2.17% gain over Cartesian-based detection methods. On more challenging scanning radar detection experiments, our method achieves a 13.11% improvement in AP30 compared to Cartesian-based detection methods. These findings validate the feasibility of fully polar coordinate object detection and demonstrate its robustness and generalizability across multiple range sensor modalities.
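The basic coordinate conversion behind a polar bounding box can be sketched in a few lines. This is our own illustration, not the paper's constrained-MBR construction: it bounds a Cartesian box's corners by a [r_min, r_max] × [az_min, az_max] region, under the simplifying assumptions that the box neither straddles the ±π azimuth cut nor contains the sensor origin.

```python
import numpy as np

def cartesian_box_to_polar_mbr(corners):
    """Bound box corners in polar coordinates (our sketch, with the
    assumptions stated above). Note that corner sampling is approximate:
    the closest point of a near face can lie between corners, so r_min
    here can slightly overestimate the true minimum range."""
    xs, ys = corners[:, 0], corners[:, 1]
    r = np.hypot(xs, ys)            # range of each corner
    az = np.arctan2(ys, xs)         # azimuth of each corner
    return r.min(), r.max(), az.min(), az.max()

# Unit square centered at (5, 0), symmetric about the sensor's x-axis.
corners = np.array([[4.5, -0.5], [4.5, 0.5], [5.5, -0.5], [5.5, 0.5]])
r0, r1, a0, a1 = cartesian_box_to_polar_mbr(corners)
```

A constrained MBR of the kind the abstract describes would additionally enforce box-shape constraints in the polar frame rather than taking raw per-corner extrema.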
|
| |
| 15:00-16:30, Paper TuI2I.108 | Add to My Program |
| When Birds Meet Fish: Vision-Force Fusion for Autonomous Underwater Docking in Cross-Domain Avian-Aquatic Collaboration |
|
| Liu, Hongchang | Tongji University |
| Wang, Ruiheng | Tongji University |
| Jiang, Yongkang | Tongji University |
| Shenli, Zhang | Tongji University |
| Zhao, Xiangdan | Tongji |
| Xu, Xin | Tongji University |
| Ding, Yulong | Tongji University |
| Lyu, Feng | Tongji University |
| Wang, Zhipeng | Tongji University |
| Zhou, Yanmin | Tongji University |
| He, Bin | Tongji University |
Keywords: Field Robots, Soft Robot Applications, Marine Robotics
Abstract: Unmanned aerial–aquatic vehicles (UAAVs) provide cross-domain adaptability and broad visual coverage, while autonomous underwater vehicles (AUVs) support long-duration operations. This work integrates the two by developing a rapid underwater docking and release system. An autonomous clamping mechanism is designed to anchor UAAVs under varied landing attitudes, and a vision–tactile state perception algorithm based on decision-level dual-modal fusion is proposed to enable reliable underwater docking without requiring communication between the UAAV and AUV. Experimental results validate autonomous perception and reliable docking in fully underwater environments, achieving a docking time of 6 s and a landing gear recognition accuracy of 3 mm. The proposed framework offers an efficient solution for aerial–aquatic cooperation, advancing cross-domain robotic platforms for ocean monitoring, emergency response, and underwater exploration.
|
| |
| 15:00-16:30, Paper TuI2I.109 | Add to My Program |
| TACC: Multi-Agent Reinforcement Learning for Task Allocation with Communication Coordination in UAV Swarms |
|
| Xiong, Zehao | National University of Defense Technology |
| Xi, YeXun | National University of Defense Technology |
| Cao, Yizhe | College of Intelligence Science and Technology, National University of Defense Technology, Changsha, Hunan |
| Li, Chuan | National University of Defense Technology |
| Li, Rong | National University of Defense Technology |
| Shen, Lixian | National University of Defense Technology |
| Li, Jie | National University of Defense Technology |
|
|
| |
| 15:00-16:30, Paper TuI2I.110 | Add to My Program |
| Soft Robots Grow a Spine: Origami-Inspired Folding Endoskeletal Support for Motion and Stiffness Control of Inflatable Robots |
|
| Min, Junil | Carnegie Mellon University |
| Robertson, Matthew | Queen's University |
Keywords: Soft Robot Materials and Design, Tendon/Wire Mechanism, Humanoid Robot Systems
Abstract: We present a deployable inflatable robotic torso with an origami-inspired spine, designed to combine the inherent compliance of soft robots with the controllability of skeletal structures. Unlike simple inflatable cylinders, which deform unpredictably through membrane buckling, our approach embeds a foldable spine that defines discrete bending axes and enables repeatable motion. Pneumatic inflation provides compact self-deployment, external stiffening, and a compliant outer shell that serves as a protective contact interface, while tendon actuation delivers precise, joint-level control. Experiments demonstrate that the torso replicates and in some cases exceeds human spinal range of motion, and that combined tendon–pneumatic actuation doubles lateral stiffness compared to pneumatics alone. We further characterize stiffness–motion trade-offs across pressures, showing tunable performance relevant to contact-rich operation. This integration of origami endoskeletons with inflatable bodies advances deployable humanoid-scale robots, addressing the gap between compliant contact behavior and controlled movement.
|
| |
| 15:00-16:30, Paper TuI2I.111 | Add to My Program |
| Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-The-Wild Human Demonstrations |
|
| Guzey, Irmak | New York University |
| Qi, Haozhi | UC Berkeley |
| Urain, Julen | TU Darmstadt |
| Wang, Changhao | University of California, Berkeley |
| Yin, Jessica | Meta |
| Bodduluri, Chaithanya Krishna | Meta Platforms |
| Lambeta, Mike Maroje | Facebook |
| Pinto, Lerrel | New York University |
| Rai, Akshara | Facebook AI Research |
| Malik, Jitendra | UC Berkeley |
| Wu, Tingfan | Meta AI |
| Sharma, Akash | Carnegie Mellon University |
| Bharadhwaj, Homanga | Meta Reality Labs |
Keywords: Dexterous Manipulation, Perception for Grasping and Manipulation, Human Detection and Tracking
Abstract: Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot interaction data collection. Despite substantial efforts, progress toward this goal has been bottlenecked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework Aina, we are now one significant step closer to achieving this dream. Aina enables learning multi-fingered policies from in-the-wild data using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the environment. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across five everyday manipulation tasks. Robot rollouts can be best viewed on our website: aina-robot.github.io.
|
| |
| 15:00-16:30, Paper TuI2I.112 | Add to My Program |
| Overlapping Domain Decomposition for Distributed Pose Graph Optimization |
|
| Sonawalla, Aneesa | Massachusetts Institute of Technology |
| Tian, Yulun | University of Michigan |
| How, Jonathan | Massachusetts Institute of Technology |
Keywords: Multi-Robot SLAM, SLAM, Localization
Abstract: We present ROBO (Riemannian Overlapping Block Optimization), a distributed and parallel approach to multi-robot pose graph optimization (PGO) based on the idea of overlapping domain decomposition. ROBO offers a middle ground between centralized and fully distributed solvers, where the amount of pose information shared between robots at each optimization iteration can be set according to the available communication resources. Sharing additional pose information between neighboring robots effectively creates overlapping optimization blocks in the underlying pose graph, which substantially reduces the number of iterations required to converge. Through extensive experiments on benchmark PGO datasets, we demonstrate the applicability and feasibility of ROBO in different initialization scenarios, using various cost functions, and under different communication regimes. We also analyze the tradeoff between the increased communication and local computation required by ROBO’s overlapping blocks and the resulting faster convergence. We show that overlaps with an average inter-robot data cost of only 36 Kb per iteration can converge 3.1x faster in terms of iterations than state-of-the-art distributed PGO approaches. Furthermore, we develop an asynchronous variant of ROBO that is robust to network delays and suitable for real-world robotic applications.
|
| |
| 15:00-16:30, Paper TuI2I.113 | Add to My Program |
| Standing Tall: Sim to Real Fall Classification and Lead Time Prediction for Bipedal Robots |
|
| Prabhakaran, Gokul | University of Michigan |
| Grizzle, J.W | University of Michigan |
| Mungai, M. Eva | University of Michigan |
Keywords: Failure Detection and Recovery, Humanoid Robot Systems, Hardware-Software Integration in Robotics
Abstract: This paper extends a previously proposed fall prediction algorithm to a real-time (online) setting, with implementations in both hardware and simulation. The system is validated on the full-sized bipedal robot Digit, where the real-time version achieves performance comparable to the offline implementation while maintaining a zero false positive rate, an average lead time (defined as the difference between the true and predicted fall time) of 1.1s (well above the required minimum of 0.2s), and a maximum lead time error of just 0.03s. It also achieves a high recovery rate of 0.97, demonstrating its effectiveness in real-world deployment. In addition to the real-time implementation, this work identifies key limitations of the original algorithm, particularly under omnidirectional faults, and introduces a fine-tuned strategy to improve robustness. The enhanced algorithm shows measurable improvements across all evaluated metrics, including a 0.05 reduction in average false positive rate and a 1.19s decrease in the maximum error of the average predicted lead time.
|
| |
| 15:00-16:30, Paper TuI2I.114 | Add to My Program |
| Search Strategy for Layered Peg-In-Hole Using Dual Manipulator System |
|
| Lee, Haeseong | Seoul National University |
| Lee, Yonghee | Seoul National University |
| Yoon, Junheon | Seoul National University |
| Park, Jaeheung | Seoul National University |
Keywords: Assembly, Dual Arm Manipulation, Industrial Robots
Abstract: This study introduces a novel assembly task, the layered peg-in-hole, and proposes a search strategy for it. Distinct from the traditional peg-in-hole assembly, where two workpieces (i.e., peg/hole) are present, the layered peg-in-hole assembly contains three workpieces (i.e., peg, hole, and through-hole). To handle the additional moving part, the through-hole, without relying on task-specific devices, a dual-manipulator system is preferable to a single-manipulator system. However, existing research has primarily concentrated on the traditional peg-in-hole assembly regardless of the number of manipulators. In this respect, as the main contribution, a search strategy consisting of two phases is proposed for the layered peg-in-hole assembly. In the first phase, both manipulators actively engage in the search task. In the second phase, the compliant behavior of the manipulator grasping the through-hole is exploited to assist its counterpart. Finally, the proposed search strategy is verified through a real-robot experiment that replicates an industrial environment with two 7-DOF torque-controlled manipulators.
|
| |
| 15:00-16:30, Paper TuI2I.115 | Add to My Program |
| Semantically Consistent Language Gaussian Splatting for 3D Point-Level Open-Vocabulary Querying |
|
| Yin, Hairong | Purdue University |
| Zhan, Huangying | Goertek Alpha Labs |
| Xu, Yi | OPPO US Research Center |
| Yeh, Raymond | Purdue University |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization
Abstract: Open-vocabulary 3D scene understanding is crucial for robotics applications, such as natural language-driven manipulation, human-robot interaction, and autonomous navigation. Existing methods for querying 3D Gaussian Splatting often struggle with inconsistent 2D mask supervision and lack a robust 3D point-level retrieval mechanism. In this work, (i) we present a novel point-level querying framework that performs tracking on segmentation masks to establish a semantically consistent ground-truth for distilling the language Gaussians; (ii) we introduce a GT-anchored querying approach that first retrieves the distilled ground-truth and subsequently uses the ground-truth to query the individual Gaussians. Extensive experiments on three benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods. Our method achieves an mIoU improvement of +4.14, +20.42, and +1.7 on the LERF, 3D-OVS, and Replica datasets. These results validate our framework as a promising step toward open-vocabulary understanding in real-world robotic systems.
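The two-stage GT-anchored querying can be sketched in a few lines. The cosine-similarity scoring, the threshold, and all names below are illustrative assumptions; only the structure (retrieve a ground-truth anchor first, then score Gaussians against the anchor rather than the raw text embedding) follows the abstract.

```python
# Hedged sketch of GT-anchored querying: a text embedding retrieves the
# closest distilled ground-truth embedding, which then selects Gaussians.
# Similarity measure and threshold are assumptions for illustration.
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def gt_anchored_query(text_emb, gt_embs, gaussian_embs, threshold=0.9):
    """Retrieve the best GT anchor, then keep Gaussians similar to it."""
    anchor = max(gt_embs, key=lambda g: cosine(text_emb, g))
    return [i for i, g in enumerate(gaussian_embs)
            if cosine(anchor, g) >= threshold]
```

Anchoring on the distilled ground-truth rather than the raw query is what makes the per-Gaussian comparison consistent across views.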
|
| |
| 15:00-16:30, Paper TuI2I.116 | Add to My Program |
| DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment |
|
| Wang, Wuqi | Chang’an University |
| Yang, Haochen | Cleveland State University |
| Li, Baolu | Cleveland State University |
| Sun, Jiaqi | Chang'an University |
| Zhao, Xiangmo | School of Information Engineering, Chang'an University |
| Xu, ZhiGang | Chang'an University |
| Guo, Qing | Agency for Science, Technology and Research (A*STAR) |
| Min, Haigen | Chang'an University |
| Zhang, Tianyun | Cleveland State University |
| Yu, Hongkai | Cleveland State University |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation, Data Sets for Robotic Vision
Abstract: Low-light conditions pose a challenge to vision-centric perception systems for autonomous driving in dark environments. In this paper, we propose a new benchmark dataset (named DarkDriving) to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets can only be collected by controlling exposure over small ranges in static scenes, and the dark images in current nighttime driving datasets lack precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day-night aligned dataset in dynamic driving scenes has significantly limited research in this area. Using a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collect the first real-world day-night aligned dataset for autonomous driving in dark environments. The DarkDriving dataset contains 11,408 day-night image pairs precisely aligned in location and spatial content, with an alignment error of only a few centimeters. For each pair, we also manually label 2D object bounding boxes. DarkDriving introduces four perception-related tasks: low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and 3D detection of autonomous driving in dark environments. The experimental results show that our DarkDriving dataset provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving, and that it also generalizes to enhancing dark images and promoting detection in other low-light driving environments, such as nuScenes.
|
| |
| 15:00-16:30, Paper TuI2I.117 | Add to My Program |
| Safe Whole-Body Loco-Manipulation Via Combined Model and Learning-Based Control |
|
| Schperberg, Alexander | Mitsubishi Electric Research Laboratories |
| Wang, Yeping | University of Wisconsin-Madison |
| Di Cairano, Stefano | Mitsubishi Electric Research Laboratories |
Keywords: Whole-Body Motion Planning and Control, Legged Robots, Physical Human-Robot Interaction
Abstract: Simultaneous locomotion and manipulation enables robots to interact with their environment beyond the constraints of a fixed base. However, coordinating legged locomotion with arm manipulation, while considering safety and compliance during contact interaction remains challenging. To this end, we propose a whole-body controller that combines a model-based admittance control for the manipulator arm with a Reinforcement Learning (RL) policy for legged locomotion. The admittance controller maps external wrenches—such as those applied by a human during physical interaction—into desired end-effector velocities, allowing for compliant behavior. The velocities are tracked jointly by the arm and leg controllers, enabling a unified 6-DoF force response. The model-based design permits accurate force control and safety guarantees via a Reference Governor (RG), while robustness is further improved by a Kalman filter enhanced with neural networks for reliable base velocity estimation. We validate our approach in both simulation and hardware using the Unitree Go2 quadruped robot with a 6-DoF arm and wrist-mounted 6-DoF Force/Torque sensor. Results demonstrate accurate tracking of interaction-driven velocities, compliant behavior, and safe, reliable performance in dynamic settings.
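The admittance mapping the abstract describes, from external wrench to desired end-effector velocity, admits a compact sketch. The virtual mass/damping gains, the explicit-Euler integration, and the per-axis decoupling below are illustrative assumptions, not the paper's controller.

```python
# Hedged sketch of a 6-DoF admittance law: M*dv/dt + D*v = F_ext,
# integrated per axis with explicit Euler. Gains are assumptions.

def admittance_step(v, wrench, mass, damping, dt):
    """One Euler step of the admittance dynamics for each of 6 axes."""
    return [vi + dt * (fi - d * vi) / m
            for vi, fi, m, d in zip(v, wrench, mass, damping)]
```

Under a constant applied wrench, the commanded velocity settles at F/D, giving the compliant "push to move" behavior that the leg and arm controllers then track jointly.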
|
| |
| 15:00-16:30, Paper TuI2I.118 | Add to My Program |
| Learning Quadruped Walking from Seconds of Demonstration |
|
| Zhang, Ruipeng | University of California, San Diego |
| Yu, Hongzhan | University of California San Diego |
| Chang, Ya-Chien | University of California San Diego |
| Li, Chenghao | University of California, San Diego |
| Christensen, Henrik I. | University of California, San Diego |
| Gao, Sicun | UCSD |
Keywords: Imitation Learning, Legged Robots
Abstract: Quadruped locomotion provides a natural setting for understanding when model-free learning can outperform model-based control design, by exploiting data patterns to bypass the difficulty of optimizing over discrete contacts and the combinatorial explosion of mode changes. We give a principled analysis of why imitation learning with quadrupeds can be inherently effective in a small data regime, based on the structure of its limit cycles, Poincaré return maps, and local numerical properties of neural networks. The understanding motivates a new imitation learning method that regulates the alignment between variations in a latent space and those over the output actions. Hardware experiments confirm that a few seconds of demonstration is sufficient to train various locomotion policies from scratch entirely offline with reasonable robustness.
|
| |
| 15:00-16:30, Paper TuI2I.119 | Add to My Program |
| GS-UVCE: Gaussian Splatting-Driven Unsupervised Visual Consistency Enhancement for Underwater 3D Scene Reconstruction |
|
| Li, Xiang | Dalian University of Technology |
| Li, Chi | Dalian University |
| Xu, Yiming | Dalian University of Technology |
| Zhuang, Yan | Dalian University of Technology |
Keywords: Visual Learning, Deep Learning for Visual Perception, RGB-D Perception
Abstract: Underwater 3D scene reconstruction is critical for the operation of underwater robotics, yet remains highly challenging due to the semi-transparent water medium, which introduces optical distortions, light scattering, and severe visibility degradation. Therefore, effective underwater image enhancement is a prerequisite for reliable reconstruction. However, existing approaches typically enhance individual views with pre-trained models before reconstruction, leading to poor generalization and inconsistent multi-view results. To address these limitations, we propose GS-UVCE, an end-to-end framework for Gaussian Splatting-driven Unsupervised Visual Consistency Enhancement. GS-UVCE incorporates a Medium-MLP to model water-medium effects and a Light-MLP to adaptively correct illumination, ensuring illumination consistency. Furthermore, depth regularization is introduced to preserve geometric consistency under varying scene conditions. Extensive experiments on multiple underwater datasets show that GS-UVCE consistently outperforms SOTA methods, achieving superior reconstruction fidelity and visual consistency enhancement.
|
| |
| 15:00-16:30, Paper TuI2I.120 | Add to My Program |
| AWENet: A Self-Supervised Network for Efficient Interest Point Detection and Description |
|
| Jia, Pengwei | Inner Mongolia University |
| Li, Kang | Inner Mongolia University |
| Batu, Siren | Inner Mongolia University |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Audio-Visual SLAM
Abstract: We introduce AWENet (Attention-guided Wavelet Enhancement Network), an efficient self-supervised network for joint interest point detection and description that balances computational speed with feature accuracy. The network preserves fine structural details while employing multi-scale attention to enhance the discriminability of descriptors, leading to more precise and reliable interest point correspondences. Evaluations on the HPatches dataset demonstrate that AWENet achieves competitive performance in repeatability, localization accuracy, and matching robustness. Its lightweight design ensures fast processing and low computational cost, making it well suited for applications where efficiency is critical. Qualitative results show that the network generates dense and accurate correspondences under diverse transformations, including changes in viewpoint and illumination. Overall, AWENet provides a practical and effective solution for learning local features, achieving strong matching performance without relying on heavy computation.
|
| |
| 15:00-16:30, Paper TuI2I.121 | Add to My Program |
| Metric, Inertially Aligned Monocular State Estimation Via Kinetodynamic Priors |
|
| Liu, Jiaxin | ShanghaiTech University |
| Li, Min | ShanghaiTech University |
| Xu, Wanting | ShanghaiTech University |
| Li, Liang | The University of Hong Kong |
| Yang, Jiaqi | ShanghaiTech University |
| Kneip, Laurent | ShanghaiTech University |
Keywords: SLAM, Flexible Robotics, Kinematics
Abstract: Accurate state estimation for flexible robotic systems poses significant challenges, particularly for platforms with dynamically deforming structures that invalidate rigid-body assumptions. This paper addresses this problem and enables the extension of existing rigid-body pose estimation methods to non-rigid systems. Our approach integrates two core components: first, we capture elastic properties using a deformation-force model, efficiently learned via a Multi-Layer Perceptron; second, we resolve the platform's inherently smooth motion using continuous-time B-spline kinematic models. By continuously applying Newton's Second Law, our method formulates the relationship between visually-derived trajectory acceleration and predicted deformation-induced acceleration. We demonstrate that our approach not only enables robust and accurate pose estimation on non-rigid platforms, but also shows that the properly modeled platform physics allow for the recovery of inertial sensing properties. We validate this feasibility on a simple spring-camera system, showing how it robustly resolves the typically ill-posed problem of metric scale and gravity recovery in monocular visual odometry.
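The core constraint the abstract formulates, that visually derived trajectory acceleration must match the acceleration predicted from deformation forces via Newton's second law, can be illustrated with a toy residual. The paper uses continuous-time B-splines and a learned deformation-force model; the finite-difference acceleration and all names below are simplifying assumptions.

```python
# Illustrative sketch (not the authors' code): per-sample mismatch between
# observed trajectory acceleration and the force-predicted acceleration
# a = F/m. Finite differences stand in for the paper's B-spline derivatives.

def central_accel(positions, dt):
    """Second-order central-difference acceleration along a 1D trajectory."""
    return [(positions[i - 1] - 2 * positions[i] + positions[i + 1]) / dt**2
            for i in range(1, len(positions) - 1)]

def newton_residuals(positions, forces, mass, dt):
    """Residuals of Newton's second law at interior trajectory samples."""
    acc = central_accel(positions, dt)
    return [a - f / mass for a, f in zip(acc, forces[1:-1])]
```

Driving such residuals to zero is what couples the visual trajectory to the physical model, and it is this coupling that makes metric scale and gravity observable in the monocular setting.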
|
| |
| 15:00-16:30, Paper TuI2I.122 | Add to My Program |
| A Tri-Axial FBG-Based Force Sensor at the Tool Tip of a Continuum Manipulator for Single-Port Access Surgery |
|
| Zhang, Jiading | Beijing Jiaotong University |
| Hao, Lijun | Beijing Jiaotong Uinversity |
| Liu, Siqi | Beijing Jiaotong University |
| Zhao, Jiangran | Shanghai Jiao Tong University |
| Gong, Jiamiao | National Center for Cardiovascular Diseases and Fuwai Hospital |
| Xu, Kai | Shanghai Jiao Tong University |
| Liu, Sheng | Fuwai Hospital |
| Zheng, Zhe | Fuwai Hospital |
| Yang, Tangwen | Beijing Jiaotong University |
Keywords: Force and Tactile Sensing, Sensor Fusion, Surgical Robotics: Laparoscopy
Abstract: The absence of force feedback remains a major bottleneck in the development of robotic laparoendoscopic single-site (R-LESS) surgery, reducing the control precision of surgical instruments and increasing the risk of tissue damage. To address this challenge, we propose a miniature triaxial force sensor based on Fiber Bragg Grating (FBG), featuring high precision, nonlinear decoupling capability, and seamless integration with the tool tip of a continuum manipulator for single-port access surgery. The sensor comprises a monolithic elastic body with a dumbbell-shaped groove, where four FBGs are symmetrically arranged at 90° intervals around the circumference to form a redundant measurement unit, thereby enhancing sensing accuracy. A novel Whale Migration Algorithm based Kernel Extreme Learning Machine (WMA-KELM) is introduced to address the nonlinear coupling influences arising from manipulator integration, demonstrating superior accuracy and robustness compared to conventional methods. Experimental results show that within the ranges of axial force [0 N, 5 N] and radial force [-2.5 N, 2.5 N], the maximum full-scale (FS) error is less than 1% in all dimensions, the maximum RMSE is 0.0308 N, and the maximum repeatability error is within ±0.24%. These results validate the force sensor integrated with the continuum manipulator, and the proposed algorithm is effective and reliable.
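The redundant four-FBG arrangement can be illustrated with a simplified linear decoupling; the paper itself uses the nonlinear WMA-KELM model, so the calibration matrix below is purely an illustrative stand-in. Opposite gratings respond antisymmetrically to bending (radial force) and in common mode to axial loading.

```python
# Hedged, simplified stand-in for the paper's nonlinear decoupling: a fixed
# linear calibration matrix C maps four FBG wavelength shifts to (Fx, Fy, Fz).
# The numbers in C are illustrative assumptions, not calibrated values.

def decouple(C, dlambda):
    """F = C @ dlambda for a 3x4 calibration matrix (plain lists)."""
    return [sum(cij * dj for cij, dj in zip(row, dlambda)) for row in C]

C = [
    [0.5, 0.0, -0.5, 0.0],    # Fx from the 0/180 degree grating pair
    [0.0, 0.5, 0.0, -0.5],    # Fy from the 90/270 degree grating pair
    [0.25, 0.25, 0.25, 0.25]  # Fz from the common-mode shift
]
```

The redundancy means pure bending cancels out of the Fz row while pure stretching cancels out of the Fx/Fy rows, which is one reason the four-grating layout improves accuracy.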
|
| |
| 15:00-16:30, Paper TuI2I.123 | Add to My Program |
| Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow |
|
| Dharmarajan, Karthik | Stanford University |
| Huang, Wenlong | Stanford University |
| Wu, Jiajun | Stanford University |
| Fei-Fei, Li | Stanford University |
| Zhang, Ruohan | Stanford University |
Keywords: Machine Learning for Robot Control, Big Data in Robotics and Automation, Perception for Grasping and Manipulation
Abstract: Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories—including rigid, articulated, deformable, and granular. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos, visualizations, and appendix are available at https://dream2flow.github.io/.
|
| |
| 15:00-16:30, Paper TuI2I.124 | Add to My Program |
| A Data-Driven Approach for Control and Stabilization of a Single Actuator Monocopter |
|
| Sufiyan, Danial | Singapore University of Technology & Design |
| Win, Luke Soe Thura | Singapore University of Technology & Design |
| Win, Shane Kyi Hla | Singapore University of Technology & Design |
| Tan, Tee Meng | Singapore University of Technology & Design |
| Foong, Shaohui | Singapore University of Technology and Design |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: In this paper, we use a machine learning approach to stabilize a Single Actuator Monocopter (SAM), showing its ability to operate autonomously outdoors utilizing an onboard Inertial Measurement Unit (IMU). We introduce a neural network-based proportional stabilizer that works in parallel to cascaded P/PID controllers. This network uses the IMU’s data to predict the world frame angular velocity, which is then used to stabilize the SAM. Training data was collected to establish correspondences between the IMU readings and the world frame angular velocity from flights conducted within an indoor motion capture environment. We used data augmentation to improve the network’s generalization and prediction performance by 9%. Once trained, the neural network was deployed on the SAM to estimate its angular velocity in real time. We then tested the SAM’s autonomous capabilities in a large semi-outdoor space of approximately 16,000 m³ with wind disturbances of up to 1.5 m/s. We demonstrate position hold, waypoint, and continuous tracking tests, achieving median position errors of 0.5 m, 1.05 m, and 2.22 m, respectively; without stabilization, all of the defined tests would fail.
|
| |
| 15:00-16:30, Paper TuI2I.125 | Add to My Program |
| Robotic Cell Manipulation at the Solid-Liquid Interface for Cryopreservation |
|
| Jiang, Aojun | Univ. of Toronto |
| Yang, Han | Dalian Univ. of Tech.; CUHK-Shenzhen |
| Song, Haocong | Univ. of Toronto |
| Chen, Wenyuan | Univ. of Toronto |
| Sun, Yu | Univ. of Toronto |
| Zhang, Zhuoran | Dalian Univ. of Tech.; CUHK-Shenzhen |
Keywords: Biological Cell Manipulation, Automation at Micro-Nano Scales, Robotics and Automation in Life Sciences
Abstract: Automating cell manipulation at a solid-liquid interface is a critical challenge for biomedical applications such as embryo cryopreservation. Unlike manipulation in a full liquid medium, the cell-substrate contact creates a significant static friction force that is not readily measurable with current sensing or vision technologies. This unpredictability poses a high risk of cell loss, as the cell can transition abruptly from a static to a high-velocity state when the applied hydrodynamic force breaches the friction threshold. Existing methods fail to estimate this hidden friction parameter and cannot anticipate the sudden dynamic shift. To address this challenge, this paper proposes a worst-case predictive control approach with online adversarial parameter estimation (WPC-OAPE). The core innovation is the inference of the static friction barrier from observations made while the cell is still stationary. This estimate then informs a predictive controller that proactively plans against the worst-case scenario, to select the optimal action. The WPC-OAPE scheme was validated in robotic embryo vitrification experiments, where it achieved a 100% success rate with zero cell loss. This performance significantly surpassed open-loop (66.6% success) and standard predictive control (83.3% success) methods, proving its potential for clinical applications.
|
| |
| 15:00-16:30, Paper TuI2I.126 | Add to My Program |
| From Manual to Operation: A Home Appliance Agent |
|
| Mao, Bo | Beijing University of Posts and Telecommunications |
| Huang, Yuming, Troy | Beijing University of Posts and Telecommunications |
| Chai, Jiayang | Beijing University of Posts and Telecommunications |
| Liu, Huaping | Tsinghua University |
| Guo, Di | Beijing University of Posts and Telecommunications |
Keywords: AI-Enabled Robotics, Domestic Robotics, Agent-Based Systems
Abstract: Operating household appliances by reading and understanding user manuals remains a fundamental and challenging problem in robotics. Recent works leverage large language models (LLMs) and vision-language models (VLMs) to interpret manuals, improving appliance operation success. However, these approaches fail when manuals are unavailable or incomplete. In this paper, we introduce an autonomous assistant for robotic appliance operation, built upon an LLMs/VLMs-powered multi-agent collaborative framework. Our system can read, comprehend, and summarize manuals, autonomously infer operational logic, and execute actions on appliances with a robotic arm. Importantly, for unseen appliances without manuals, it can acquire operational knowledge from generalized manuals and on-demand web search. Extensive evaluations on over one thousand tasks show that our framework substantially outperforms baselines and achieves robust performance in simulation and real-world experiments.
|
| |
| 15:00-16:30, Paper TuI2I.127 | Add to My Program |
| An End-To-End Trajectory Planner for Safe and Efficient Navigation in Crowded Dynamic Environments |
|
| Zhang, Shuting | Peking University |
| Wang, Haowen | Peking University |
| Li, Guangchen | Peking University |
Keywords: Collision Avoidance, Motion and Path Planning, Aerial Systems: Perception and Autonomy
Abstract: This paper presents a novel end-to-end trajectory planning framework that integrates LiDAR-based perception with trajectory optimization, enabling safe and efficient navigation in dynamic environments without relying on semantic detection or explicit kinematic modeling. Learning-based dynamic collision avoidance methods often depend on reinforcement learning, which introduces challenges related to training efficiency, model generalization, and deployment safety. To address these limitations, we propose a lightweight map representation for temporally continuous dynamic obstacles, facilitating unsupervised network training with physically simulated data. Additionally, a repulsion-based adjustment method built upon motion primitives allows adaptive trajectory planning in highly crowded scenarios where no feasible trajectory exists, balancing target-reaching objectives with motion safety. Extensive simulations and real-world experiments demonstrate that the proposed framework achieves millisecond-level planning latency while ensuring high safety, trajectory smoothness, and flight efficiency. The demonstration video is available on the project website: https://swift520.github.io/Dynamic-Planner/.
|
| |
| 15:00-16:30, Paper TuI2I.128 | Add to My Program |
| VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning Via Online Monte Carlo Tree Search |
|
| Guo, Wenkai | Nanyang Technological University |
| Lu, Guanxing | Tsinghua Shenzhen International Graduate School, Tsinghua University |
| Deng, Haoyuan | Nanyang Technological University |
| Wu, Zhenyu | Beijing University of Posts and Telecommunications |
| Tang, Yansong | Tsinghua University |
| Wang, Ziwei | Nanyang Technological University |
|
|
| |
| 15:00-16:30, Paper TuI2I.129 | Add to My Program |
| Multi-Horizon Lane Change Maneuver Prediction Using Multi-Modal Transformers |
|
| Rama, Petrit | Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau |
| Gummadi, Praveen | Technical University of Kaiserslautern |
| Bajcinca, Naim | RPTU Kaiserslautern-Landau |
Keywords: Intelligent Transportation Systems, Deep Learning Methods, Constrained Motion Planning
Abstract: Predicting lane change maneuvers is essential for ensuring safe autonomous driving, especially in complex urban environments. Building upon prior multi-modal and graph-based approaches, this work introduces a novel transformer-based architecture for multi-horizon lane change prediction that jointly estimates the lane change maneuver and the lane change phase. The proposed model integrates visual information from surround-view cameras, semantic masks for free space and lane markings, interaction-aware graph representations, and ego-vehicle state signals, within a unified transformer framework to capture spatial-temporal dependencies. In addition, a multi-level uncertainty estimation branch quantifies confidence at the level of modality, fusion, and prediction, to enhance interpretability and reliability. Experiments are conducted on WylonSet++, an extended in-house dataset collected using an instrumented test vehicle, annotated for lane change behavior analysis and maneuver phase transitions. The dataset comprises synchronized front-facing camera images, left and right surround-view camera images, together with vehicle state data. The dataset contains approximately 600 lane change sequences, providing the foundation for this study. Extensive evaluations demonstrate strong performance in anticipating lane change maneuvers and phase progression across short- and long-term prediction horizons in diverse real-world traffic scenarios.
|
| |
| 15:00-16:30, Paper TuI2I.130 | Add to My Program |
| EC3R-SLAM: Efficient and Consistent Monocular Dense SLAM with Feed-Forward 3D Reconstruction |
|
| Hu, Lingxiang | Paris Saclay University |
| Ait Oufroukh, Naima | University of Paris-Saclay |
| Bonardi, Fabien | Université D’Évry |
| Raymond, Ghandour | American University of the Middle East |
Keywords: SLAM, Mapping, Localization
Abstract: The application of monocular dense Simultaneous Localization and Mapping (SLAM) is often hindered by high latency, large GPU memory consumption, and reliance on camera calibration. To relax these constraints, we propose EC3R-SLAM, a novel calibration-free monocular dense SLAM framework that jointly achieves high localization and mapping accuracy, low latency, and low GPU memory consumption. The framework achieves this efficiency through the coupling of a tracking module, which maintains a sparse map of feature points, and a mapping module based on a feed-forward 3D reconstruction model that simultaneously estimates camera intrinsics. In addition, both local and global loop closures are incorporated to ensure mid-term and long-term data association, enforcing multi-view consistency and thereby enhancing the overall accuracy and robustness of the system. Experiments across multiple benchmarks show that EC3R-SLAM achieves competitive performance compared to state-of-the-art methods, while being faster and more memory-efficient. Moreover, it runs effectively even on resource-constrained platforms such as laptops and Jetson Orin NX, highlighting its potential for real-world robotics applications.
|
| |
| 15:00-16:30, Paper TuI2I.131 | Add to My Program |
| Adaptive Dynamics Planning for Robot Navigation |
|
| Lu, Yuanjie | George Mason University |
| Mao, Mingyang | University of South Florida |
| Wang, Linji | George Mason University |
| Xu, Tong | George Mason University |
| Lin, Xiaomin | University of South Florida |
| Xiao, Xuesu | George Mason University |
Keywords: Motion and Path Planning, Reinforcement Learning, Dynamics
Abstract: Autonomous robot navigation systems often rely on hierarchical planning, where global planners compute collision-free paths without considering dynamics, and local planners enforce dynamics constraints to produce executable commands. This discontinuity in dynamics often leads to trajectory tracking failure in highly constrained environments. Recent approaches integrate dynamics within the entire planning process by gradually decreasing its fidelity, e.g., increasing integration steps and reducing collision checking resolution, for real-time planning efficiency. However, they assume that the fidelity of the dynamics should decrease according to a manually designed scheme. Such static settings fail to adapt to environmental complexity variations, resulting in computational overhead in simple environments or insufficient dynamics consideration in obstacle-rich scenarios. To overcome this limitation, we propose Adaptive Dynamics Planning (ADP), a learning-augmented paradigm that uses reinforcement learning to dynamically adjust robot dynamics properties, enabling planners to adapt across diverse environments. We integrate ADP into three different planners and further design a standalone ADP-based navigation system, benchmarking them against other baselines. Experiments in both simulation and real-world tests show that ADP consistently improves navigation success, safety, and efficiency.
|
| |
| 15:00-16:30, Paper TuI2I.132 | Add to My Program |
| Coverage Path Planning for Holonomic UAVs Via Uniaxial-Feasible, Gap-Severity Guided Decomposition |
|
| Alarcon Granadeno, Pedro Antonio | University of Notre Dame |
| Cleland-Huang, Jane | University of Notre Dame |
Keywords: Motion and Path Planning, Search and Rescue Robots, Aerial Systems: Applications
Abstract: Modern coverage path planning (CPP) for holonomic UAVs in emergency response must contend with diverse environments where regions of interest (ROIs) often take the form of highly irregular polygons, characterized by asymmetric shapes, dense clusters of concavities, and multiple internal holes. Modern CPP pipelines typically rely on decomposition strategies that overfragment such polygons into numerous subregions. This increases the number of sweep segments and connectors, which in turn adds inter-region travel and forces more frequent reorientation. These effects ultimately result in longer completion times and degraded trajectory quality. We address this with a decomposition strategy that applies a recursive dual-axis monotonicity criterion, with cuts guided by a cumulative gap severity metric. This approach distributes clusters of concavities more evenly across subregions and produces a minimal set of partitions that remain sweepable under a parallel-track maneuver. We pair this with a global optimizer that jointly selects sweep paths and inter-partition transitions to minimize total path length, transition overhead, and turn count. We demonstrate that our proposed approach achieves the lowest mean path-length and completion-time overhead among 15 other CPP pipelines.
|
| |
| 15:00-16:30, Paper TuI2I.133 | Add to My Program |
| Fast Motion Planning for Non-Holonomic Mobile Robots Via a Rectangular Corridor Representation of Structured Environments |
|
| Gonzalez-Garcia, Alejandro | KU Leuven |
| Wyns, Sebastiaan | KU Leuven |
| De Santis, Sonia | KU Leuven |
| Swevers, Jan | KU Leuven |
| Decré, Wilm | Katholieke Universiteit Leuven |
Keywords: Motion and Path Planning, Nonholonomic Motion Planning, Autonomous Vehicle Navigation
Abstract: We present a complete framework for fast motion planning of non-holonomic autonomous mobile robots in highly complex but structured environments. Conventional grid-based planners struggle with scalability, while many kinematically-feasible planners impose a significant computational burden due to their search space complexity. To overcome these limitations, our approach introduces a deterministic free-space decomposition that creates a compact graph of overlapping rectangular corridors. This method enables a significant reduction in the search space, without sacrificing path resolution. The framework then performs online motion planning by finding a sequence of rectangles and generating a near-time-optimal, kinematically-feasible trajectory using an analytical planner. The result is a highly efficient solution for large-scale navigation. We validate our framework through extensive simulations and on a physical robot. The implementation will be made publicly available as open-source software.
|
| |
| 15:00-16:30, Paper TuI2I.134 | Add to My Program |
| AnyThermal: Towards Learning Universal Representations for Thermal Perception |
|
| Maheshwari, Parv | Carnegie Mellon University |
| Karhade, Jay | Carnegie Mellon University |
| Chawla, Yogesh | University of Nebraska Lincoln |
| Adu, Isaiah | Pennsylvania State University |
| Heisen, Florian | Technical University of Munich |
| Porco, Andrew | Carnegie Mellon University |
| Jong, Andrew | Carnegie Mellon University |
| Liu, Yifei | Carnegie Mellon University |
| Pitla, Santosh | University of Nebraska Lincoln |
| Scherer, Sebastian | Carnegie Mellon University |
| Wang, Wenshan | Carnegie Mellon University |
Keywords: Deep Learning for Visual Perception, Visual Learning, Data Sets for Robotic Vision
Abstract: We present AnyThermal, a thermal backbone that captures robust task-agnostic thermal features suitable for a variety of tasks such as cross-modal place recognition, thermal segmentation, and monocular depth estimation using thermal images. Existing thermal backbones that follow task-specific training from small-scale data result in utility limited to a specific environment and task. Unlike prior methods, AnyThermal can be used for a wide range of environments (indoor, aerial, off-road, urban) and tasks, all without task-specific training. Our key insight is to distill the feature representations from visual foundation models such as DINOv2 into a thermal encoder using thermal data from these multiple environments. To bridge the diversity gap of the existing RGB-Thermal datasets, we introduce the TartanRGBT platform, the first open-source data collection platform with synced RGB-Thermal image acquisition. We use this payload to collect the TartanRGBT dataset - a diverse and balanced dataset collected in 4 environments. We demonstrate the efficacy of AnyThermal and TartanRGBT, achieving state-of-the-art results with improvements of up to 36% across diverse environments and downstream tasks on existing datasets.
|
| |
| 15:00-16:30, Paper TuI2I.135 | Add to My Program |
| Adaptive Linear Path Model-Based Diffusion |
|
| Shimizu, Yutaka | University of California, Berkeley |
| Tomizuka, Masayoshi | University of California |
Keywords: Machine Learning for Robot Control, Optimization and Optimal Control, Motion and Path Planning
Abstract: The interest in combining model-based control approaches with diffusion models has been growing. Although diffusion models have produced many impressive robotic control results on difficult tasks, their performance is highly sensitive to the choice of scheduling parameters, making parameter tuning one of the most critical challenges. We introduce Linear Path Model-Based Diffusion (LP-MBD), which replaces the variance-preserving schedule with a flow-matching–inspired linear probability path. This yields a geometrically interpretable and decoupled parameterization that reduces tuning complexity and provides a stable foundation for adaptation. Building on this, we propose Adaptive LP-MBD (ALP-MBD), which leverages reinforcement learning to adjust diffusion steps and noise levels according to task complexity and environmental conditions. Across numerical studies, Brax benchmarks, and mobile-robot trajectory tracking, LP-MBD simplifies scheduling while maintaining strong performance, and ALP-MBD further improves robustness, adaptability, and real-time efficiency.
|
| |
| 15:00-16:30, Paper TuI2I.136 | Add to My Program |
| ImaginationPolicy: Towards Generalizable, Precise and Reliable End-To-End Policy for Robotic Manipulation |
|
| Lu, Dekun | South China University of Technology |
| Gao, Wei | Massachusetts Institute of Technology |
| Jia, Kui | The Chinese University of Hong Kong, Shenzhen |
| |
| 15:00-16:30, Paper TuI2I.137 | Add to My Program |
| Bridging Perception and Planning: Towards End-To-End Planning for Signal Temporal Logic Tasks |
|
| Ye, Bowen | Shanghai Jiao Tong University |
| Huang, Junyue | Shanghai Jiao Tong University |
| Liu, Yang | University of Minnesota |
| Qiao, Xiaozhen | University of Science and Technology of China |
| Yin, Xiang | Shanghai Jiao Tong Univ |
Keywords: Task Planning, Planning, Scheduling and Coordination, Reactive and Sensor-Based Planning
Abstract: We investigate the task and motion planning problem for Signal Temporal Logic (STL) specifications in robotics. Existing STL methods rely on pre-defined maps or mobility representations, which are ineffective in unstructured real-world environments. We propose the Structured-MoE STL Planner (S-MSP), a differentiable framework that maps synchronized multi-view camera observations and an STL specification directly to a feasible trajectory. S-MSP integrates STL constraints within a unified pipeline, trained with a composite loss that combines trajectory reconstruction and STL robustness. A structure-aware Mixture-of-Experts (MoE) model enables horizon-aware specialization by projecting sub-tasks into temporally anchored embeddings. We evaluate S-MSP using a high-fidelity simulation of factory-logistics scenarios with temporally constrained tasks. Experiments show that S-MSP outperforms single-expert baselines in STL satisfaction and trajectory feasibility. A rule-based safety filter at inference improves physical executability without compromising logical correctness, showcasing the practicality of the approach.
|
| |
| 15:00-16:30, Paper TuI2I.138 | Add to My Program |
| Long-Horizon Planning with Large Language Models for Indoor Assistive Navigation of the Visually Impaired |
|
| Sun, Yiyang | Beijing University of Chemical Technology |
| Lin, Chengran | Harbin Institute of Technology |
| Xia, Ji | Beijing University of Chemical Technology |
| Cao, Zhengcai | Harbin Institute of Technology |
Keywords: AI-Based Methods, Agent-Based Systems, Wearable Robotics
Abstract: For visually impaired individuals, assistive navigation systems play a crucial role in enabling independent mobility. However, long-horizon planning based on natural language (NL) instructions in complex indoor environments remains a significant challenge. Recent studies show the strong potential of Large Language Models (LLMs) in NL understanding and task-level planning. Yet, the inherent limitations of LLMs in mathematical reasoning and their susceptibility to hallucination hinder their reliability in low-level path planning. In this paper, we introduce an LLM-based indoor assistive navigation system that interprets NL instructions from visually impaired users for autonomous navigation. At its core is a novel planning agent that grounds instructions to the environment's topological map and generates optimal route plans. To avoid hallucination in geometric reasoning, the LLM handles only high-level semantic planning, while precise node-level paths are delegated to a classical graph search algorithm. We further implement a wearable assistive device that provides voice and vibrotactile feedback to deliver hands-free navigation. Offline evaluations and real-world experiments demonstrate that our system can reliably plan grounded routes and enable visually impaired users to autonomously complete long-horizon navigation tasks. Anonymous project page is available at https://lhp-ian.github.io.
|
| |
| 15:00-16:30, Paper TuI2I.139 | Add to My Program |
| Viper: Verifiable Imitation Learning Policy for Efficient Robotic Manipulation |
|
| Cheng, Xianfeng | Sun Yat-Sen University |
| Gao, Qing | Sun Yat-Sen University |
| Chen, Guangyu | Sun Yat-Sen University |
| Xiong, Rui | Sun Yat-Sen University |
| Hu, Junjie | The Chinese University of Hong Kong, Shenzhen |
| Guo, Yulan | Sun Yat-Sen University |
| Ju, Zhaojie | University of Portsmouth |
Keywords: Imitation Learning, Learning from Demonstration
Abstract: Imitation learning (IL) presents a promising paradigm for enabling embodied robots to efficiently acquire human-like manipulation skills. However, prevailing methods face a persistent trade-off between motion precision and computational tractability. To resolve this fundamental challenge, this paper introduces Viper, a framework for Verifiable Imitation learning Policy for Efficient Robotic manipulation. Viper integrates principles of Nonlinear Model Predictive Control (NMPC) within a learning-based model. Grounded in an NMPC-style closed-loop architecture, the proposed method unifies the modeling of nonlinear system dynamics with online, multi-horizon optimization of state-action predictions, while intrinsically embedding physical constraints. This co-design enables both smooth trajectory generation and fast execution. Furthermore, a theoretical stability analysis for the Viper framework is provided. Extensive evaluations, from simulated benchmarks to real-world manipulation tasks, demonstrate that Viper effectively reconciles the competing demands of precision and speed inherent in existing robotic IL paradigms. The source code will be released upon paper acceptance.
|
| |
| 15:00-16:30, Paper TuI2I.140 | Add to My Program |
| NavCrafter: Exploring 3D Scenes from a Single Image |
|
| Duan, Hongbo | Tsinghua University |
| Zhuang, Peiyu | Huawei |
| Liu, Yi | Tsinghua University |
| Zhang, Zhengyang | Tsinghua University |
| Zhang, Yuxin | Tsinghua University |
| Luo, Pengting | Huawei |
| Liu, Fangming | Huazhong University of Science and Technology |
| Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School at Shenzhen, Tsinghua University, 518055 Shenzhen, China |
|
|
| |
| 15:00-16:30, Paper TuI2I.141 | Add to My Program |
| DextrAH-RGB: Visuomotor Policies to Grasp Anything with Dexterous Hands |
|
| Singh, Ritvik | NVIDIA, UC Berkeley |
| Allshire, Arthur | UC Berkeley |
| Handa, Ankur | NVIDIA |
| Ratliff, Nathan | NVIDIA |
| Van Wyk, Karl | NVIDIA |
Keywords: Perception for Grasping and Manipulation, Multifingered Hands, Grippers and Other End-Effectors
Abstract: One of the most important, yet challenging, skills for a dexterous robot is grasping a diverse range of objects. Much of the prior work has been limited by speed, generality, or reliance on depth maps and object poses. In this paper, we introduce DextrAH-RGB, a system that can perform dexterous arm-hand grasping end-to-end from RGB image input. We train a policy in simulation through reinforcement learning that acts on a geometric fabric controller to dexterously grasp a wide variety of objects. We then distill this into an RGB-based policy strictly in simulation using photorealistic tiled rendering. To our knowledge, this is the first work that is able to demonstrate robust sim-to-real transfer of an end-to-end (monocular or stereo) RGB-based policy for complex, dynamic, contact-rich tasks such as dexterous grasping with multi-fingered hands. Unlike previous methods, DextrAH-RGB requires no explicit depth or CAD models, making it significantly more practical and robust in varied real-world lighting and texture conditions. It generalizes to novel objects and scenes, offering a strong step toward deployable, vision-based dexterous manipulation.
|
| |
| 15:00-16:30, Paper TuI2I.142 | Add to My Program |
| Learning Composable Skills by Discovering Spatial and Temporal Structure with Foundation Models |
|
| Nie, Neil | Stanford University |
| Huang, Wenlong | Stanford University |
| Mao, Jiayuan | MIT |
| Fei-Fei, Li | Stanford University |
| Liu, Weiyu | Stanford University |
| Wu, Jiajun | Stanford University |
Keywords: Manipulation Planning, Integrated Planning and Learning, Learning from Demonstration
Abstract: We present STACK, a framework for discovering and learning composable manipulation skills from unsegmented demonstrations by leveraging spatial and temporal structure extracted from foundation models. STACK automatically extracts temporal structure by segmenting raw demonstrations into short-horizon skills using a video-language model, and spatial structure by identifying skill-relevant elements in 3D point cloud observations. For each discovered skill, we learn a diffusion-based trajectory sampler and a skill effect model, both of which operate in the reference frame of the relevant scene element. At test time, given a language goal, STACK segments the 3D scene, samples skill trajectories, and composes them by simulating geometric effects. This enables generalization to new scene configurations, geometric constraints, and longer task horizons beyond training across diverse real-world manipulation tasks. Project page: https://icra-stack.github.io
|
| |
| 15:00-16:30, Paper TuI2I.143 | Add to My Program |
| Enhancing Classical Motion Planners Using RL with Safety Guarantees |
|
| Goldsztejn, Elias | Ben Gurion University |
| Brafman, Ronen | Ben-Gurion University |
Keywords: Reinforcement Learning, Motion and Path Planning, Imitation Learning
Abstract: Classical algorithms for autonomous navigation, while well-understood and safe, require manual parameter tuning by experts to perform well. APPL and similar methods use machine learning to dynamically adjust planner parameters during deployment. This approach maintains the safety of classical systems but remains constrained by the underlying algorithm. Instead of parameter tuning, we suggest using classical planners to regulate action selection of a reinforcement learning (RL) algorithm. The resulting policy is provably similar to the well-understood classical algorithm, performs better than both a well-tuned classical planner and an unregularized RL-based policy, and can be shown to respect a user-controlled trust region even during training. In experiments, our method reduces traversal time by 8% (vs. DWA) and 43% (vs. TEB), and lowers proximity risk by 24% and 17%, respectively, while matching or surpassing learning-based baselines and aligning more closely with user preferences.
|
| |
| 15:00-16:30, Paper TuI2I.144 | Add to My Program |
| Benchmarking and Experimental Validation of a Real-Time Multi-Robot Hybrid Coverage Algorithm for Known Environments |
|
| Wälti, Lucas | EPFL |
| Martinoli, Alcherio | EPFL |
Keywords: Aerial Systems: Applications, Distributed Robot Systems, Multi-Robot Systems
Abstract: In the context of asset inspection, the size of the environment to be covered can be large. Mobile robotic systems are capable of acquiring more extensive data than static sensors, but the capacity of the robotic platform used can be limited by its autonomy and sensing capabilities. This is why multi-robot systems are interesting in such applications. However, scaling up to larger robot team sizes requires coordination among robots to be carried out efficiently. In this work, we investigate a hybrid coordination strategy with a team of micro-aerial vehicles, where a ground station centrally assigns tasks in real time to the robots, and the robots distributively coordinate their trajectories to carry out the coverage of a known asset. In particular, we perform the benchmarking and experimental validation of such a strategy. Several variants of the strategy are implemented by adapting existing state-of-the-art solutions to this context. Extensive simulation experiments are carried out in various environments to benchmark each variant and evaluate how their performance scales with the robot team size. The results show that the strategy scales well for larger robot teams, thanks to its efficient task generation process. Notably, despite its relatively simple but efficient task generation technique, it outperforms or is comparable to other methods employing more complex schemes (such as information gain or frontiers). Finally, we validated the proposed strategy with teams of up to three robots in physical experiments.
|
| |
| 15:00-16:30, Paper TuI2I.145 | Add to My Program |
| D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models |
|
| Nakaoka, Shintaro | Toyota Motor Corporation |
| Kanai, Takayuki | Toyota Motor Corporation |
| Tanaka, Kazuhito | Toyota Motor Corporation |
Keywords: Vision-Based Navigation, Transfer Learning, Imitation Learning
Abstract: Navigation Foundation Models (NFMs) trained on large, cross-embodied datasets have demonstrated powerful generalizability on various scenarios. Adopting in-domain fine-tuning upon an NFM efficiently calibrates the visuomotor policy, promising further improvement even in a novel scenario. However, the fine-tuned models still suffer from poor obstacle avoidance or fail to properly reach the provided goals. Furthermore, such model updates on a small subset of data typically erode the pretrained prior, compromising the pretraining generalization. Consequently, fine-tuning can instead degrade the model’s capability for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pretraining while efficiently learning in novel setups, such as environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pretrained backbone using zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pretrained knowledge across various behaviors. Despite its simplicity, our comprehensive evaluation of real-world navigation suggests that our method effectively enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed strategy maintains or further improves action prediction capability beyond the fine-tuned dataset, providing a key insight into continual learning towards general navigation. The project page: https://toyotafrc.github.io/DCLING-Proj/
|
| |
| 15:00-16:30, Paper TuI2I.146 | Add to My Program |
| UNIC: Learning Unified Multimodal Extrinsic Contact Estimation |
|
| Xu, Zhengtong | Purdue University |
| Shirai, Yuki | Samsung Research America |
Keywords: Force and Tactile Sensing, Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Contact-rich manipulation requires reliable estimation of extrinsic contacts—the interactions between a grasped object and its environment—which provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments. In this paper, we present UNIC, a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning. Extensive experiments demonstrate that UNIC performs reliably. It achieves a 9.6 mm average Chamfer distance error on unseen contact locations, performs well on unseen objects, remains robust under missing modalities, and adapts to dynamic camera viewpoints. These results establish extrinsic contact estimation as a practical and versatile capability for contact-rich manipulation.
|
| |
| 15:00-16:30, Paper TuI2I.147 | Add to My Program |
| M2oE: Modular Mixture of Experts for Multi-Morphology Reinforcement Learning of Modular Robots |
|
| Liu, Chang | Kyoto University |
| Xu, Qinchao | Kyoto University |
| Yagi, Satoshi | Kyoto University |
| Yamamori, Satoshi | Kyoto University |
| Zhu, Yaonan | The University of Tokyo |
| Iwasawa, Yusuke | The University of Tokyo |
| Yoshida, Kazuya | Tohoku University |
| Morimoto, Jun | Kyoto University |
Keywords: Cellular and Modular Robots, Reinforcement Learning
Abstract: Modular robots offer a promising solution for building versatile and adaptable robotic systems. For instance, space exploration robots can be designed to reconfigure to meet diverse task demands across varying environments. However, training such systems by Reinforcement Learning (RL) remains challenging due to the diversity of morphologies and the lack of simulation environments that support simultaneous multi-morphology learning. We present Modular Mixture of Experts (M2oE), a novel reinforcement learning backbone network that imitates the modular structure of robots to enable efficient and module-wise parallelizable policy learning for modular robots. In M2oE, the shared pool of experts, combined with an attention-based gating mechanism that dynamically selects experts based on inter-module correlations, enables both specialization and generalization. This structure supports training across multiple morphologies within a single framework, avoiding gradient conflicts and enhancing experience sharing across modules and morphologies. To support training, we also extend the Isaac Lab simulator with multi-morphology extensions that enable concurrent training across diverse robot configurations. Experiments on a space-exploration-inspired modular robot, Moonbot, demonstrate that M2oE significantly improves learning efficiency and achieves superior performance compared to both MLP and Transformer baselines. More information and the project video are available on the project website: https://ryuuchou17.github.io/m2oe
|
| |
| 15:00-16:30, Paper TuI2I.148 | Add to My Program |
| EMG-Based Torque Prediction for Assistive Exoskeleton Control Using Neural Networks with Bounded Generalization Error |
|
| Hoang, Duy | Université Paris-Saclay, LMF |
| Quesada, Lucas | Ecole Normale Supérieure Paris-Saclay |
| Berret, Bastien | Université Paris-Saclay |
| Bruneau, Olivier | ENS CACHAN |
| Fribourg, Laurent | Université Paris-Saclay, CNRS, ENS Paris-Saclay, LMF |
Keywords: Prosthetics and Exoskeletons, Human Performance Augmentation, Acceptability and Trust
Abstract: Electromyography (EMG) signals are widely used in assistive exoskeleton control for predicting human joint torque due to their ability to extract muscle activations before movement onset. The standard procedure for learning the EMG-to-torque model involves training the model on a training set of EMG-torque data, followed by validating the model on a separate test set. The comparison between models is generally undertaken on the test set. However, the analysis of model performance on the data outside the test set remains unaddressed. The lack of a guarantee for unseen data reduces the reliability of EMG-to-torque models in practical exoskeleton control. In this paper, we address this issue by proposing a bounded-generalization-error neural network (BGNN) for EMG-based torque prediction. Using gradient descent to train the network, we formulate at each training step a theoretical upper bound on the generalization error, reflecting the prediction error across the entire data distribution, including unseen data beyond the test set. The NN training is terminated when this upper bound reaches its minimum, thereby achieving the tightest guarantee on the generalization error. Experimental results on torque prediction demonstrated that, while ensuring such a bounded generalization error, our method still gave results comparable to those of classical models. The use of our BGNN in assistive exoskeleton control was also tested with 13 participants on a pick-and-place task with an upper limb exoskeleton. Experimental results on assistive control revealed that our method can reduce human physical fatigue without compromising movement speed or accuracy compared to natural human movement characteristics, particularly for generalization in novel tasks.
|
| |
| 15:00-16:30, Paper TuI2I.149 | Add to My Program |
| GRU-Based Kalman Filtering for 3D Multi-Object Tracking |
|
| Yuan, Zikang | Huazhong University, Wuhan, 430073, China |
| Wang, Xiaoxiang | Huazhong University of Science and Technology |
| Liu, Jiaxin | Huazhong University of Science and Technology |
| Feng, Miaojie | Huazhong University of Science and Technology |
| Zhang, Zhaoxing | Huazhong University of Science and Technology |
| Yang, Xin | Huazhong University of Science and Technology |
|
|
| |
| 15:00-16:30, Paper TuI2I.150 | Add to My Program |
| Resource-Constrained Robotic Planning in the Face of Mixed Uncertainty |
|
| Yin, Yihao | University of Chinese Academy of Sciences |
| Yu, Pian | University College London |
| Turrini, Andrea | Institute of Software, Chinese Academy of Sciences |
| Chi, Zhiming | Institute of Software, Chinese Academy of Sciences |
| Li, Yong | Institute of Software, Chinese Academy of Sciences |
| Zhang, Lijun | Institute of Software, Chinese Academy of Sciences, Beijing, China |
|
|
| |
| 15:00-16:30, Paper TuI2I.151 | Add to My Program |
| Complete Multi-Domain Decoupled Fusion Model for EEG-Based Person Identification |
|
| Wang, Zhixun | Taiyuan University of Technology |
| Lu, Jiayu | Taiyuan University of Technology |
| Liu, Tianyang | Taiyuan University of Technology |
| Zhu, Ziteng | Taiyuan University of Technology |
| Liu, Xiaofeng | Taiyuan University of Technology |
| Wang, Bin | Taiyuan University of Technology |
Keywords: Brain-Machine Interfaces, Deep Learning Methods, Medical Robots and Systems
Abstract: Electroencephalogram (EEG) signals have unique individual characteristics and have broad application prospects in identity authentication. At present, person identification (PI) based on EEG using the temporal-spatial-spectral feature extraction framework has achieved remarkable success. However, the existing methods suffer from coupled cross-domain feature parameters and insufficient feature fusion during feature extraction, which limits the recognition ability. Moreover, fixed-scale feature extractors can hardly exploit the subject-specific multi-scale information. To address these challenges, we propose CMDFM: a complete multi-domain decoupled fusion model for EEG-based PI. Firstly, we design an independent temporal-spatial-spectral attention mechanism to eliminate cross-domain parameter coupling. Secondly, a full-domain fusion mechanism is designed to comprehensively integrate the features of the temporal domain, spatial domain and spectral domain. Finally, an adaptive multi-scale CNN is designed to adjust the contribution of the multi-scale convolution kernel, thereby making full use of individual-specific multi-scale information. We use four datasets to verify our method. The experimental results show that our method is superior to all the state-of-the-art methods. The code of CMDFM is at https://github.com/2538441690/CMDFM.
|
| |
| 15:00-16:30, Paper TuI2I.152 | Add to My Program |
| Image-Level Domain Alignment for Real-Time Underwater Crack Detection Using YOLO with an ROV |
|
| Negue Kala, Pachelle Carelle | Lab-STICC, LS2N |
| Viel, Christophe | CNRS, Lab-STICC |
| Bergantin, Lucia | IMT Atlantique, Lab-STICC, Brest, France |
Keywords: Marine Robotics, Deep Learning for Visual Perception, Robotics and Automation in Construction
Abstract: Underwater concrete infrastructure plays a crucial role in energy and water systems. However, it requires regular inspections to ensure structural integrity. Remotely Operated Vehicles (ROVs) offer a safer and more cost-effective alternative to diver-based inspections. The data collected during inspections often require extensive post-mission processing, either manually or through computationally intensive algorithms. This limitation makes real-time damage detection during inspections impossible. In this study, we present a real-time image-level domain alignment pipeline suitable for deployment on resource-constrained hardware. It combines image enhancement with crack detection using a YOLO11n-seg model fine-tuned on a publicly available aerial concrete crack dataset. The model was quantized and deployed on a Jetson Nano, which was connected to an ROV for real-time inference. To reduce the domain gap between the raw underwater images captured by the ROV and the aerial training data, a Contrast Limited Adaptive Histogram Equalization (CLAHE)-based strategy was applied. Field tests were conducted on a submerged concrete embankment in a turbid lake environment. A validation dataset was developed to evaluate performance offline and is publicly available.
|
| |
| 15:00-16:30, Paper TuI2I.153 | Add to My Program |
| Integrated Hydrogel Patterning and Dynamic Microparticle Manipulation Using Optoelectronic Tweezers |
|
| Huang, Shunxiao | Beihang University |
| Ye, Jingwen | Beihang University |
| Niu, Wenyan | Beihang University |
| Wang, Ao | Beihang University |
| Zeng, Zijin | Beihang University |
| Li, Chan | Beihang University |
| Chen, Zaiyang | The Chinese University of Hong Kong |
| Sun, Hongyan | Beihang University |
| Guo, Yingjian | Beihang University |
| Feng, Lin | Beihang University |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales, Biological Cell Manipulation
Abstract: This paper presents an integrated optoelectronic tweezers platform that unifies hydrogel microstructure fabrication with subsequent dynamic microsphere manipulation, enabled by a dual-wavelength optical strategy for seamless and programmable control. Initial tests in low-conductivity aqueous media confirmed high-fidelity OET patterning, with edge roughness below 20 μm. Using a biocompatible GelMA–LAP system, we achieved: (i) precise single-region hydrogel structures (hexagram patterns, average deviation 2.779 μm); (ii) scalable, customizable assemblies of complex topologies, including gear and maze arrays; and (iii) coordinated navigation of single, double, and triple microspheres within hydrogels, with triple-sphere velocity differences < 0.9 μm/s. This approach overcomes the conventional OET limitation of decoupled photolithography and particle manipulation, integrating structure fabrication with dynamic control. It provides a versatile, reproducible platform for tissue engineering scaffolds, targeted drug delivery, and multi-particle coordination.
|
| |
| 15:00-16:30, Paper TuI2I.154 | Add to My Program |
| Mis: Light Response Agent for Video Comment with Multimodal Informative Seeking |
|
| Zhang, Dong | Soochow University |
| Shen, Tongfei | Soochow University |
| Tang, Zhiyu | The University of Queensland |
| Li, Shoushan | Soochow University |
| Zhou, Guodong | Soochow University |
Keywords: Representation Learning, Deep Learning Methods, AI-Based Methods
Abstract: Automatic response generation of video comments (RGVC) aims to generate a target reply to the content of the target comment based on the video context. Existing works for RGVC normally rely on large language models (LLMs), and mostly neglect the importance of extracting key information from both linguistic and visual perspectives, thereby limiting the potential to generate fluent and targeted responses in real applications. In this work, we introduce a lightweight response agent with a novel multimodal informative seeking approach (Mis), which includes a Comment Context Retrieval (CCR) module and a Key Vision Selection (KVS) module to simultaneously seek essential information from both textual and visual modalities. Specifically, the CCR module enriches the dialogue context by retrieving relevant comments from other comment blocks, while the KVS module utilizes a spatial-temporal Transformer with cross-modal attention to highlight the most crucial information in the video. Moreover, we build a large-scale user-level multimodal chitchat (UMC) dataset with exact comment-response interactions to better investigate RGVC. Extensive experiments demonstrate that our model effectively captures human points of interest and generates more fluent and diverse responses than state-of-the-art methods in both open and closed resources.
|
| |
| 15:00-16:30, Paper TuI2I.155 | Add to My Program |
| MIND-Calib: Multi-View, Intensity and Depth-Driven Dense 2D–3D Alignment for Single-Frame LiDAR–Camera Extrinsic Calibration |
|
| Liu, Shezhong | Technische Universität Berlin |
| Chen, Zibin | Technical University of Berlin / LiangDao GmbH |
Keywords: Calibration and Identification, Sensor Fusion, Hardware-Software Integration in Robotics
Abstract: Extrinsic calibration between LiDAR and camera is a crucial step in multi-sensor fusion, where targetless approaches have attracted increasing attention for their flexibility and reusability. However, existing methods still suffer from three major limitations: time-consuming data preparation, lack of robustness under sparse single-frame input, and limited generalization across diverse LiDAR architectures. We propose MIND-Calib, a truly single-frame, targetless calibration framework. The method generates depth and intensity images through virtual multi-view projection, and performs image-domain completion and back-projection to densify the point cloud and construct sub-pixel 2D–3D correspondences. High-precision extrinsics are then estimated via dual-channel cross-modal matching that leverages both depth and intensity modalities. Experiments on three representative LiDAR types (MEMS-based, solid-state, and mechanical spinning) as well as on public datasets demonstrate an average accuracy of 2.85 cm (with respect to an average scene depth of 40 meters) in translation and 0.20° in rotation. More importantly, MIND-Calib not only achieves true single-frame calibration without any additional preparation, but also maintains stable accuracy under sparse inputs and exhibits strong generalization and robustness across devices and challenging environments.
|
| |
| 15:00-16:30, Paper TuI2I.156 | Add to My Program |
| Air Compressor Control Optimization in Commercial Vehicles Using Reinforcement Learning |
|
| Schuppert, Fabian | Institute of Information Processing, Leibniz Universität Hannover; ZF CV Systems Hannover GmbH |
| Gorek, Norman-Raul | Leibniz University Hanover, ZF CV Systems Hannover GmbH |
| von Marcard, Timo | Hochschule Hannover - University of Applied Sciences and Arts |
| Rosenhahn, Bodo | Institute of Information Processing, Leibniz Universität Hannover |
|
|
| |
| 15:00-16:30, Paper TuI2I.157 | Add to My Program |
| CPBA-LIWO: Continuous-Time LiDAR-Inertial-Wheel Odometry Based on Probabilistic Bundle Adjustment |
|
| Wu, Song | Northeastern University |
| Zhang, Yunzhou | Northeastern University |
| Lv, Yuezhang | Northeastern University |
| Li, Wu | Northeastern University |
| Wang, Sizhan | Northeastern University |
| Yan, Su | Northeastern University |
| Ding, Hengwang | Northeastern University |
Keywords: SLAM, Wheeled Robots
Abstract: LiDAR-based odometry is widely used in ground robot localization. However, current methods encounter challenges in accuracy and robustness due to structural degradation, system observational error, and accumulated error. To address the above issues, we propose CPBA-LIWO, a continuous-time LiDAR-Inertial-Wheel (LIW) odometry based on probabilistic bundle adjustment (PBA) within a sliding window. This method constructs a general wheel model, which is used for the complementary fusion of LiDAR, IMU and wheel data through a continuous-time trajectory using a B-spline curve, thereby improving the robustness of the system in structurally degraded environments. Furthermore, to improve the accuracy of long-distance odometry, we propose a probabilistic model for the voxel plane and implement a sliding-window voxel PBA backend based on this model. The experimental results on the M2DGR-plus and KAIST datasets demonstrate that our method outperforms state-of-the-art LiDAR-based odometry in terms of accuracy and robustness.
|
| |
| 15:00-16:30, Paper TuI2I.158 | Add to My Program |
| MAKP: Multi-Mode Accurate Kicking Policy for Humanoid Robots |
|
| Zhang, Zheng | Shanghai Jiao Tong University |
| Xu, Kaiyang | Shanghai Jiao Tong University |
| Cao, Zhanxiang | Shanghai Jiao Tong University |
| Chen, Yizhi | Tongji University |
| Wang, Peng | Shanghai Jiao Tong University |
| Li, Haoyang | Shanghai Jiao Tong University |
| Zhang, Yang | Shanghai Jiao Tong University |
| Fu, Shengcheng | Tongji University |
| Shen, Xin | Shanghai Jiao Tong University |
| Yang, Xiaokang | Shanghai Jiao Tong University |
| Gao, Yue | Shanghai Jiao Tong University |
Keywords: Imitation Learning, Humanoid Robot Systems, Whole-Body Motion Planning and Control
Abstract: Humanoid robot soccer players face fundamental challenges in achieving stable motion execution and ball trajectory control, particularly under balance constraints during single-leg support phases. In this paper, we introduce MAKP (Multi-mode Accurate Kicking Policy), a novel motion generation-based end-to-end kicking paradigm that enables humanoid robots to perform accurate ball kicking while executing diverse kicking motions. MAKP uniquely integrates a diffusion-based motion generator to produce varied kicking trajectories and employs a three-stage learning strategy to address the inherent trade-off between motion similarity and kicking performance. Stage I focuses on stable motion tracking and single-leg balance maintenance, while Stage II optimizes ball kicking capabilities. In Stage III, we introduce a Multi-Critic mechanism combined with curriculum learning to further enhance the balance between kicking accuracy, motion similarity and robot stability. Real-world experiments on the Booster T1 platform validate the effectiveness of our approach.
|
| |
| 15:00-16:30, Paper TuI2I.159 | Add to My Program |
| Orbital Stabilization and Time Synchronization of Unstable Periodic Motions in Underactuated Robots |
|
| Surov, Maksim | Sirius University of Science and Technology |
| Grigorov, Maksim | Sirius University of Science and Technology |
| Gusev, Sergei V. | Sirius University of Science and Technology |
| Sumenkov, Oleg Yu. | Sirius University of Science and Technology |
Keywords: Underactuated Robots, Motion Control, Multi-Robot Systems
Abstract: This paper presents a control methodology for achieving orbital stabilization with simultaneous time synchronization of periodic trajectories in underactuated robotic systems. The proposed approach extends the classical transverse linearization framework to explicitly incorporate time-desynchronization dynamics. To stabilize the resulting extended transverse dynamics, we employ a combination of time-varying LQR and sliding-mode control. The theoretical results are validated experimentally through the implementation of both centralized and decentralized control strategies on a group of six Butterfly robots.
|
| |
| 15:00-16:30, Paper TuI2I.160 | Add to My Program |
| NavDP: Learning Sim-To-Real Navigation Diffusion Policy with Privileged Information Guidance |
|
| Cai, Wenzhe | Shanghai AI Laboratory |
| Peng, Jiaqi | Tsinghua University |
| Yang, Yuqiang | Shanghai AI Laboratory |
| Zhang, Yujian | Zhejiang University |
| Wei, Meng | The University of Hong Kong |
| Wang, Hanqing | Shanghai AI Laboratory |
| Chen, Yilun | Shanghai AI Laboratory |
| Wang, Tai | Shanghai AI Laboratory |
| Pang, Jiangmiao | Shanghai AI Laboratory |
Keywords: Vision-Based Navigation, Imitation Learning, Collision Avoidance
Abstract: Learning to navigate in dynamic and complex open-world environments is a critical yet challenging capability for autonomous robots. Existing approaches often rely on cascaded modular frameworks, which require extensive hyperparameter tuning or learning from limited real-world demonstration data. In this paper, we propose Navigation Diffusion Policy (NavDP), an end-to-end network trained solely in simulation that enables zero-shot sim-to-real transfer across diverse environments and robot embodiments. The core of NavDP is a unified transformer-based architecture that jointly learns trajectory generation and trajectory evaluation, both conditioned solely on local RGB-D observations. By learning to predict critic values for contrastive trajectory samples, our proposed approach effectively leverages supervision from privileged information available in simulation, thereby fostering accurate spatial understanding and enabling the distinction between safe and dangerous behaviors. To support this, we develop an efficient data generation pipeline in simulation and construct a large-scale dataset encompassing over one million meters of navigation experience across 3,000 scenes. Empirical experiments in both simulated and real-world environments demonstrate that NavDP significantly outperforms prior state-of-the-art methods. Furthermore, we identify key factors influencing the generalization performance of NavDP. The dataset and code are publicly available at https://wzcai99.github.io/navigation-diffusion-policy.github.io.
|
| |
| 15:00-16:30, Paper TuI2I.161 | Add to My Program |
| Calibration-Free Gas Source Localization with Mobile Robots: Source Term Estimation Based on Concentration Measurement Ranking |
|
| Jin, Wanting | EPFL |
| Duranceau, Agatha | EPFL |
| Erunsal, Izzet Kagan | EPFL |
| Martinoli, Alcherio | EPFL |
Keywords: Environment Monitoring and Management, Search and Rescue Robots, Probability and Statistical Methods
Abstract: Efficient Gas Source Localization (GSL) in real-world settings is crucial, especially in emergency scenarios. Mobile robots equipped with low-cost, in-situ gas sensors offer a safer alternative to human inspection in hazardous environments. Probabilistic algorithms enhance GSL efficiency by comparing scattered gas concentration measurements gathered by robots to physical dispersion models. However, accurately deriving gas concentrations from data acquired with low-cost sensors is challenging due to the nonlinear sensor response, environmental dependencies (e.g., humidity, temperature, and other gas influences), and robot motion. Mitigating these disturbance factors requires frequent sensor calibration in controlled environments, which is often impractical for real-world deployments. To overcome these issues, we propose a novel feature extraction algorithm that leverages the relative ranking of gas measurements within the dynamically accumulated dataset. By comparing the rank differences between gathered and modeled values, we estimate the probabilistic distribution of source locations across the entire environment. We validate our approach in high-fidelity simulations and physical experiments, demonstrating consistent localization accuracy with uncalibrated gas sensors. Compared to existing methods, our technique eliminates the need for gas sensor calibration, making it well-suited for real-world applications.
|
| |
| 15:00-16:30, Paper TuI2I.162 | Add to My Program |
| Planning of Robotic High Intensity Focused Ultrasound Ablation Via Predictive Models of Thermal Lesions |
|
| Parrotta, Francesca | Scuola Superiore Sant’Anna |
| Papagiannaki, Iro | Scuola Superiore Sant' Anna |
| Tognarelli, Selene | Scuola Superiore Sant'Anna |
| Diodato, Alessandro | Scuola Superiore Sant'Anna, the BioRobotics Institute |
| Menciassi, Arianna | Scuola Superiore Sant'Anna - SSSA |
Keywords: Surgical Robotics: Planning, Medical Robots and Systems
Abstract: High-Intensity Focused Ultrasound (HIFU) is a non-invasive therapeutic technology enabling precise energy delivery for the selective ablation of tumors, while preserving surrounding healthy tissue. Currently, there is no gold standard for defining sonication parameters to cover a tumor surface and volume, as this decision relies solely on the physician’s experience. This work proposes a novel planning algorithm for robotic HIFU procedures that ensures automatic and optimized tumor coverage. The approach relies on a predictive model that estimates the dimensions of HIFU-induced thermal lesions based on sonication parameters (source pressure amplitude and sonication time) and leverages genetic algorithms to compute single lesion placement over the treatment area. The optimization function primarily aims to maximize the surface coverage over a defined target area and then integrates motion planning algorithms. In addition to planar lesions, a volumetric ablation composed of a set of co-planar surface lesions was also evaluated. The method was experimentally validated on ex-vivo tissues through a robotic ultrasound-guided (USg) HIFU platform. This study bridges pre-operative lesion prediction and intra-operative robotic execution, supporting standardized and effective HIFU therapy.
|
| |
| 15:00-16:30, Paper TuI2I.163 | Add to My Program |
| Magnet-Based Soft Robotic Skin Using a 3D-Printed Multi-Lattice Structure and CNN-Based Tactile Super-Resolution |
|
| Bang, Yunseong | DGIST |
| Park, Joowon | University of Ulsan |
| Sim, Suan | DGIST |
| Ryu, Youngjun | DGIST |
| Park, Sukho | DGIST |
| Park, Kyungseo | DGIST |
Keywords: Force and Tactile Sensing, Additive Manufacturing, Physical Human-Robot Interaction
Abstract: This paper presents a magnet-based robotic skin that integrates a multilayer soft lattice with distributed Hall-effect sensor arrays and a tactile super-resolution model. External contact forces are converted to magnetic field changes by embedded permanent magnets, and the lattice spreads these changes across the sensing domain. This gives each sensor a large, overlapping receptive field and enables a large sensing area with minimal blind spots. Lattice parameters are tunable, enabling joint adjustment of mechanical compliance and transduction characteristics. An implicit modeling workflow and selective laser sintering (SLS) 3D printing support rapid fabrication of conformal, high-complexity structures. A convolutional neural network trained on experimental measurements estimates contact location and normal force in real time. Experiments validate localization accuracy and indicate scalability to larger surfaces, suggesting applicability to whole-body robotic skin and safe human-robot interaction.
|
| |
| 15:00-16:30, Paper TuI2I.164 | Add to My Program |
| OCLPlace: Online Continual Learning on LiDAR Streams for Place Recognition |
|
| Liu, BinHong | Northwestern Polytechnical University |
| Ye, Kaixiao | Northwestern Polytechnical University |
| Fang, YangWang | Northwestern Polytechnical University |
| Yan, Zhi | École Nationale Supérieure De Techniques Avancées (ENSTA) |
| Yang, Tao | Northwestern Polytechnical University |
Keywords: Localization, Incremental Learning, SLAM
Abstract: LiDAR place recognition is a critical component of LiDAR-based localization pipelines, tasked with identifying previously visited places across diverse environments and temporal conditions. A growing body of deep learning–based approaches has recently tackled this problem. However, their performance often degrades when the models are deployed in unseen environments. Although offline fine-tuning can partly recover performance, it is prone to catastrophic forgetting of previously acquired knowledge and cannot respond quickly enough to rapidly changing data distributions. In this paper, we introduce OCLPlace, an online continual learning framework that learns directly from highly temporally correlated LiDAR streams and balances rapid domain adaptation with resistance to catastrophic forgetting. To the best of our knowledge, OCLPlace is the first LiDAR place-recognition approach enhanced by online continual learning that can automatically adapt to new environments while mitigating catastrophic forgetting. Experimental results on six large-scale datasets, which cover both ground-view and aerial-view scenarios, demonstrate the effectiveness and robustness of our method. The source code will be publicly available at: https://github.com/npu-ius-lab/OCLPlace.
|
| |
| 15:00-16:30, Paper TuI2I.165 | Add to My Program |
| Reliable LiDAR Loop Detection through Structural Descriptors and Semantic Graph Matching |
|
| Tang, Yujie | Beijing Institute of Technology |
| Zuo, Sibo | Beijing Institute of Technology |
| Wang, Meiling | Beijing Institute of Technology |
| Dou, Jianyu | Beijing Institute of Technology |
| Wang, Jiahui | Beijing Institute of Technology |
| Yue, Yufeng | Beijing Institute of Technology |
Keywords: Mapping, Localization
Abstract: Outdoor loop closure detection is essential for mitigating accumulated drift in SLAM and generating a globally consistent map. Semantic graph matching methods utilize object-level topology for distinctive scene representation but rely on environments with rich and distinguishable objects. Moreover, accurately matching nodes remains difficult due to ambiguities among same-class semantic nodes. These challenges limit their effectiveness in varied road environments, highlighting the need for representations that are both robust and adaptable. To address this, we introduce SD-SGM, a novel loop closure detection framework combining the powerful context-adaptation capabilities of structural descriptors with the high-level semantic reasoning abilities of semantic graphs. Initially, we extract semantic graphs alongside global structural descriptors from point clouds. Distinctive local graph features are then used to generate candidate node pairs, and the maximal clique algorithm identifies correspondences that are globally consistent. The similarity scores of both methods are then evaluated, and a cross-validation mechanism assesses their reliability and adaptively weights them. Extensive loop closure detection experiments on various datasets demonstrate that SD-SGM achieves state-of-the-art (SOTA) performance compared to strong baselines. Additionally, we verify its effectiveness in improving SLAM trajectory accuracy. We provide the code at: https://github.com/BIT-TYJ/SD-SGM.
|
| |
| 15:00-16:30, Paper TuI2I.166 | Add to My Program |
| M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation |
|
| Dong, Ju | Technical University of Munich |
| Zhang, Lei | University of Hamburg |
| Zhang, Liding | Technical University of Munich |
| Ling, Yao | Technical University of Munich (TUM) |
| Fu, Yu | Technical University of Munich |
| Bai, Kaixin | University of Hamburg |
| Marton, Zoltan-Csaba | Agile Robots SE |
| Bing, Zhenshan | Technical University of Munich |
| Chen, Zhaopeng | University of Hamburg |
| Knoll, Alois | Tech. Univ. Muenchen TUM |
| Zhang, Jianwei | University of Hamburg |
Keywords: Mobile Manipulation, Task and Motion Planning, Deep Learning Methods
Abstract: Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments due to limited fields of view, exploration, and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller for mobile manipulation. The diffusion policy leverages proprioceptive states and complementary camera perspectives with both close-range object details and global scene context to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 8%–56% higher success rates and reduces collisions by 3%–32% compared to baselines. Our approach demonstrates robust performance in smooth whole-body coordination and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/m4diffuser-anonymous.
|
| |
| 15:00-16:30, Paper TuI2I.167 | Add to My Program |
| HeRO: Hierarchical 3D Semantic Representation for Pose-Aware Object Manipulation |
|
| Xu, Chongyang | Sichuan University |
| Cheng, Shen | Megvii.com; Dexmal.com |
| Li, Haipeng | University of Electronic Science and Technology of China |
| Fan, Haoqiang | Megvii Inc |
| Feng, Ziliang | Sichuan University |
| Liu, Shuaicheng | University of Electronic Science and Technology of China |
Keywords: Imitation Learning, Representation Learning, Dual Arm Manipulation
Abstract: Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on global and local fields using permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state-of-the-art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.
|
| |
| 15:00-16:30, Paper TuI2I.168 | Add to My Program |
| History-Aware Visuomotor Policy Learning Via Point Tracking |
|
| Chen, Jingjing | Shanghai Jiao Tong University |
| Fang, Hongjie | Shanghai Jiao Tong University |
| Wang, Chenxi | Shanghai Noematrix Intelligence Technology Ltd |
| Wang, Shiquan | Shanghai Flexiv Robotics Technology CO., LTD |
| Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a compact and structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a compact history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements - such as task stage identification, spatial memorization, and action counting, as well as longer-term demands like continuous and pre-loaded memory - and consistently outperforms both Markovian baselines and prior history-based approaches. Project website: https://tonyfang.net/history/.
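The object-centric abstraction described above — collapsing per-point track histories into compact per-object features — can be sketched as follows. This is a deliberately simple stand-in for the paper's learned encoder (the mean first-to-last displacement per object); the function and argument names are illustrative, not the authors' API.

```python
def aggregate_history(tracks, object_of):
    """Collapse per-point track histories into compact per-object features.

    tracks:    dict point_id -> list of (x, y) positions over past frames.
    object_of: dict point_id -> object_id.
    Each object's feature here is the mean displacement of its tracked points
    from first to last frame -- a simple stand-in for a learned aggregator,
    kept only to show the object-centric abstraction of the history.
    """
    sums, counts = {}, {}
    for pid, hist in tracks.items():
        obj = object_of[pid]
        dx = hist[-1][0] - hist[0][0]
        dy = hist[-1][1] - hist[0][1]
        sx, sy = sums.get(obj, (0.0, 0.0))
        sums[obj] = (sx + dx, sy + dy)
        counts[obj] = counts.get(obj, 0) + 1
    return {obj: (sx / counts[obj], sy / counts[obj])
            for obj, (sx, sy) in sums.items()}
```

The per-object summary, rather than raw frames, is what gets concatenated to the policy input, which keeps the history representation constant-size regardless of horizon length.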
|
| |
| 15:00-16:30, Paper TuI2I.169 | Add to My Program |
| One-Policy-Fits-All: Geometry-Aware Action Latents for Cross-Embodiment Manipulation |
|
| Mu, Juncheng | Tsinghua University |
| Yang, Sizhe | The Chinese University of Hong Kong |
| Bae, Hojin | Tsinghua University |
| Jia, Feiyu | University of Science and Technology of China |
| Ben, Qingwei | The Chinese University of Hong Kong |
| Li, Boyi | NVIDIA |
| Xu, Huazhe | Tsinghua University |
| Pang, Jiangmiao | Shanghai AI Laboratory |
Keywords: Grippers and Other End-Effectors, Dexterous Manipulation, Multifingered Hands
Abstract: Cross-embodiment manipulation is crucial for enhancing the scalability of robot manipulation and reducing the high cost of data collection. However, the significant differences between embodiments, such as variations in action spaces and structural disparities, pose challenges for joint training across multiple sources of data. To address this, we propose One-Policy-Fits-All (OPFA), a framework that enables learning a single, versatile policy across multiple embodiments. We first learn a Geometry-Aware Latent Representation (GaLR), which leverages 3D convolution networks and transformers to build a shared latent action space across different embodiments. Then we design a unified latent retargeting decoder that extracts embodiment-specific actions from the latent representations, without any embodiment-specific decoder tuning. OPFA enables end-to-end co-training of data from any embodiment, including various grippers and dexterous hands with arbitrary degrees of freedom, significantly improving data efficiency and reducing the cost of skill transfer. We conduct extensive experiments across 11 different end-effectors. The results demonstrate that OPFA significantly improves policy performance in diverse settings by leveraging heterogeneous embodiment data. For instance, cross-embodiment co-training can improve success rates by more than 50% compared to single-source training. Moreover, by adding only a few demonstrations from a new embodiment (e.g., eight), OPFA can achieve performance comparable to that of a well-trained model with 72 demonstrations.
|
| |
| 15:00-16:30, Paper TuI2I.170 | Add to My Program |
| SaLF: Sparse Local Fields for Multi-Sensor Rendering in Real-Time |
|
| Chen, Yun | Waabi, University of Toronto |
| Haines, Matthew | University of Waterloo |
| Wang, Jingkang | Waabi, University of Toronto |
| Jain, Sahil | Waabi |
| Baron-Lis, Krzysztof | Waabi |
| Manivasagam, Siva | Waabi, University of Toronto |
| Yang, Ze | Waabi, University of Toronto |
| Urtasun, Raquel | Waabi, University of Toronto |
Keywords: Computer Vision for Automation, Autonomous Vehicle Navigation, Simulation and Animation
Abstract: High-fidelity sensor simulation of light-based sensors such as cameras and LiDARs is critical for safe and accurate autonomy testing. Neural radiance field (NeRF)-based methods that reconstruct sensor observations via ray-casting of implicit representations have demonstrated accurate simulation of driving scenes, but are slow to train and render, hampering scalability. 3D Gaussian Splatting (3DGS) has demonstrated faster training and rendering times through rasterization, but is primarily restricted to pinhole camera sensors, preventing usage for realistic multi-sensor autonomy evaluation. Moreover, both NeRF and 3DGS couple the representation with the rendering procedure (implicit networks for ray-based evaluation, particles for rasterization), preventing interoperability, which is key for general usage. In this work, we present Sparse Local Fields (SaLF), a novel volumetric representation that supports rasterization and raytracing. SaLF represents volumes as a sparse set of 3D voxel primitives, where each voxel is a local implicit field. SaLF has fast training (<30 min) and rendering capabilities (50+ FPS for camera and 600+ FPS for LiDAR), has adaptive pruning and densification to easily handle large scenes, and can support non-pinhole cameras and spinning LiDARs. We demonstrate that SaLF delivers realism comparable to existing self-driving sensor simulation methods while improving efficiency and enhancing capabilities, thereby enabling more scalable simulation.
|
| |
| 15:00-16:30, Paper TuI2I.171 | Add to My Program |
| Multi-View Gating Unit with KL-Based Alignment Toward Real-World Robot Control |
|
| Igarashi, Kei | Keio University |
| Murata, Shingo | Keio University |
Keywords: Cognitive Control Architectures, Machine Learning for Robot Control, Representation Learning
Abstract: This paper proposes a framework for integrating latent representations from multi-view images, using adaptive weighting based on situational context to facilitate the generation of robot actions. Specifically, we introduce the multi-view gating unit (MGU), which assigns context-dependent weights to each dimension of the latent representations extracted from different viewpoints. By summing the corresponding dimensions across all viewpoints, we construct a fused latent representation that serves as input to a policy model. To enhance the effectiveness of the MGU and improve the accuracy of action generation, we incorporate a Kullback–Leibler (KL)-based alignment objective that encourages consistency between individual viewpoint representations and the fused representation. We evaluate the proposed framework through imitation-learning experiments in a kitchen-like real-robot environment across five tasks. The experimental results show that the MGU dynamically adapts to different contexts, thereby enabling successful task execution. Additionally, we compare our approach with a modified Action Chunking with Transformers (ACT) baseline and conduct an ablation study to assess the contribution of each component. The results show that our method achieves a task success rate of 84%, outperforming all baseline methods and validating the effectiveness of both the individual components and their integration within the proposed framework.
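The gating mechanism described above — per-dimension, context-dependent weights that are normalized across viewpoints and then summed — can be sketched in a few lines. This is a hedged illustration of the fusion step only, assuming softmax normalization of learned gate logits; the names and the use of plain lists are ours, not the paper's implementation.

```python
import math

def mgu_fuse(view_latents, gate_logits):
    """Fuse per-view latent vectors with per-dimension context-dependent weights.

    view_latents: list of V latent vectors, each of length D.
    gate_logits:  list of V logit vectors of the same shape, assumed to come
                  from a learned gating network (illustrative here).
    For each latent dimension, weights are softmax-normalized across views,
    then the weighted views are summed into one fused vector.
    """
    V, D = len(view_latents), len(view_latents[0])
    fused = [0.0] * D
    for d in range(D):
        exps = [math.exp(gate_logits[v][d]) for v in range(V)]
        z = sum(exps)
        for v in range(V):
            fused[d] += (exps[v] / z) * view_latents[v][d]
    return fused
```

With equal logits this reduces to a per-dimension average of the viewpoints; skewed logits let one viewpoint dominate a dimension, which is the context adaptation the MGU is after.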
|
| |
| 15:00-16:30, Paper TuI2I.172 | Add to My Program |
| Validation of Space Robotics in Underwater Environments Via Disturbance Robustness Equivalency |
|
| Verhagen, Joris | KTH Royal Institute of Technology |
| Krantz, Elias | KTH Royal Institute of Technology |
| Sidrane, Chelsea | KTH Royal Institute of Technology |
| Dörner, David | KTH Royal Institute of Technology |
| De Carli, Nicola | KTH Royal Institute of Technology |
| Roque, Pedro | Caltech |
| Mao, Huina | KTH Royal Institute of Technology |
| Tibert, Gunnar | KTH Royal Institute of Technology |
| Stenius, Ivan | KTH Royal Institute of Technology |
| Fuglesang, Christer | KTH Royal Institute of Technology |
| Dimarogonas, Dimos V. | KTH Royal Institute of Technology |
| Tumova, Jana | KTH Royal Institute of Technology |
Keywords: Space Robotics and Automation, Integrated Planning and Control, Software-Hardware Integration for Robot Systems
Abstract: We present an experimental validation framework for space robotics that leverages underwater environments to approximate microgravity dynamics. While neutral buoyancy conditions make underwater robotics an excellent platform for space robotics validation, there are still dynamical and environmental differences that need to be overcome. Given a high-level space mission specification, expressed in terms of a Signal Temporal Logic specification, we overcome these differences via the notion of maximal disturbance robustness of the mission. We formulate the motion planning problem such that the original space mission and the validation mission achieve the same disturbance robustness degree. The validation platform then executes its mission plan using a near-identical control strategy to the space mission where the closed-loop controller considers the spacecraft dynamics. Evaluating our validation framework relies on estimating disturbances during execution and comparing them to the disturbance robustness degree, providing practical evidence of operation in the space environment. Our evaluation features a dual-experiment setup: an underwater robot operating under near-neutral buoyancy conditions to validate the planning and control strategy of either an experimental planar spacecraft platform or a CubeSat in a high-fidelity space dynamics simulator.
|
| |
| 15:00-16:30, Paper TuI2I.173 | Add to My Program |
| Learning Generalizable Robot Policy with Human Demonstration Video As a Prompt |
|
| Zhu, Xiang | Tsinghua University |
| Liu, Yichen | Tsinghua University |
| Li, Hezhong | ShanghaiTech University |
| Chen, Jianyu | Tsinghua University |
Keywords: Learning from Demonstration, Dexterous Manipulation
Abstract: Recent robot learning methods commonly rely on imitation learning from massive robotic datasets collected with teleoperation. When facing a new task, such methods generally require collecting a set of new teleoperation data and finetuning the policy. Furthermore, the teleoperation data collection pipeline is tedious and expensive. In contrast, humans are able to learn new tasks efficiently just by watching others. In this paper, we introduce a novel two-stage framework that utilizes human demonstrations to learn a generalizable robot policy. Such a policy can directly take a human demonstration video as a prompt and perform new tasks without any new teleoperation data or model finetuning. In the first stage, we train a video generation model that captures a joint representation of both the human and robot demonstration video data using cross-prediction. In the second stage, we fuse the learned representation with a shared action space between human and robot using a novel prototypical contrastive loss. Empirical evaluations on real‑world dexterous manipulation tasks show the effectiveness and generalization capabilities of our proposed method.
|
| |
| 15:00-16:30, Paper TuI2I.174 | Add to My Program |
| MOSAIC: Multi-Objective Optimization from Zero-Shot Language Reasoning in Preference-Based RL |
|
| Marta, Daniel | KTH Royal Institute of Technology |
| Holk, Simon | The University of Tokyo |
| Leite, Iolanda | KTH Royal Institute of Technology |
Keywords: Human Factors and Human-in-the-Loop, Machine Learning for Robot Control, Reinforcement Learning
Abstract: Preference-based Reinforcement Learning (RL) enables humans to shape complex goals via preference comparisons between sequences of state-action pairs. Most of the existing approaches focus on a singular objective, overlooking the complex causal reasoning that underpins preferences. However, many real-world challenges are multi-dimensional, and individuals can have different reasons behind their preferences. In this work, we rethink preference-based RL from a multi-objective perspective by distilling human preferences into multiple components. We leverage the zero-shot capabilities of large language models (LLMs) to infer preferences and better align various objectives from text prompts. This allows us to train an ensemble of reward functions, each optimizing for a specific objective. We demonstrate that our approach can address a variety of multi-objective control tasks, improving on approaches that consider a single preference per objective. We show the effectiveness of our approach in better shaping reward functions by utilizing real human preferences and prompts. Our code for the benchmarks, along with additional supplementary details, is available at https://sites.google.com/view/multi-pref/.
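The ensemble-of-reward-functions idea above — one reward per inferred objective, combined with importance weights — can be sketched as a simple scalarization. This is an illustrative sketch only: the plain callables and normalized weighted sum stand in for MOSAIC's trained reward models and LLM-inferred objective weights.

```python
def scalarized_reward(state_action, reward_fns, weights):
    """Combine an ensemble of per-objective reward functions into one scalar.

    reward_fns: list of callables r_i(state_action) -> float, one per
                objective (in the paper each would be trained from human
                preferences on that objective; plain functions here).
    weights:    non-negative importance weights, e.g. inferred from a
                language prompt; normalized below so they sum to one.
    """
    total_w = float(sum(weights))
    return sum((w / total_w) * r(state_action)
               for w, r in zip(weights, reward_fns))
```

Training the RL agent on this scalar (or on the weight vector itself, for a multi-objective policy) is then a standard reward-shaping step.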
|
| |
| 15:00-16:30, Paper TuI2I.175 | Add to My Program |
| Audio-2-Shape: 3D Generation from What You Hear |
|
| He, Xuran | SouthWest University |
| Han, Xian-Feng | Southwest University |
| Sun, Shi-Jie | Changan University |
Keywords: Deep Learning for Visual Perception, Visual Learning
Abstract: Audio serves as an important bridge connecting humans to their surroundings, providing a unique modality for perceiving the world. For embodied AI systems, such as robots and autonomous vehicles, enabling them to understand the world through sound is a promising and significant research direction. In this paper, we explore the underexplored domain of audio-driven 3D shape generation and propose a novel architecture for audio-conditioned 3D shape synthesis. Specifically, our framework comprises three key modules: cross-modal alignment, a latent diffusion model for generation, and a 3D Gaussian Splatting (3DGS) based optimization module. We first align audio and 3D shape representations within a unified embedding space using a contrastive learning strategy, which conditions a latent diffusion model to generate an initial coarse 3D structure. Subsequently, we introduce a refinement stage utilizing 3D Gaussian Splatting to produce high-fidelity 3D shapes. Extensive qualitative and quantitative experiments validate the effectiveness of our proposed method, demonstrating its capability to generate semantically coherent 3D shapes from audio input.
|
| |
| 15:00-16:30, Paper TuI2I.176 | Add to My Program |
| A Multisensory Neurofeedback–Based Immersive BCI Paradigm for Emotion Regulation |
|
| Zhang, Lianchi | University of Electronic Science and Technology of China |
| Fu, Bowen | University of Electronic Science and Technology of China |
| Yang, Shoucheng | University of Electronic Science and Technology of China |
| Zhang, Jingting | University of Electronic Science and Technology of China |
| Huang, Zonghai | University of Electronic Science and Technology of China |
| Huang, Rui | University of Electronic Science and Technology of China |
| Cheng, Hong | University of Electronic Science and Technology |
Keywords: Brain-Machine Interfaces, Virtual Reality and Interfaces, Human-Centered Automation
Abstract: Enhancing brain activation efficiency is crucial in developing brain-computer interface (BCI) paradigms for cognitive rehabilitation. However, most existing BCI paradigms achieve only limited sensory activation without sufficient feedback to mind and body, significantly limiting user engagement and training efficiency. In this study, we propose a novel multisensory neurofeedback framework to develop an immersive BCI paradigm for emotion regulation, supported by a novel panoramic motion-based virtual reality system. The paradigm is designed to promote deeper cognitive and physical involvement in functional brain training. It delivers multisensory neurofeedback—visual, auditory, and motor—through the Gait Real-time Analysis Interactive Lab system and incorporates cognitive reappraisal from the modified Gross procedure for emotion regulation. Its effectiveness is validated through three experimental studies, including event-related potential analysis, power spectral density analysis, and brain network analysis. The results demonstrate that the paradigm enhances motor–cognitive interaction and multisensory coordination, effectively increasing brain activation in visual, auditory, and motor processing regions, and further promoting stronger engagement of emotion regulation-related areas such as the prefrontal cortex. Compared with conventional paradigms, the proposed paradigm increases the number of high-intensity functional connections by 28.6% (from 42 to 54) and the number of effective functional connections by 42.3% (from 71 to 101).
|
| |
| 15:00-16:30, Paper TuI2I.177 | Add to My Program |
| FreeTacMan: Robot-Free Visuo-Tactile Data Collection System for Contact-Rich Manipulation |
|
| Wu, Longyan | Shanghai Innovation Institute, Fudan University |
| Yu, Checheng | The University of Hong Kong |
| Ren, Jieji | Shanghai Jiao Tong University |
| Chen, Li | The University of Hong Kong |
| Jiang, Yufei | ShangHai University |
| Huang, Ran | Fudan University |
| Gu, Guoying | Shanghai Jiao Tong University |
| Li, Hongyang | The University of Hong Kong, Shanghai Innovation Institute |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Data Sets for Robot Learning
Abstract: Enabling robots with contact-rich manipulation remains a pivotal challenge in robot learning, which is substantially hindered by the data collection gap, including its inefficiency and limited sensor setup. While prior work has explored handheld paradigms, their rod-based mechanical structures remain rigid and unintuitive, providing limited tactile feedback and posing challenges for operators. Motivated by the dexterity and force feedback of human motion, we propose FreeTacMan, a human-centric and robot-free data collection system for accurate and efficient robot manipulation. Concretely, we design a wearable gripper with visuo-tactile sensors for data collection, which can be worn by human fingers for intuitive control. A high-precision optical tracking system is introduced to capture end-effector poses while synchronizing visual and tactile feedback simultaneously. We leverage FreeTacMan to collect a large-scale multimodal dataset, comprising over 3000k paired visuo–tactile images with end-effector poses, 10k demonstration trajectories across 50 diverse contact-rich manipulation tasks. FreeTacMan achieves multiple improvements in data collection performance over prior works and enables effective policy learning from self-collected datasets. By open-sourcing the hardware and the dataset, we aim to facilitate reproducibility and support research in visuo-tactile manipulation.
|
| |
| 15:00-16:30, Paper TuI2I.178 | Add to My Program |
| Manufacturing Micro-Patterned Surfaces with Multi-Robot Systems |
|
| Taylor, Annalisa T. | Northwestern University |
| Landis, Malachi | Northwestern University |
| Guo, Ping | Northwestern University |
| Murphey, Todd D. | Northwestern University |
Keywords: Multi-Robot Systems, Intelligent and Flexible Manufacturing, Agent-Based Systems
Abstract: Applying micro-patterns to surfaces has been shown to impart useful physical properties such as drag reduction and hydrophobicity. However, current manufacturing techniques cannot produce micro-patterned surfaces at scale due to high-cost machinery and inefficient coverage techniques such as raster-scanning. In this work, we use multiple robots, each equipped with a patterning tool, to manufacture these surfaces. To allow these robots to coordinate during the patterning task, we use the ergodic control algorithm, which specifies coverage objectives using distributions. We demonstrate that robots can divide complicated coverage objectives by communicating compressed representations of their trajectory history both in simulations and experimental trials. Further, we show that robot-produced patterning can lower the coefficient of friction of metallic surfaces. This work demonstrates that distributed multi-robot systems can coordinate to manufacture products that were previously unrealizable at scale.
|
| |
| 15:00-16:30, Paper TuI2I.179 | Add to My Program |
| Multi-Depth Uniform Coverage Path Planning for Unmanned Surface Vehicle Surveying |
|
| Larrazabal, Maider | AZTI Foundation |
| Yang, Tong | Tsinghua University |
| Goienetxea, Izaro | Department of Computer Languages and Systems, University of the Basque Country (UPV/EHU) |
| Valls Miro, Jaime | University of Technology Sydney |
Keywords: Marine Robotics, Autonomous Vehicle Navigation, Task and Motion Planning
Abstract: This paper introduces a novel automatic coverage path planning algorithm for bathymetry surveying with unmanned surface vehicles. The detection range of the mapping sensor employed -- a multibeam echo sounder -- is heavily influenced by local seafloor depths. Hence, a path designed to uniformly cover the sea surface does not guarantee uniform coverage of the seafloor. Yet this is currently the typical process for bathymetric surveys, with the simplistic boustrophedon scheme along manually selected waypoints at constant depths being the most widespread planner used. The proposed scheme incorporates coarse prior depth information to pre-process the target region and adaptively guide path generation and sensing range configuration. By explicitly accounting for depth variations, the proposed algorithm designs a coverage path with optimised spacing between survey passes that adjusts the sensing beam density to achieve more consistent seafloor coverage. The proposed method is shown to offer significant improvements in both controlled and real-world scenarios. Validations in challenging synthetic terrains achieve coverage ratios beyond 99%, a marked improvement over traditional boustrophedon paths, which reach at most 75% coverage. The same trend appears in realistic simulations using real bathymetric data from a coastal harbour, with coverage reaching over 92%, significantly surpassing boustrophedon sweeps with coverage rates below 65%. Beyond improved performance, the scheme also brings a fully automated design, suitable for autonomous marine vehicles, thus offering practical utility for real-world applications.
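The depth-dependent spacing the abstract describes follows from basic multibeam geometry: on a flat seafloor the swath width grows linearly with depth, so survey lines must be spaced proportionally to local depth. A minimal sketch, with an illustrative half-angle and overlap fraction (not the paper's settings):

```python
import math

def pass_spacing(depth_m, beam_half_angle_deg=60.0, overlap=0.2):
    """Survey-line spacing for a multibeam echo sounder at a given depth.

    Flat-seafloor geometry gives swath = 2 * depth * tan(half_angle), so
    spacing adjacent passes at swath * (1 - overlap) maintains a fixed
    overlap fraction: shallower water demands tighter survey lines.
    The default half-angle and overlap are illustrative assumptions.
    """
    swath = 2.0 * depth_m * math.tan(math.radians(beam_half_angle_deg))
    return swath * (1.0 - overlap)
```

This is the reason a constant-spacing boustrophedon sweep under-covers shallow regions: its spacing is tuned to one depth, while the adaptive planner recomputes it from the coarse prior bathymetry along each pass.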
|
| |
| 15:00-16:30, Paper TuI2I.180 | Add to My Program |
| LATIOS: Latency-Aware Telemonitoring for Injection in Ophthalmic Surgery - an Adaptive Motion Scaling Approach |
|
| Hoxha, Korab | Technische Universität München |
| Henriques, Angelo | Technical University of Munich |
| Yang, Junjie | TUM |
| Ahnaf, Naheen | University of Alberta |
| Tavakoli, Mahdi | University of Alberta |
| Nasseri, M. Ali | Technische Universitaet Muenchen |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Telerobotics and Teleoperation
Abstract: Communication latency in long-distance telerobotic surgery poses a critical safety risk, particularly in high-precision procedures like retinal surgery where tool overshoots can cause irreversible patient injury. This paper introduces the Latency-Aware Telemonitoring for Injection in Ophthalmic Surgery (LATIOS) framework, which enhances safety by adaptively scaling the surgical robot's velocity along the critical axis of tool insertion. Our core contribution is a control algorithm that dynamically modulates the velocity scaling factor based on two real-time, coupled variables: the tool-tip-to-retina distance, estimated via a non-contact, shadow-based method, and the measured communication delay. We validated this system in a transatlantic user study where six participants in North America teleoperated a surgical robot in Europe to perform a series of simulated retinal puncture tasks. The results demonstrate that LATIOS provides a statistically significant reduction in applied puncture forces compared to constant control (p = 0.022). This objective safety improvement is achieved through a deliberate safety-efficiency trade-off, with the system enforcing a more cautious pace under high-latency conditions. Our work presents a robust, context-aware safety framework that addresses a key barrier to the clinical adoption of long-distance telerobotic surgery.
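The coupling the abstract describes — a velocity scale that shrinks both as the tool tip nears the retina and as measured delay grows — can be sketched as a product of two factors. This is a hedged sketch of the idea only; the functional form and the constants below are our assumptions, not the LATIOS control law.

```python
def velocity_scale(distance_mm, delay_ms, d_safe_mm=2.0, delay_ref_ms=200.0):
    """Adaptive velocity scaling along the insertion axis (illustrative).

    Returns a factor in (0, 1]: full speed only when the tool tip is far
    from the retina AND the round-trip delay is negligible.  d_safe_mm and
    delay_ref_ms are hypothetical tuning constants.
    """
    proximity = min(1.0, distance_mm / d_safe_mm)      # 0 at contact, 1 when far
    latency = 1.0 / (1.0 + delay_ms / delay_ref_ms)    # degrades with delay
    return proximity * latency
```

Multiplying the two terms enforces the safety-efficiency trade-off described above: either high latency or close proximity alone is enough to slow the insertion.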
|
| |
| 15:00-16:30, Paper TuI2I.181 | Add to My Program |
| Keypoint-Based Dynamic Object 6-DoF Pose Tracking Via Event Camera |
|
| Wang, Zhe | ShanghaiTech University |
| Song, Qijin | ShanghaiTech University |
| Li, Zihao | ShanghaiTech University |
| Xiao, Jingyu | WLSA Shanghai Academy |
| Bai, Weibang | ShanghaiTech University |
Keywords: Visual Tracking, Deep Learning for Visual Perception
Abstract: Accurate 6-DoF pose estimation of objects is critical for robots to perform precise manipulation tasks. However, for dynamic object pose estimation, conventional camera-based approaches face several major challenges, such as motion blur, sensor noise, and low-light limitation. To address these issues, we employ event cameras, whose high dynamic range and low latency offer a promising solution. Furthermore, we propose a keypoint-based detection and tracking approach for dynamic object pose estimation. Firstly, a keypoint detection network is constructed to extract keypoints from the time surface generated by the event stream. Subsequently, the polarity and spatial coordinates of the events are leveraged, and the event density in the vicinity of each keypoint is utilized to achieve continuous keypoint tracking. Finally, a hash mapping is established between the 2D keypoints and the 3D model keypoints, and the EPnP algorithm is employed to estimate the 6-DoF pose. Experimental results demonstrate that, whether in simulated or real event environments, the proposed method outperforms the event-based state-of-the-art methods in terms of both accuracy and robustness.
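The "hash mapping between the 2D keypoints and the 3D model keypoints" mentioned above amounts to an ID-indexed correspondence lookup feeding the PnP solver. A minimal sketch of that pairing step, with illustrative names (the actual pose recovery would then call an EPnP implementation on the resulting pairs):

```python
def build_correspondences(kp_ids_2d, points_2d, model_points_3d):
    """Pair detected 2D keypoints with 3D model keypoints by shared IDs.

    kp_ids_2d:       keypoint IDs produced by the detection network.
    points_2d:       matching (u, v) image coordinates.
    model_points_3d: dict ID -> (x, y, z), acting as the hash map from
                     keypoint ID to its 3D model coordinate.
    Keypoints whose ID is unknown to the model are dropped; the surviving
    (2D, 3D) pairs are what a PnP solver such as EPnP consumes.
    """
    pairs = []
    for kid, pt2d in zip(kp_ids_2d, points_2d):
        pt3d = model_points_3d.get(kid)
        if pt3d is not None:
            pairs.append((pt2d, pt3d))
    return pairs
```

Because the tracker maintains keypoint identities across the event stream, this lookup stays valid between detections, so the 6-DoF pose can be re-estimated at event rate without re-running the detector every frame.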
|
| |
| 15:00-16:30, Paper TuI2I.182 | Add to My Program |
| MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding |
|
| Paturkar, Varun | IIIT Hyderabad |
| Gangisetty, Shankar | IIIT Hyderabad |
| Jawahar, C.V. | IIIT, Hyderabad |
Keywords: Data Sets for Robotic Vision, Computer Vision for Transportation, Intelligent Transportation Systems
Abstract: Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 2,500 sequences (25+ hours of video data) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code are available at https://varuniiith.github.io/MOTOR-Dataset/
|
| |
| 15:00-16:30, Paper TuI2I.183 | Add to My Program |
| Simplified 3D Control of Magnetic Objects by a Triple-Coil Static Unit on a Robotic Arm |
|
| Cinus, Luca | Scuola Superiore Sant'Anna Pisa |
| De Oliveira Santana Alves, Jessé | Université De Rennes |
| Srymbetov, Tamerlan | The BioRobotics Institute, Scuola Superiore Sant’Anna |
| Ferro, Marco | CNRS |
| Pacchierotti, Claudio | Centre National De La Recherche Scientifique (CNRS) |
| Menciassi, Arianna | Scuola Superiore Sant'Anna - SSSA |
| Iacovacci, Veronica | Scuola Superiore Sant'Anna |
Keywords: Motion Control, Micro/Nano Robots, Optimization and Optimal Control
Abstract: Magnetic actuation is a powerful, non-contact method for controlling milli-scale robots. However, existing mobile magnetic field sources face a difficult trade-off. Single-coil end effectors are simple but underactuated, forcing complex and inefficient robot motions to steer objects. Conversely, multi-coil systems improve dexterity but introduce significant mechanical and control complexity. To cope with this challenge, we present a compact, fixed-configuration triple-coil electromagnetic end effector mounted on a 7-DOF robotic arm. Our innovation lies in a hierarchical control strategy that decouples global and local actuation. A Control Lyapunov Function-based Quadratic Programming (QP-CLF) controller guides the robotic arm for large-scale repositioning, extending the workspace and minimizing required currents. Simultaneously, modulating the currents through the three coils provides fine, high-bandwidth electrical control over the local magnetic field and gradient. We validated this approach by steering 1, 2, and 3 mm magnetic spheres along complex spiral trajectories inside fluid-filled phantoms (water and oil). Our system was teleoperated under operator vision and demonstrated highly repeatable path-passing performance, proving that this synergistic robot-electromagnet control provides a compelling balance of dexterity, compactness, and simplicity for advanced magnetic manipulation tasks.
|
| |
| 15:00-16:30, Paper TuI2I.184 | Add to My Program |
| Model Reconciliation through Explainability and Collaborative Recovery in Assistive Robotics |
|
| Besch, Britt | German Aerospace Center (DLR) |
| Mai, Tai | German Aerospace Center (DLR) |
| Thun, Jeremias | University Bremen |
| Huff, Markus | Leibniz-Institut für Wissensmedien |
| Vogel, Jörn | German Aerospace Center (DLR) |
| Stulp, Freek | DLR - Deutsches Zentrum für Luft- und Raumfahrt e.V. |
| Bustamante, Samuel | German Aeroespace Center (DLR), Robotics and Mechatronics Center (RMC) |
|
|
| |
| 15:00-16:30, Paper TuI2I.185 | Add to My Program |
| Diff-VIO: A Diffusion Model-Based Pose Optimizer for Visual Inertial Odometry |
|
| Qin, Wenyuan | Beihang University |
| Kong, Xiangxi | Beihang University |
| Zhang, Sizhuo | Beihang University |
| Xu, Hao | Beihang University |
| Dong, Xiwang | Beihang University |
Keywords: Visual-Inertial SLAM, Sensor Fusion, Computer Vision for Automation
Abstract: Visual inertial odometry (VIO) serves as a cornerstone of environmental perception and spatial localization, with broad applications in autonomous driving, robotic navigation, and embodied intelligence. Although recent deep learning based VIO methods have achieved impressive accuracy and computational efficiency, most approaches optimize errors within a maximum a posteriori (MAP) framework, often overlooking explicit prior modeling which constrains the upper bounds of achievable performance. To address this challenge, Diff-VIO is introduced, which is a VIO optimization framework grounded in diffusion models. An end-to-end coarse pose generator is first employed. It outputs an initial pose estimate and supplies priors for the diffusion refinement. To constrain the solution space, a diffusion-based refinement module injects pose priors during generation. This process is supported by a global context transformer encoder and a conditional decoder, which model long-range dependencies and predict residual noise for precise pose refinement. Experiments conducted on the KITTI benchmark demonstrate that the proposed method outperforms state-of-the-art VIO techniques in both accuracy and robustness. Additional evaluations on a dataset collected with an Intel RealSense D435i further validate the strong generalization capability of the proposed method across diverse hardware platforms. As the first diffusion-based VIO framework, Diff-VIO introduces a novel optimization paradigm for learning-based visual-inertial odometry systems.
|
| |
| 15:00-16:30, Paper TuI2I.186 | Add to My Program |
| Polymander II: An Amphibious Salamander-Inspired Robot with Contact and Flow Sensors |
|
| Fu, Qiyuan | EPFL |
| Lee, Sudong | EPFL |
| Grillo, Andrea | EPFL |
| Arreguit, Jonathan | EPFL |
| Gevers, Louis | EPFL |
| Hughes, Josie | EPFL |
| Ijspeert, Auke | EPFL |
Keywords: Biologically-Inspired Robots, Mechanism Design, Robotics and Automation in Life Sciences
Abstract: Robots benefit from sensory information to coordinate body movement, gain robustness against perturbations, and transition between different modes to adapt to various terrains. However, few amphibious robots can sense interactions with both terrestrial and aquatic environments. In this paper, we present a solution that uses Hall-effect sensors to sense foot contact forces and lateral hydrodynamic forces on a salamander-inspired amphibious robot. With two bus lines, the robot can simultaneously acquire this exteroceptive information at more than 500 Hz and proprioceptive information, such as joint positions and loads, at 100 Hz. The Hall-effect sensors used are compact, making them suitable for embedding in multiple positions within a robot, and exhibit high sensitivity to small forces. Moreover, because the sensor can be positioned separately from the measured object, waterproofing can be implemented with relative ease. Our tests demonstrate the robot's capabilities in traversing amphibious environments and its potential in using feedback control for more complex locomotion tasks.
|
| |
| 15:00-16:30, Paper TuI2I.187 | Add to My Program |
| VGGT-Long: Chunk It, Loop It, Align It -- Pushing VGGT's Limits on Kilometer-Scale Long RGB Sequences |
|
| Deng, Kai | Nankai University |
| Ti, Zexin | Nanjing University |
| Xu, Jiawei | Nankai University |
| Yang, Jian | Nanjing University |
| Xie, Jin | Nanjing University |
Keywords: Deep Learning for Visual Perception, Localization, Mapping
Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision, or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
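The overlapping-alignment step in the abstract hinges on estimating a similarity transform between reconstructions of the same overlap region. As an illustration only (not the authors' implementation; `umeyama` is a hypothetical helper name), the classic closed-form Umeyama alignment looks like this:

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity alignment: find (s, R, t) with dst ~ s*R@src + t.

    src, dst: (N, 3) corresponding points from the overlap of two chunks.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    var_s = ((src - mu_s) ** 2).sum() / len(src)       # source variance
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)     # cross-covariance
    U, d, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:       # avoid reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(d) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Applying the recovered transform to one chunk stitches it onto its neighbor; a loop-closure pass can then distribute any residual drift across the sequence.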
|
| |
| 15:00-16:30, Paper TuI2I.188 | Add to My Program |
| ManeuverNet: A Soft Actor-Critic Framework for Precise Maneuvering of Double-Ackermann-Steering Robots with Optimized Reward Functions |
|
| Deflesselle, Kohio | LaBRI - Université De Bordeaux |
| Daniel, Mélodie | LaBRI - Université De Bordeaux |
| Magassouba, Aly | University of Nottingham |
| Aranda, Miguel | Universidad De Zaragoza |
| Ly, Olivier | LaBRI - Bordeaux University |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Motion and Path Planning
Abstract: Autonomous control of double-Ackermann-steering robots is essential in agricultural applications, where robots must execute precise and complex maneuvers within a limited space. Classical methods, such as the Timed Elastic Band (TEB) planner, can address this problem, but they rely on parameter tuning, making them highly sensitive to changes in robot configuration or environment and impractical to deploy without constant recalibration. At the same time, end-to-end deep reinforcement learning (DRL) methods often fail due to unsuitable reward functions for non-holonomic constraints, resulting in sub-optimal policies and poor generalization. To address these challenges, this paper presents ManeuverNet, a DRL framework tailored for double-Ackermann systems, combining Soft Actor-Critic with CrossQ. Furthermore, ManeuverNet introduces four specifically designed reward functions to support maneuver learning. Unlike prior work, ManeuverNet does not depend on expert data or handcrafted guidance. We extensively evaluate ManeuverNet against both state-of-the-art DRL baselines and the TEB planner. Experimental results demonstrate that our framework substantially improves maneuverability and success rates, achieving more than a 40% gain over DRL baselines. Moreover, ManeuverNet effectively mitigates the strong parameter sensitivity observed in the TEB planner. In real-world trials, ManeuverNet achieved up to a 90% increase in maneuvering trajectory efficiency, highlighting its robustness and practical applicability.
|
| |
| 15:00-16:30, Paper TuI2I.189 | Add to My Program |
| TEMPO-VINE: A Multi-Temporal Sensor Fusion Dataset for Localization and Mapping in Vineyards |
|
| Martini, Mauro | Politecnico di Torino |
| Ambrosio, Marco | Politecnico di Torino |
| Vilella-Cantos, Judith | University Institute for Engineering Research, Miguel Hernández University |
| Navone, Alessandro | Politecnico di Torino |
| Chiaberge, Marcello | Politecnico di Torino |
|
|
| |
| 15:00-16:30, Paper TuI2I.190 | Add to My Program |
| Frequency-Guided 3D Gaussian Splatting for Challenging Low-Light View Synthesis |
|
| Mai, Zhaoyuan | Guangdong University of Technology |
| Zeng, Bi | Guangdong University of Technology |
| Zhang, Boquan | Guangdong University of Technology |
| Zeng, Tianle | Guangdong University of Technology |
| Lu, Jingxuan | Guangdong University of Technology |
| Feng, Jiarong | Guangdong University of Technology |
Keywords: Computer Vision for Automation, Computer Vision for Transportation, Visual Learning
Abstract: Robust 3D scene understanding is crucial for autonomous robots, but degrades sharply in low-light environments where sensor noise and illumination inconsistencies corrupt visual inputs. Even 3D Gaussian Splatting (3DGS), while efficient for real-time reconstruction, produces unstable and artifact-prone results under such conditions, limiting its reliability for navigation and mapping. To address these challenges, we propose a 3DGS-based framework for reconstructing clear scenes under low-light conditions. First, we employ a frequency-aware modulator that operates on spectral components to decouple and suppress sensor noise from structural signals, providing a clean input for reconstruction. To refine the 3D model and ensure its compactness for onboard deployment, we introduce an adaptive denoising mask guided by dynamically updated statistics of rendering contribution and stability, which filters transient artifacts caused by sensor noise. Finally, a multi-view frequency consistency constraint is enforced to ensure the global coherence of the reconstructed model's appearance, which is critical for consistent mapping. Experiments on challenging low-light datasets demonstrate that our method achieves state-of-the-art reconstruction quality while significantly reducing model storage by approximately 46.4% and maintaining real-time rendering speeds.
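The frequency-domain idea here — structural content concentrated at low spatial frequencies, sensor noise spread across high ones — can be sketched with a plain FFT mask. This is a generic illustration, not the paper's frequency-aware modulator, and the cutoff/attenuation values are arbitrary assumptions:

```python
import numpy as np

def frequency_suppress(img, cutoff=0.25, atten=0.1):
    """Keep low spatial frequencies, damp the rest, and transform back.

    img: 2-D grayscale array; cutoff: radius as a fraction of Nyquist;
    atten: gain applied to frequencies outside the cutoff.
    """
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    mask = np.where(np.hypot(yy, xx) <= cutoff, 1.0, atten)   # radial mask
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
```

A learned modulator would replace the fixed radial mask with data-dependent per-frequency gains, but the decompose-reweight-reconstruct structure is the same.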
|
| |
| 15:00-16:30, Paper TuI2I.191 | Add to My Program |
| Annotation Free Spacecraft Detection and Segmentation Using Vision Language Models |
|
| Hicsonmez, Samet | University of Luxembourg |
| Sosa, Jose | SnT, University of Luxembourg |
| Pineau, Dan | Université Du Luxembourg |
| Singh, Inder Pal | University of Luxembourg |
| Rathinam, Arunkumar | University of Luxembourg |
| Shabayek, Abd El Rahman | SnT, University of Luxembourg, Luxembourg |
| Aouada, Djamila | SnT, University of Luxembourg |
Keywords: Space Robotics and Automation, Aerial Systems: Perception and Autonomy, Deep Learning for Visual Perception
Abstract: Vision Language Models (VLMs) have demonstrated remarkable performance in open-world zero-shot visual recognition. However, their potential in space-related applications remains largely unexplored. In the space domain, accurate manual annotation is particularly challenging due to factors such as low visibility, illumination variations, and object blending with planetary backgrounds. Developing methods that can detect and segment spacecraft and orbital targets without requiring extensive manual labeling is therefore of critical importance. In this work, we propose an annotation-free detection and segmentation pipeline for space targets using VLMs. Our approach begins by automatically generating pseudo-labels for a small subset of unlabeled real data with a pre-trained VLM. These pseudo-labels are then leveraged in a teacher–student label distillation framework to train lightweight models. Despite the inherent noise in the pseudo-labels, the distillation process leads to substantial performance gains over direct zero-shot VLM inference. Experimental evaluations on the SPARK-2024, SPEED+, and TANGO datasets on segmentation tasks demonstrate consistent improvements in average precision (AP) by up to 10 points. Code and models are available at https://github.com/giddyyupp/annotation-free-spacecraft-segmentation.
|
| |
| 15:00-16:30, Paper TuI2I.192 | Add to My Program |
| CLUE: Crossmodal Disambiguation Via Language-Vision Understanding with AttEntion |
|
| Abrini, Mouad | Sorbonne University |
| Chetouani, Mohamed | Sorbonne University |
Keywords: Multi-Modal Perception for HRI, Intention Recognition, Natural Dialog for HRI
Abstract: With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text to image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG, and a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: mouadabrini.github.io/clue/
|
| |
| 15:00-16:30, Paper TuI2I.193 | Add to My Program |
| Hyper-STTN: Hypergraph Augmented Spatial-Temporal Transformer for Trajectory Prediction |
|
| Wang, Weizheng | Purdue University |
| Yang, Baijian | Purdue University |
| Hong, Sungeun | Sungkyunkwan University |
| Sun, Wenhai | Purdue University |
| Min, Byung-Cheol | Indiana University |
Keywords: Human Detection and Tracking, Intention Recognition
Abstract: Predicting crowd intentions and trajectories is critical for a range of real-world applications, involving social robotics and autonomous driving. Accurately modeling such behavior remains challenging due to the complexity of pairwise spatial-temporal interactions and the heterogeneous influence of groupwise dynamics. To address these challenges, we propose Hyper-STTN, a Hypergraph-augmented Spatial-Temporal Transformer Network for crowd trajectory prediction. Hyper-STTN constructs crowd hypergraphs with multiscale group sizes to model groupwise correlations, captured through spectral hypergraph convolution based on hypergraph random walk. In parallel, a spatial-temporal transformer is employed to learn pedestrians’ pairwise latent interactions across multimodal dimensions. Finally, these heterogeneous groupwise and pairwise features are incorporated and aligned via a multimodal transformer. Extensive experiments on public pedestrian motion datasets demonstrate that Hyper-STTN consistently outperforms state-of-the-art baselines and ablation models. The project website is available at https://sites.google.com/view/hypersttn.
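The spectral hypergraph convolution used for groupwise features can be illustrated with the standard normalized propagation X' = Dv^(-1/2) H De^(-1) H^T Dv^(-1/2) X Θ. This is a generic HGNN-style sketch under the assumption of unit hyperedge weights, not the authors' code:

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """One normalized hypergraph convolution layer.

    X: (N, F) node features; H: (N, E) binary incidence matrix
    (H[i, e] = 1 if pedestrian i belongs to group hyperedge e);
    Theta: (F, F_out) weight matrix. Unit hyperedge weights assumed.
    """
    Dv = np.clip(H.sum(axis=1), 1, None)    # node degrees
    De = np.clip(H.sum(axis=0), 1, None)    # hyperedge degrees
    Hn = H / np.sqrt(Dv)[:, None]
    A = Hn @ np.diag(1.0 / De) @ Hn.T       # propagation operator
    return A @ X @ Theta
```

Because A mixes features across all members of a hyperedge, each pedestrian's embedding absorbs information from every group it belongs to, which is what makes multiscale group hyperedges useful.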
|
| |
| 15:00-16:30, Paper TuI2I.194 | Add to My Program |
| Beyond Scalar Rewards: Distributional Reinforcement Learning with Preordered Objectives for Safe and Reliable Autonomous Driving |
|
| Abouelazm, Ahmed | FZI Forschungszentrum Informatik |
| Michel, Jonas | Karlsruhe Institute of Technology |
| Bogdoll, Daniel | FZI Research Center for Information Technology |
| Schörner, Philip | FZI Research Center for Information Technology |
| Zöllner, Johann Marius | FZI Forschungszentrum Informatik |
Keywords: Intelligent Transportation Systems, Reinforcement Learning, Motion and Path Planning
Abstract: Autonomous driving involves multiple, often conflicting objectives such as safety, efficiency, and comfort. In reinforcement learning (RL), these objectives are typically combined through weighted summation, which collapses their relative priorities and often yields policies that violate safety-critical constraints. To overcome this limitation, we introduce the Preordered Multi-Objective MDP (Pr-MOMDP), which augments standard MOMDPs with a preorder over reward components. This structure enables reasoning about actions with respect to a hierarchy of objectives rather than a scalar signal. To make this structure actionable, we extend distributional RL with a novel pairwise comparison metric, Quantile Dominance (QD), that evaluates action return distributions without reducing them to a single statistic. Building on QD, we propose an algorithm for extracting optimal subsets, the subset of actions that remain non-dominated under each objective, which allows precedence information to shape both decision-making and training targets. Our framework is instantiated with Implicit Quantile Networks (IQN), establishing a concrete implementation while preserving compatibility with a broad class of distributional RL methods. Experiments in CARLA show improved success rates and fewer collisions and off-road events, and yield statistically more robust policies than IQN and ensemble-IQN baselines. By ensuring that policies respect the reward preorder, our work advances safer, more reliable autonomous driving systems.
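Comparing return distributions by their quantile estimates, rather than a single statistic, can be sketched as below. This is a simplified reading of the quantile-dominance idea (the paper's exact metric and thresholding may differ), together with extraction of the non-dominated action subset:

```python
import numpy as np

def quantile_dominance(qa, qb):
    """Fraction of quantile levels at which action a's return quantile
    strictly exceeds action b's; 1.0 means a dominates b on every level."""
    return float(np.mean(np.asarray(qa) > np.asarray(qb)))

def non_dominated(Q, thresh=1.0):
    """Optimal subset: actions not dominated by any other action.

    Q: (num_actions, num_quantiles) per-action quantile estimates,
    e.g. the output heads of an IQN evaluated at fixed quantile levels.
    """
    n = len(Q)
    return [i for i in range(n)
            if not any(quantile_dominance(Q[j], Q[i]) >= thresh
                       for j in range(n) if j != i)]
```

Applied per objective in precedence order, such a filter lets a higher-priority objective (e.g. safety) prune the action set before lower-priority objectives are consulted.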
|
| |
| 15:00-16:30, Paper TuI2I.195 | Add to My Program |
| MLA: A Multisensory Language–Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation |
|
| Liu, Zhuoyang | Peking University |
| Liu, Jiaming | Peking University |
| Xu, Jiadong | University of Chinese Academy of Sciences |
| Han, Nuowei | Beijing University of Posts and Telecommunications |
| Gu, Chenyang | Peking University |
| Chen, Hao | The Chinese University of Hong Kong |
| Zhou, Kaichen | University of Oxford |
| Zhang, Renrui | CUHK |
| Hsieh, Kai Chin | Peking University |
| Wu, Kun | Syracuse University |
| Che, Zhengping | X-Humanoid |
| Tang, Jian | Midea Group (Shanghai) Co., Ltd |
| Zhang, Shanghang | Peking University |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Representation Learning
Abstract: Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language–action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA’s understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: sites.google.com/view/robot-mla.
|
| |
| 15:00-16:30, Paper TuI2I.196 | Add to My Program |
| UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data |
|
| Yang, Sizhe | The Chinese University of Hong Kong |
| Xie, Yiman | Zhejiang University |
| Liang, Zhixuan | The University of Hong Kong |
| Tian, Yang | Peking University |
| Zeng, Jia | Shanghai AI Laboratory |
| Lin, Dahua | The Chinese University of Hong Kong |
| Pang, Jiangmiao | Shanghai AI Laboratory |
Keywords: Grasping, Dexterous Manipulation, Bimanual Manipulation
Abstract: Grasping is a fundamental capability for robots to interact with the physical world. Humans, equipped with two hands, autonomously select appropriate grasp strategies based on the shape, size, and weight of objects, enabling robust grasping and subsequent manipulation. In contrast, current robotic grasping remains limited, particularly in multi-strategy settings. Although substantial efforts have targeted parallel-gripper and single-hand grasping, dexterous grasping for bimanual robots remains underexplored, with data being a primary bottleneck. Achieving physically plausible and geometrically conforming grasps that can withstand external wrenches poses significant challenges. To address these issues, we introduce UltraDexGrasp, a framework for universal dexterous grasping with bimanual robots. The proposed data-generation pipeline integrates optimization-based grasp synthesis with planning-based demonstration generation, yielding high-quality and diverse trajectories across multiple grasp strategies. With this framework, we curate UltraDexGrasp-20M, a large-scale, multi-strategy grasp dataset comprising 20 million frames across 1,000 objects. Based on UltraDexGrasp-20M, we further develop a simple yet effective grasp policy that takes point clouds as input, aggregates scene features via unidirectional attention, and predicts control commands. Trained exclusively on synthetic data, the policy achieves robust zero-shot sim-to-real transfer and consistently succeeds on novel objects with varied shapes, sizes, and weights, attaining an average success rate of 81.2% in real-world universal dexterous grasping. To facilitate future research on grasping with bimanual robots, we open-source the data generation pipeline at https://github.com/InternRobotics/UltraDexGrasp.
|
| |
| 15:00-16:30, Paper TuI2I.197 | Add to My Program |
| SAFL-Geo: Structure-Aware Feature Learning with Fusion Loss for Infrared-Visible Geo-Localization |
|
| Shen, Jiabo | Northeastern University |
| Zhao, Shuying | Northeastern University |
| Zhang, Yunzhou | Northeastern University |
| Zhang, Tengda | Northeastern University |
| Zhou, Hongyu | Northeastern University |
| Zhang, Yu | Northeastern University |
| Gao, Jiaxu | Northeastern University |
Keywords: Localization, Deep Learning for Visual Perception
Abstract: Cross-modal Visual Geo-localization typically aims to retrieve a satellite visible-light image of the same geographic location from a large-scale database using an infrared image captured by an unmanned aerial vehicle (UAV), thereby achieving precise localization. This capability is crucial for autonomous drone localization and navigation in low-light conditions such as nighttime or smoky environments. However, research in this field is still in its nascent stage, with existing methods being few in number and limited in precision. To address these issues, this paper proposes a structure-aware and fusion-loss constrained cross-modal geo-localization network (SAFL-Geo), which enhances the accuracy of cross-modal image retrieval. Specifically, we design a structure-aware module embedded into the network backbone, substantially enhancing the model’s ability to perceive and extract cross-modally consistent structural features (such as road and building contours). Furthermore, we propose a feature enhancement and aggregation module that projects the refined multi-modal representations into a unified embedding space, effectively reducing the cross-modal representation gap while preserving discriminative semantic structures. Finally, we propose a fusion loss constraint strategy that constructs intermediate fused features as a “bridge” to constrain the distribution distances between infrared and fused features, as well as between visible and fused features, thereby indirectly mitigating the modality gap. Extensive experiments on the Boson datasets show that our SAFL-Geo achieves state-of-the-art performance.
|
| |
| 15:00-16:30, Paper TuI2I.198 | Add to My Program |
| Tactile Memory for Continuous Policy Blending in Unified Force-Impedance Control |
|
| Karacan, Kübra | Technical University of Munich |
| Demir, Ayça | Sabancı University |
| Tosun, Doğukan | Sabancı University |
| Söğüt, Feyza Nur | Munich Institute of Robotics and Machine Intelligence, Technical University of Munich |
| Kirschner, Robin Jeanne | TU Munich, Institute for Robotics and Systems Intelligence |
| Sadeghian, Hamid | Technical University of Munich |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Intelligent and Flexible Manufacturing, Control Architectures and Programming, Perception for Grasping and Manipulation
Abstract: As of today, automating contact-rich industrial manipulation processes, such as insertion, plugging, and screw-driving, is tedious and requires expert knowledge. The processes consist of programmable, common action units, like moving to a pose and establishing contact. However, the user still has to decide on fixed transition conditions to successfully complete each sub-action. Instead, we introduce a tactile memory-driven policy blending framework based on unified force-impedance control to transition autonomously. At the core of our approach lies a structured representation of manipulation as a sequence of basic operations combined into any relevant process, each governed by real-time sensory feedback and annotated with process quality metrics (PQMs), capturing motion, force, and energy-level interactions. A bidirectional long short-term memory (BiLSTM) model encodes recent PQM histories to determine basic operation success. Then, soft blending weights are generated, allowing smooth, adaptive transitions between operations without manual phase definition. To ensure functional safety during contact, we integrate an energy tank mechanism that enforces passivity by regulating energy exchange. The resulting control scheme enables robust and continuous tactile manipulation across variations in object geometry and spatial configurations. Experimental validation across four processes over five objects and two position variants demonstrates successful transfer and resilience to position disturbances. Our findings highlight that learned tactile memory and quality feedback embedded in the control loop serve as a principled foundation for intelligent and transferable manipulation, allowing fully autonomous process planning and execution in the future.
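The soft-transition idea — a learned success estimate smoothly shifting control authority from the current basic operation to the next — reduces to a convex blend of commands. A toy sketch (the sigmoid gating and the `blend_commands` helper are illustrative assumptions, not the paper's formulation; in the paper the score would come from the BiLSTM over PQM histories):

```python
import math

def blend_commands(u_current, u_next, success_score, k=10.0, s0=0.5):
    """Convex blend of two operation commands gated by a success estimate.

    success_score in [0, 1]: here a plain scalar stand-in for the BiLSTM
    output. k and s0 shape the sigmoid that turns it into a blend weight.
    """
    w = 1.0 / (1.0 + math.exp(-k * (success_score - s0)))   # soft weight
    return [(1.0 - w) * uc + w * un for uc, un in zip(u_current, u_next)]
```

Because the weight varies continuously with the score, the handover between operations avoids the discontinuities a hard phase switch would introduce into the force-impedance controller.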
|
| |
| 15:00-16:30, Paper TuI2I.199 | Add to My Program |
| Judo: A User-Friendly Open-Source Package for Sampling-Based Model Predictive Control |
|
| Li, Albert H. | California Institute of Technology |
| Zhang, John | Massachusetts Institute of Technology |
| Bruedigam, Jan | Technical University of Munich |
| Hung, Brandon | Robotics and AI Institute |
| Ames, Aaron | California Institute of Technology |
| Wang, Jiuguang | Boston Dynamics AI Institute |
| Le Cleac'h, Simon | Stanford University |
| Culbertson, Preston | Cornell University |
Keywords: Software Architecture for Robotics and Automation, Software Tools for Benchmarking and Reproducibility, Multi-Contact Whole-Body Motion Planning and Control
Abstract: Sampling-based model predictive control (MPC) is experiencing a resurgence in robotics following both recent hardware successes and advancements in parallelized physics simulation. However, to build on this progress, the robotics community needs to develop shared tools for prototyping, benchmarking, and deploying sampling-based controllers. We introduce judo, a software package designed to address this need. To facilitate rapid prototyping and evaluation, judo provides robust implementations of common sampling-based MPC algorithms and a comprehensive suite of benchmark tasks. It emphasizes usability with simple but extensible interfaces for controller and task definitions, asynchronous execution for straightforward simulation-to-hardware transfer, and a highly customizable interactive GUI for tuning controllers interactively. While the high-level library is written in Python, judo leverages MuJoCo as its physics backend to achieve real-time performance. We present example benchmarking results using judo to compare standard sampling-based controllers across its tasks. We also provide real-world case studies in deploying judo on hardware for two contact-rich tasks: in-hand cube rotation and quadrupedal loco-manipulation. Code at https://github.com/bdaiinstitute/judo.
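The core loop of a sampling-based MPC controller of the kind judo implements can be written in a few lines: sample perturbed action sequences around a nominal plan, roll them out in a model, keep the cheapest, execute the first action, and warm-start the next step. The following is a toy 1-D double-integrator stand-in, not judo's API:

```python
import numpy as np

def rollout_cost(x, v, actions, dt=0.1):
    """Quadratic cost of an action sequence on a 1-D double integrator
    (a toy stand-in for a physics backend such as MuJoCo)."""
    cost = 0.0
    for a in actions:
        v += a * dt
        x += v * dt
        cost += x * x + 0.1 * v * v + 0.001 * a * a
    return cost

def predictive_sampling(x, v, horizon=20, n_samples=64, iters=40, seed=0):
    """Sample plans around a nominal, keep the best, execute the first
    action, then warm-start the next MPC step by shifting the plan."""
    rng = np.random.default_rng(seed)
    nominal = np.zeros(horizon)
    for _ in range(iters):
        cand = np.vstack([nominal,                       # keep nominal
                          nominal + rng.normal(0.0, 0.5,
                                               (n_samples, horizon))])
        costs = [rollout_cost(x, v, c) for c in cand]
        nominal = cand[int(np.argmin(costs))]
        v += nominal[0] * 0.1                            # execute first action
        x += v * 0.1
        nominal = np.roll(nominal, -1)                   # shift warm start
        nominal[-1] = 0.0
    return x, v
```

Real systems run the sampling and rollouts asynchronously with the plant, which is the simulation-to-hardware split the abstract describes.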
|
| |
| 15:00-16:30, Paper TuI2I.200 | Add to My Program |
| Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering |
|
| Martens, Emil | Norwegian University of Science and Technology (NTNU), Skydio |
| Miller, Aaron | Skydio |
| Varnum, Matias | Skydio |
| Stahl, Annette | Norwegian University of Science and Technology (NTNU) |
Keywords: Mapping, Optimization and Optimal Control, Performance Evaluation and Benchmarking
Abstract: We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library, users can easily define and combine symbolic expressions, including Lie group operations, to generate custom CUDA kernels. To use Caspar as a solver, users need only define the symbolic residual functions; Caspar then uses symbolic differentiation to generate the necessary GPU kernels and interfaces to perform nonlinear optimization. In this paper, we present the core components of Caspar and showcase its performance by performing bundle adjustment on the Bundle Adjustment in the Large (BAL) dataset. We benchmark Caspar against other state-of-the-art bundle adjusters and show that it is 5 to 20 times faster than the best alternative, requires less memory, and achieves similar accuracy. This illustrates the benefit of our symbolic GPU programming approach. Caspar is released as part of SymForce and is freely available at https://github.com/symforce-org/symforce.
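Caspar's division of labor — the user writes only residual functions, the library derives Jacobians and runs the nonlinear solve — can be mimicked in miniature. The sketch below substitutes central finite differences for Caspar's symbolic differentiation and CPU NumPy for its generated CUDA kernels; all names are illustrative:

```python
import numpy as np

def numeric_jacobian(res, theta, eps=1e-6):
    """Central-difference Jacobian of a residual function (a stand-in for
    the differentiation Caspar performs symbolically ahead of time)."""
    r0 = res(theta)
    J = np.zeros((len(r0), len(theta)))
    for j in range(len(theta)):
        d = np.zeros_like(theta)
        d[j] = eps
        J[:, j] = (res(theta + d) - res(theta - d)) / (2 * eps)
    return J

def gauss_newton(res, theta, iters=10):
    """Users supply only the residual; the solver forms the normal
    equations and iterates theta <- theta - (J^T J)^-1 J^T r."""
    for _ in range(iters):
        r, J = res(theta), numeric_jacobian(res, theta)
        theta = theta - np.linalg.solve(J.T @ J, J.T @ r)
    return theta
```

Bundle adjustment is the same pattern at scale: millions of reprojection residuals, a sparse normal system, and per-residual kernels evaluated in parallel on the GPU.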
|
| |
| 15:00-16:30, Paper TuI2I.201 | Add to My Program |
| DynDLO: Learning-Based Trajectory Planning for Dynamic Robotic Manipulation of Deformable Linear Objects |
|
| Liuni, Daniele Maria | Politecnico Di Milano |
| Bartesaghi, Alessandro | Politecnico Di Milano |
| Monguzzi, Andrea | Leonardo, Innovation Hub |
| Miuccio, Alessandra | Royal Military Academy of Belgium |
| Zanchettin, Andrea Maria | Politecnico Di Milano |
| Rocco, Paolo | Politecnico Di Milano |
Keywords: Manipulation Planning, Reinforcement Learning, Deep Learning in Grasping and Manipulation
Abstract: The automatic manipulation of Deformable Linear Objects (DLOs) currently remains a challenge in robotics. Previous research on robotic DLO manipulation has primarily addressed quasi-static DLO manipulation at low speeds, leaving the potential of dynamic DLO manipulation largely unexplored. This paper introduces DynDLO, a goal-conditioned, 6-axis robot-independent Reinforcement Learning sandbox for training agents on a variety of DLO dynamic manipulation tasks. In DynDLO, a DLO attached to the robot Tool Center Point (TCP) is simulated in the MuJoCo environment. By employing a B-Spline based trajectory generation function, the agent is capable of learning single and multiple step trajectories for the TCP, which succeed in various DLO dynamic manipulation problems. Specifically, we propose tailored design strategies for the reward function according to the classification of tasks into implicit or explicit DLO shape control tasks. Experiments on four representative tasks demonstrate that DynDLO is capable of generating dynamic manipulation policies that transfer successfully from simulation to the real world, achieving high success rates without requiring real-world training.
|
| |
| 15:00-16:30, Paper TuI2I.202 | Add to My Program |
| HumanoidExo: Scalable Whole-Body Humanoid Manipulation Via Wearable Exoskeleton |
|
| Zhong, Rui | National University of Defense Technology |
| Sun, Yizhe | Midea Group |
| Wen, Junjie | Midea Group |
| Li, Jinming | Midea Group |
| Cheng, Chuang | National University of Defense Technology |
| Dai, Wei | National University of Defense Technology |
| Zeng, Zhiwen | National University of Defense Technology |
| Xu, Yi | Midea Group |
| Lu, Huimin | National University of Defense Technology |
| Zhu, Yichen | Midea Group |
Keywords: Humanoid Robot Systems
Abstract: A significant bottleneck in humanoid policy learning is the acquisition of large-scale, diverse datasets, as collecting reliable real-world data remains both difficult and cost-prohibitive. To address this limitation, we introduce HumanoidExo, a novel system that transfers human motion to whole-body humanoid data. HumanoidExo offers a high-efficiency solution that minimizes the embodiment gap between the human demonstrator and the robot, thereby tackling the scarcity of whole-body humanoid data. By facilitating the collection of more voluminous and diverse datasets, our approach significantly enhances the performance of humanoid robots in dynamic, real-world scenarios. We evaluated our method across three challenging real-world tasks: table-top manipulation, manipulation integrated with stand-squat motions, and whole-body manipulation. Our results empirically demonstrate that HumanoidExo is a crucial addition to real-robot data, as it enables the humanoid policy to generalize to novel environments, learn complex whole-body control from only five real-robot demonstrations, and even acquire new skills (i.e., walking) solely from HumanoidExo data.
|
| |
| 15:00-16:30, Paper TuI2I.203 | Add to My Program |
| Velocity-Based Admittance-Impedance Control with Contact Compliance Modeling for Robust Dual-Arm Manipulation |
|
| Dubey, Samriddhi | Indian Institute of Technology Gandhinagar |
| Kashiv, Yash | Indian Institute of Technology Gandhinagar |
| Kumar, Shreyas | Indian Institute of Science Bengaluru |
| Jain, Siddhi | Addverb Technologies |
| Kumar, Rajesh | Addverb Technologies |
| Palanthandalam-Madapusi, Harish | Indian Institute of Technology Gandhinagar |
Keywords: Dual Arm Manipulation, Force Control, Contact Modeling
Abstract: Many industrial and commercial manipulators provide only position and velocity control interfaces, making direct regulation of contact forces challenging. In dual-arm manipulation, this limitation prevents stable force closure and consistent control of the object wrench. We present a control framework that combines contact-level admittance and object-level impedance to compute velocity commands for both arms. The contact admittance law maps force errors into velocity corrections, while the object impedance relation regulates the net wrench on the object. Together, these laws generate joint velocities through the stacked Jacobian, ensuring consistent integration of force and motion objectives. Contact compliance is explicitly modeled using linear spring–damper elements. The analysis of closed-loop error dynamics shows how the stiffness and damping parameters of the contact compliance influence the frequency response of the error dynamics and explains the origin of high-frequency oscillations in the presence of sensor noise. Experiments with a dual-arm setup with two heterogeneous velocity-controlled manipulators validate the framework. Results confirm accurate force regulation, disturbance rejection, and stable cooperative lifting under different contact padding conditions. The proposed approach establishes a velocity-based method for dual-arm force closure with contact compliance.
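The contact-level admittance idea the abstract describes — mapping a force-tracking error into a velocity correction that a velocity-controlled arm can execute — can be sketched in a few lines. This is an illustrative toy under assumed values, not the authors' controller; the gain matrix `K_a` and all signals are hypothetical placeholders.

```python
import numpy as np

def admittance_velocity(v_ref, f_desired, f_measured, K_a):
    """Return a velocity command: reference motion plus a force-error correction."""
    force_error = f_desired - f_measured      # wrench tracking error
    return v_ref + K_a @ force_error          # admittance: force error -> velocity

v_ref = np.zeros(3)                           # arm nominally holding position
K_a = 0.01 * np.eye(3)                        # compliant gain (m/s per N), made up
v = admittance_velocity(v_ref, np.array([10.0, 0.0, 0.0]),
                        np.array([6.0, 0.0, 0.0]), K_a)
print(v)  # pushes along +x until the measured contact force reaches 10 N
```

In the paper such corrections are mapped to joint velocities through the stacked Jacobian; here only the Cartesian-level law is shown.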
|
| |
| 15:00-16:30, Paper TuI2I.204 | Add to My Program |
| AutoPercep: A Pipeline for Onboard Neighbor Position Estimation Toward Large-Scale Swarm Robotics |
|
| Wu, Ruiheng | University of Konstanz |
| Atasoy Bingöl, Simay | University of Konstanz |
| Deussen, Oliver | University of Konstanz |
| Hamann, Heiko | University of Konstanz |
| Couzin, Iain D. | Max Planck Institute of Animal Behavior |
| Reina, Andreagiovanni | Universität Konstanz & Max Planck Institute of Animal Behavior |
| Li, Liang | Max-Planck Institute of Animal Behavior |
Keywords: Swarm Robotics, Embedded Systems for Robotic and Automation, Range Sensing
Abstract: Autonomous mobile robots must know each other's positions to coordinate their actions and motion. Beyond collision avoidance, relative position estimation is essential for spatial coordination tasks such as collective motion, leader–follower dynamics, or formation control. To overcome the scalability and resilience issues of centralized orchestrators that transmit real-time positional information to every robot, we study mechanisms of onboard vision sensing. Conventional localization methods, such as SLAM, are typically too computationally demanding for real-time use on small, resource-constrained mobile robots. Vision-based neural networks offer a promising alternative but often require large, high-quality datasets that are expensive to collect. We present AutoPercep, a pipeline that automatically generates training data and trains a lightweight neural network to estimate neighbor positions. Robots capture camera images that are automatically labeled using ground-truth data from a motion-capture system. In our experiments, AutoPercep collected over 10,000 high-quality images within 10 minutes and trained a neural network in about 1 hour, which could be deployed on Raspberry Pi 4B–based robots for onboard neighbor detection. Moreover, we show that a network trained on five robots generalizes to seven-robot deployments. We finally evaluate the trained model in a sequential leader-follower case study. Our end-to-end pipeline demonstrates the feasibility and low cost of onboard, vision-based neighbor perception, supporting scalability to large robot swarms and opening opportunities for deployment beyond laboratory settings. The code for training and evaluation is available at https://github.com/preon7/autopercep
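The auto-labeling step the abstract mentions — turning motion-capture ground truth into image labels — amounts to projecting a neighbor's world position into the observer's camera. The pinhole-projection sketch below illustrates that geometry; the intrinsics `K`, poses, and positions are made-up values, not AutoPercep's actual calibration or code.

```python
import numpy as np

def project_neighbor(p_world, R_wc, t_wc, K):
    """Project a 3D world point into pixel coordinates (pinhole camera model)."""
    p_cam = R_wc.T @ (p_world - t_wc)    # world frame -> camera frame
    uvw = K @ p_cam                      # perspective projection
    return uvw[:2] / uvw[2]              # normalize by depth

K = np.array([[500.0, 0.0, 320.0],       # toy intrinsics: focal 500 px,
              [0.0, 500.0, 240.0],       # principal point (320, 240)
              [0.0, 0.0, 1.0]])
R_wc = np.eye(3)                         # camera aligned with world axes
t_wc = np.zeros(3)                       # camera at the world origin
pixel = project_neighbor(np.array([0.0, 0.0, 2.0]), R_wc, t_wc, K)
print(pixel)  # a neighbor 2 m straight ahead lands at the principal point
```

Each projected pixel, paired with the captured image, yields one training label without any manual annotation.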
|
| |
| 15:00-16:30, Paper TuI2I.205 | Add to My Program |
| Benchmarking the Effects of Object Pose Estimation and Reconstruction on Robotic Grasping Success |
|
| Burde, Varun | Czech Institute of Informatics, Robotics and Cybernetics, CTU in Prague |
| Burget, Pavel | Czech Technical University in Prague |
| Sattler, Torsten | Czech Technical University in Prague |
|
|
| |
| 15:00-16:30, Paper TuI2I.206 | Add to My Program |
| PACE: Physics Augmentation for Coordinated End-To-End Reinforcement Learning Toward Versatile Humanoid Table Tennis |
|
| Hu, Muqun | Purdue University |
| Chen, Wenxi | Purdue University |
| Li, Wenjing | Purdue University |
| Mandali, Falak | Purdue University |
| He, Zijian | Purdue University |
| Zhang, Renhong | Purdue University |
| Krisna, Praveen | Purdue University |
| Christian, Katherine | Purdue University |
| Benaharon, Leo | Purdue University |
| Ma, Dizhi | Purdue University |
| Ramani, Karthik | Purdue University |
| Gu, Yan | Purdue University |
Keywords: Humanoid Robot Systems, Whole-Body Motion Planning and Control, Reinforcement Learning
Abstract: Humanoid table tennis (TT) demands rapid perception, proactive whole-body motion, and agile footwork under strict timing—capabilities that remain difficult for end-to-end control policies. We propose a reinforcement learning (RL) framework that maps ball-position observations directly to whole-body joint commands for both arm striking and leg locomotion, strengthened by predictive signals and dense, physics-guided rewards. A lightweight learned predictor, fed with recent ball positions, estimates future ball states and augments the policy’s observations for proactive decision-making. During training, a physics-based predictor supplies precise future states to construct dense, informative rewards that lead to effective exploration. The resulting policy attains strong performance across varied serve ranges (hit rate ≥ 96% and success rate ≥ 92%) in simulations. Ablation studies confirm that both the learned predictor and the predictive reward design are critical for end-to-end learning. Deployed zero-shot on a physical Booster T1 humanoid with 23 revolute joints, the policy produces coordinated lateral and forward–backward footwork with accurate, fast returns, suggesting a practical path toward versatile, competitive humanoid TT. We have open-sourced our RL training code at: https://github.com/purdue-tracelab/TTRL-ICRA2026
|
| |
| 15:00-16:30, Paper TuI2I.207 | Add to My Program |
| DAGDiff: Guiding Dual-Arm Grasp Diffusion to Stable and Collision-Free Grasps |
|
| Karim, Md Faizal | IIIT Hyderabad |
| Vembar, Vignesh | International Institute of Information Technology - Hyderabad |
| Patra, Keshab | Indian Institute of Technology Bombay |
| Singh, Gaurav | Brown University |
| Krishna, Madhava | IIIT Hyderabad |
Keywords: Deep Learning in Grasping and Manipulation, Grasping
Abstract: Reliable dual-arm grasping is essential for manipulating large and complex objects but remains a challenging problem due to stability, collision, and generalization requirements. Prior methods typically decompose the task into two independent grasp proposals, relying on region priors or heuristics that limit generalization and provide no principled guarantee of stability. We propose DAGDiff, an end-to-end framework that directly denoises to grasp pairs in the SE(3) × SE(3) space. Our key insight is that stability and collision can be enforced more effectively by guiding the diffusion process with classifier signals, rather than relying on explicit region detection or object priors. To this end, DAGDiff integrates geometry-, stability-, and collision-aware guidance terms that steer the generative process toward grasps that are physically valid and force-closure compliant. We comprehensively evaluate DAGDiff through analytical force-closure checks, collision analysis, and large-scale physics-based simulations, showing consistent improvements over previous work on these metrics. Finally, we demonstrate that our framework generates dual-arm grasps directly on real-world point clouds of previously unseen objects, which are executed on a heterogeneous dual-arm setup where two manipulators reliably grasp and lift them.
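The guidance mechanism the abstract relies on — steering each denoising step with the gradient of a classifier-style score — can be illustrated with a minimal toy sampler. Everything below is a stand-in under stated assumptions: the "score model" and "stability" guidance are simple quadratic surrogates, not DAGDiff's networks or its SE(3) × SE(3) parameterization.

```python
import numpy as np

def guidance_grad(x):
    """Gradient of a toy guidance score that prefers x near 0 (a 'stable' region)."""
    return -x

def guided_denoise(x, n_steps=50, step=0.1, guide_weight=0.5, rng=None):
    """Denoise with an extra guidance term added to each update."""
    rng = rng if rng is not None else np.random.default_rng(0)
    for _ in range(n_steps):
        denoise_dir = -x                             # toy score model: pull toward data mean
        x = x + step * (denoise_dir + guide_weight * guidance_grad(x))
        x = x + 0.01 * rng.standard_normal(x.shape)  # small injected noise per step
    return x

x0 = np.array([3.0, -2.0])                           # initial noisy grasp parameters
xf = guided_denoise(x0)
print(np.linalg.norm(xf) < np.linalg.norm(x0))       # guidance pulled the sample inward
```

The design point is that the generative update and the guidance term are simply summed, so stability or collision objectives can be swapped in without retraining the diffusion model.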
|
| |
| 15:00-16:30, Paper TuI2I.208 | Add to My Program |
| Experimental Comparison of Kinematic Task-Priority Control Methods for an Articulated Intervention-AUV |
|
| Sæbø, Bjørn Kåre | Norwegian University of Science and Technology (NTNU) |
| Iversflaten, Markus H. | Norwegian University of Science and Technology (NTNU) |
| Pettersen, Kristin Y. | Norwegian University of Science and Technology |
| Gravdahl, Jan Tommy | Norwegian University of Science and Technology |
Keywords: Redundant Robots, Marine Robotics, Kinematics
Abstract: This work revisits two classical closed-loop inverse kinematics (CLIK) formulations for hierarchical control and investigates their differences in the context of articulated intervention-autonomous underwater vehicles (AIAUVs). The class of AIAUVs consists of free-floating, slender, multi-link vehicles with distributed thrusters and no distinct base, allowing the entire vehicle to be modeled and controlled as a manipulator. The concept of body-velocity sharing, a phenomenon where different tasks depend on overlapping body-frame motions, is introduced and formalized through the notion of body-sensitivity subspaces. Changing the location of the system’s body-frame is shown to directly affect both controllers’ closed-loop performance, and it is shown that due to body-velocity sharing, tasks for AIAUVs most often fall into an intermediate regime between orthogonal and strictly incompatible tasks, causing the two task-priority formulations to differ. The theory is validated through open-water field trials with the Eelume-M, a 6-meter-long AIAUV, comparing the two control laws. The experiments confirm the theoretical predictions: the projected-residual law improves secondary-task tracking but is more sensitive to algorithmic singularities, whereas the post-projection law remains robust to such singularities at the cost of reduced secondary-task performance. These results provide practical guidelines for selecting kinematic task-priority control laws and body-frame placement for AIAUVs.
|
| |
| 15:00-16:30, Paper TuI2I.209 | Add to My Program |
| Kilometer-Scale GNSS-Denied UAV Navigation Via Heightmap Gradients: A Winning System from the SPRIN-D Challenge |
|
| Werner, Michal | Czech Technical University in Prague |
| Čapek, David | Czech Technical University in Prague |
| Musil, Tomáš | Czech Technical University in Prague |
| Franek, Ondrej | Czech Technical University in Prague |
| Baca, Tomas | Czech Technical University in Prague |
| Saska, Martin | Czech Technical University in Prague |
Keywords: Field Robots, Aerial Systems: Perception and Autonomy, Localization
Abstract: Reliable long-range flight of unmanned aerial vehicles (UAVs) in GNSS-denied environments is challenging: integrating odometry leads to drift, loop closures are unavailable in previously unseen areas, and embedded platforms provide limited computational power. We present a fully onboard UAV system developed for the SPRIN-D Funke Fully Autonomous Flight Challenge, which required 9 km long-range waypoint navigation below 25 m AGL (Above Ground Level) without GNSS or prior dense mapping. The system integrates perception, mapping, planning, and control with a lightweight drift-correction method that matches LiDAR-derived local heightmaps to a prior geodata heightmap via gradient-template matching and fuses the evidence with odometry in a clustered particle filter. Deployed during the competition, the system executed kilometer-scale flights across urban, forest, and open-field terrain and reduced drift substantially relative to raw odometry, while running in real time on CPU-only hardware. We describe the system architecture, the localization pipeline, and the competition evaluation, and we report practical insights from field deployment that inform the design of GNSS-denied UAV autonomy.
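The gradient-template matching step can be pictured as scoring candidate shifts of a local heightmap's gradient against windows of a prior heightmap's gradient, with the best-scoring shift serving as a drift-correction hypothesis. The sketch below uses tiny synthetic maps and exhaustive integer shifts purely for illustration; it is not the competition system's implementation.

```python
import numpy as np

def grad_mag(h):
    """Gradient-magnitude image of a heightmap."""
    gy, gx = np.gradient(h)
    return np.hypot(gx, gy)

def best_shift(local_h, prior_h, search=3):
    """Exhaustively score integer shifts by correlating gradient magnitudes."""
    g_local = grad_mag(local_h)
    n = local_h.shape[0]
    best, best_score = (0, 0), -np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            win = prior_h[search + dy:search + dy + n, search + dx:search + dx + n]
            score = np.sum(grad_mag(win) * g_local)   # higher = better alignment
            if score > best_score:
                best, best_score = (dy, dx), score
    return best

prior = np.zeros((16, 16))
prior[8, 4:12] = 5.0                       # a short ridge in the prior geodata map
local = prior[5:15, 3:13].copy()           # local view offset by (dy=2, dx=0)
print(best_shift(local, prior))            # recovers the (2, 0) offset
```

In the paper each such match is one evidence term fused with odometry in a clustered particle filter, rather than being applied as a hard correction.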
|
| |
| 15:00-16:30, Paper TuI2I.210 | Add to My Program |
| Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition |
|
| Xiao, Jiuhong | New York University |
| Zhou, Yang | New York University |
| Loianno, Giuseppe | UC Berkeley |
Keywords: Deep Learning for Visual Perception
Abstract: Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: http://xjh19971.github.io/QAA.
|
| |
| 15:00-16:30, Paper TuI2I.211 | Add to My Program |
| Hybrid Membrane–Solenoid Valves for High-Flow, Precisely Controlled Soft Robotic Actuation |
|
| Velimirovic, Nikola | University of Stuttgart |
| Remy, C. David | University of Stuttgart |
| Bruder, Daniel | University of Michigan |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Soft Sensors and Actuators
Abstract: Fluid-driven soft robots promise high-dimensional motion and cost-effective scalability, but their performance is constrained by the limited flow capacity of compact solenoid valves and the integration challenges of large membrane valves. This paper introduces a modular hybrid valve architecture that couples a high-flow membrane valve with an integrated miniature solenoid pilot. The resulting composite element achieves both high mass flow rates and precise electronic control while maintaining a compact, lightweight, and fabrication-friendly design. We present the design, modeling, and control strategies for these valves, and evaluate their performance through three experiments: tank pressure regulation, actuation of a weight-curling robot, and integration into a planar two-stage tossing robot. Across all cases, the hybrid membrane valves significantly outperformed solenoid valves, exhibiting faster pressurization and venting, higher bandwidth, and over a five-fold increase in mechanical power output. These results demonstrate that membrane-solenoid hybrid valves provide a scalable and integrable solution for overcoming the “piping-problem,” enabling hyper-actuated soft robotic systems.
|
| |
| 15:00-16:30, Paper TuI2I.212 | Add to My Program |
| Learning Policies for Dynamic Coalition Formation in Multi-Robot Task Allocation |
|
| Bezerra, Lucas | King Abdullah University of Science and Technology |
| Santos, Ataide | Federal University of Sergipe |
| Park, Shinkyu | KAUST |
Keywords: Multi-Robot Systems, Cooperating Robots, Reinforcement Learning
Abstract: We propose a decentralized, learning-based framework for dynamic coalition formation in Multi-Robot Task Allocation (MRTA). Our approach extends MAPPO by integrating spatial action maps, robot motion planning, intention sharing, and task allocation revision to enable effective and adaptive coalition formation. Extensive simulation studies confirm the effectiveness of our model, enabling each robot to rely solely on local information to learn timely revisions of task selections and form coalitions with other robots to complete collaborative tasks. The results also highlight the proposed framework’s ability to handle large robot populations and adapt to scenarios with diverse task sets.
|
| |
| 15:00-16:30, Paper TuI2I.213 | Add to My Program |
| SLoFT: End-To-End Semantic Localization with Floorplan and Transformer |
|
| Min, Chaerin | Brown University |
| Yu, Hongsheng | Google |
| Fan, Fengtao | Google |
| Sridhar, Srinath | Brown University |
| Wu, Qiuxuan | Google LLC |
| Guo, Chao | Google |
Keywords: Localization, Visual Learning, Deep Learning for Visual Perception
Abstract: Visual localization is critical for AR navigation, AI-driven audio guidance, and mobile robot localization. However, traditional SLAM methods that rely on pre-built 3D maps suffer from high costs, privacy concerns, and sensitivity to environmental changes. Recent floorplan-based localization methods attempt to address these challenges by using 2D floorplans, eliminating the need for 3D map construction. Still, existing approaches are often impractical for real-world applications, as they are limited to specific layouts and fail to generalize beyond their training domains. We propose a novel approach that learns to semantically match visual cues from a camera image to a floorplan image with texts and symbols, inspired by the human ability to directly localize oneself using a complex floorplan image. To achieve this, we train a single, unified model on a diverse dataset of 1.2M images and 740K floorplans that we curated, which includes a new collection of semantically-rich, real-world floorplans. This allows our model to generalize effectively to previously unseen areas and demonstrates potential towards zero-shot capabilities. Without making assumptions about camera poses or floorplan structures, our end-to-end model significantly outperforms existing methods and exhibits strong robustness to floorplan rotations, lighting changes, and different camera intrinsics, while effectively leveraging semantic cues like text.
|
| |
| 15:00-16:30, Paper TuI2I.214 | Add to My Program |
| ARMOR: Attack-Resilient Reinforcement Learning Control for UAVs |
|
| Dash, Pritam | University of British Columbia |
| Chan, Ethan | The University of British Columbia |
| Lawrence, Nathan P. | UC Berkeley |
| Pattabiraman, Karthik | University of British Columbia |
Keywords: Aerial Systems: Applications, Robot Safety, Transfer Learning
Abstract: Unmanned Aerial Vehicles (UAVs) depend on onboard sensors for perception, navigation, and control. However, these sensors are susceptible to physical attacks, such as GPS spoofing, that can corrupt state estimates and lead to unsafe behavior. While reinforcement learning (RL) offers adaptive control capabilities, existing safe RL methods are ineffective against such attacks. We present ARMOR (Adaptive Robust Manipulation-Optimized State Representations), an attack-resilient, model-free RL controller that enables robust UAV operation under adversarial sensor manipulation. Instead of relying on raw sensor observations, ARMOR learns a robust latent representation of the UAV’s physical state via a two-stage training framework. In the first stage, a teacher encoder, trained with privileged attack information, generates attack aware latent states for RL policy training. In the second stage, a student encoder is trained via supervised learning to approximate the teacher’s latent states using only historical sensor data, enabling real-world deployment without privileged information. Our experiments show that ARMOR outperforms conventional methods, ensuring UAV safety. Additionally, ARMOR improves generalization to unseen attacks and reduces training cost by eliminating the need for iterative adversarial training.
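The second training stage the abstract describes is a distillation problem: fit a student encoder so its output matches the teacher's latent states using only sensor history. The toy below shows that supervised-regression structure with a linear student and synthetic data; the "teacher" map, dimensions, and learning rate are all invented stand-ins for ARMOR's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))        # stand-in sensor-history features
W_teacher = rng.standard_normal((8, 4))  # pretend teacher: a fixed latent map
Z = X @ W_teacher                        # privileged teacher latents (targets)

W = np.zeros((8, 4))                     # student parameters, trained from scratch
for _ in range(500):                     # plain gradient descent on the MSE loss
    err = X @ W - Z                      # prediction error against teacher latents
    W -= 0.05 * X.T @ err / len(X)       # gradient step

print(np.mean((X @ W - Z) ** 2))         # distillation loss shrinks toward 0
```

Once the student reproduces the teacher's latents from history alone, the privileged attack information is no longer needed at deployment time, which is the point of the two-stage design.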
|
| |
| 15:00-16:30, Paper TuI2I.215 | Add to My Program |
| Spiking-Refined 3D Object Detection through YOLO–SNN Fusion |
|
| Kusumo, Budiarianto Suryo | Chemnitz University of Technology |
| Thomas, Ulrike | Chemnitz University of Technology |
Keywords: Object Detection, Segmentation and Categorization, Visual Learning
Abstract: This paper presents Spiking-Refined 3D Object Detection through YOLO–SNN Fusion, a real-time pipeline that leverages both convolutional and spiking neural representations for enhanced scene perception. Our system integrates YOLOv11 for robust 2D detection, Depth Anything v2 for monocular depth inference, and geometry-based reasoning for 3D bounding box construction, while a Bird’s-Eye View visualizer provides spatial context. To further improve recognition consistency, we fuse the predictions of a trained Spiking Neural Network (SNN) with YOLO outputs, enabling class refinement that is more resilient to temporal noise and ambiguous appearances. Kalman filtering is employed to stabilize trajectories over time, ensuring coherent 3D tracking. Unlike sensor-heavy setups, our approach runs on a single RGB camera and lightweight models, making it suitable for robotic perception, AR/VR applications, and low-cost embedded platforms. Experiments on real-world video sequences demonstrate improved 3D detection accuracy, temporal stability, and cross-class discrimination compared to conventional monocular pipelines.
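The geometry-based step linking the 2D detector and monocular depth amounts to back-projecting a detection through the camera intrinsics: a 2D box center plus an estimated depth yields a 3D box center. The sketch below shows that lifting under assumed toy intrinsics and a hypothetical detection; it is not the paper's pipeline code.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel with known depth into a 3D point in the camera frame."""
    x = (u - cx) / fx * depth    # lateral offset scaled by depth
    y = (v - cy) / fy * depth    # vertical offset scaled by depth
    return np.array([x, y, depth])

# Hypothetical 2D detector box (x1, y1, x2, y2) and a depth value that might
# come from the median of a monocular depth map inside the box.
x1, y1, x2, y2 = 300, 200, 340, 280
u, v = (x1 + x2) / 2, (y1 + y2) / 2
center = backproject(u, v, depth=4.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(center)  # 3D box center in meters, camera frame
```

A full 3D box would additionally need extent and orientation estimates; this shows only the center placement that anchors the Bird's-Eye View visualization.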
|
| |
| 15:00-16:30, Paper TuI2I.216 | Add to My Program |
| SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception |
|
| Khurana, Gurmeher | Imperial College London |
| Wei, Lan | Imperial College London |
| Zhang, Dandan | Imperial College London |
Keywords: Representation Learning, Perception for Grasping and Manipulation, Force and Tactile Sensing
Abstract: Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives: Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), which keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE), approaching the supervised upper bound. These findings indicate that, for fused visual–tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry; this enables more capable robotic perception.
|
| |
| 15:00-16:30, Paper TuI2I.217 | Add to My Program |
| Dynamically Extensible and Retractable Robotic Leg Linkages for Multi-Task Execution in Search and Rescue Scenarios |
|
| Harris, William | University of California, San Diego |
| Yager, Lucas | University of California, San Diego |
| Sylvester, Syler | University of California, San Diego |
| Peiros, Lizzie | University of California, San Diego |
| Yip, Michael C. | University of California, San Diego |
Keywords: Legged Robots, Mechanism Design, Search and Rescue Robots
Abstract: Search and rescue (SAR) robots are required to quickly traverse terrain and perform high-force rescue tasks, necessitating both terrain adaptability and controlled high-force output. Few platforms exist today for SAR, and fewer still can cover both terrain adaptability and high-force output during extraction. While legged robots offer significant ability to traverse uneven terrain, they are typically unable to incorporate mechanisms that provide variable high-force outputs, unlike traditional wheel-based drive trains. This work introduces a novel concept for a dynamically extensible and retractable robot leg. Leveraging a dynamically extensible and retractable five-bar linkage design, it allows for mechanically switching between height-advantaged and force-advantaged configurations via a geometric transformation. A testbed evaluated leg performance across linkage geometries and operating modes, with empirical and analytical analyses conducted on stride length, force output, and stability. The results demonstrate that the morphing leg offers a promising path toward SAR robots that can both navigate terrain quickly and perform rescue tasks effectively.
|
| |
| 15:00-16:30, Paper TuI2I.218 | Add to My Program |
| Diffusing Trajectory Optimization Problems for Recovery During Multi-Finger Manipulation |
|
| Kumar, Abhinav | University of Michigan |
| Yang, Fan | University of Michigan |
| Aguilera, Sergio | Pontificia Universidad Catolica De Chile |
| Iba, Soshi | Honda Research Institute USA |
| Soltani Zarrin, Rana | Honda Research Institute - USA |
| Berenson, Dmitry | University of Michigan |
Keywords: Dexterous Manipulation, Deep Learning in Grasping and Manipulation, Failure Detection and Recovery
Abstract: Multi-fingered hands are emerging as powerful platforms for performing fine manipulation tasks, including tool use. However, environmental perturbations or execution errors can impede task performance, motivating the use of recovery behaviors that enable normal task execution to resume. In this work, we take advantage of recent advances in diffusion models to construct a framework that autonomously identifies when recovery is necessary and optimizes contact-rich trajectories to recover. We use a diffusion model trained on the task to estimate when states are not conducive to task execution, framed as an out-of-distribution detection problem. We then use diffusion sampling to project these states in-distribution and use trajectory optimization to plan contact-rich recovery trajectories. We also propose a novel diffusion-based approach that distills this process to efficiently diffuse the full parameterization, including constraints, goal state, and initialization, of the recovery trajectory optimization problem, saving time during online execution. We compare our method to a reinforcement learning baseline and other methods that do not explicitly plan contact interactions, including on a hardware screwdriver-turning task where we show that recovering using our method improves task performance by 96% and that ours is the only method evaluated that can attempt recovery without causing catastrophic task failure. Videos can be found at https://dtourrecovery.github.io/.
|
| |
| 15:00-16:30, Paper TuI2I.219 | Add to My Program |
| CloSE: A Geometric Shape-Agnostic Cloth State Representation |
|
| Kamat, Jay | Institut De Robòtica I Informàtica Industrial (CSIC-UPC) |
| Borras Sol, Julia | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
| Torras, Carme | CSIC-UPC |
Keywords: Computational Geometry, Semantic Scene Understanding, Perception for Grasping and Manipulation
Abstract: Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation based on topological indices computed for edge segments of the cloth border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes or orientation of the cloth. We then abstract these important features from the dGLI disk into a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. We show that this representation is able to accurately predict the fold locations for several simulation clothing datasets. Finally, we also show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code and the dataset can be accessed from: https://close-representation.github.io/
|
| |
| 15:00-16:30, Paper TuI2I.220 | Add to My Program |
| Exploring Single Domain Generalization of LiDAR-Based Semantic Segmentation under Imperfect Labels |
|
| Kong, Weitong | Karlsruhe Institute of Technology |
| Zeng, Zichao | University College London |
| Wen, Di | Karlsruhe Institute of Technology |
| Wei, Jiale | Karlsruhe Institute of Technology |
| Peng, Kunyu | Karlsruhe Institute of Technology |
| Goo, June Moh | University College London |
| Boehm, Jan | UCL |
| Stiefelhagen, Rainer | Karlsruhe Institute of Technology |
Keywords: Semantic Scene Understanding, Deep Learning Methods, Deep Learning for Visual Perception
Abstract: Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available at https://github.com/MKong17/DGLSS-NL.git
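The confidence-aware filtering the abstract mentions — accumulating cross-entropy only where the model's predicted confidence clears a threshold, so likely-mislabeled points are ignored — is a standard noisy-label pattern that can be shown in miniature. The shapes, logits, and threshold below are arbitrary stand-ins, not DuNe's actual loss code.

```python
import numpy as np

def filtered_cross_entropy(logits, labels, tau=0.8):
    """Cross-entropy over only the points whose max softmax probability >= tau."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)     # numerically stable softmax
    conf = probs.max(axis=1)                     # per-point confidence
    keep = conf >= tau                           # confidence filter mask
    if not keep.any():
        return 0.0, keep
    ce = -np.log(probs[keep, labels[keep]])      # CE on the retained points only
    return ce.mean(), keep

logits = np.array([[4.0, 0.0],                   # a confident prediction
                   [0.1, 0.0]])                  # an uncertain one
labels = np.array([0, 1])
loss, keep = filtered_cross_entropy(logits, labels)
print(keep)  # only the confident point contributes: [ True False]
```

In the paper this filtering is applied across the strong and weak branches of the dual-view framework; here a single branch suffices to show the mechanism.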
|
| |
| 15:00-16:30, Paper TuI2I.221 | Add to My Program |
| MACE: Mixture-Of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering |
|
| Liu, Mingkai | Peking University |
| Fan, Dikai | PICO, ByteDance |
| Que, Haohua | Tsinghua University |
| Gao, Haojia | Beijing University of Technology |
| Liu, Xiao | PICO, ByteDance |
| Peng, Shuxue | PICO, ByteDance |
| Lin, Meixia | PICO, ByteDance |
| Gu, Shengyu | PICO, ByteDance |
| Ye, Ruicong | Beijing Forestry University |
| Qiu, Wanli | Peking University |
| Yao, Handong | University of Georgia |
| Zhang, Ruopeng | Chongqing Vocational Institute of Engineering, Big Data and Internet of Things School, Chongqing |
| Huang, Xianliang | ByteDance |
|
|
| |
| 15:00-16:30, Paper TuI2I.222 | Add to My Program |
| Characterization and Evaluation of Screw-Based Locomotion across Aquatic, Granular, and Transitional Media |
|
| Chen, Derek | University of California, San Diego |
| Samuels, Zoe | University of California, San Diego |
| Peiros, Lizzie | University of California, San Diego |
| Mukherjee, Sujaan | University of California, San Diego |
| Yip, Michael C. | University of California, San Diego |
Keywords: Performance Evaluation and Benchmarking, Field Robots, Marine Robotics
Abstract: Screw-based propulsion systems offer promising capabilities for amphibious mobility, yet face significant challenges in optimizing locomotion across water, granular materials, and transitional environments. This study presents a systematic investigation into the locomotion performance of various screw configurations in media such as dry sand, wet sand, saturated sand, and water. Through a principles-first analysis of screw performance, we found that certain parameters dominate performance. Depending on the medium, derived parameters inspired by heat-sink design optimization help categorize performance within the dominant design parameters. Our results provide specific insights into screw shell design and adaptive locomotion strategies to enhance the performance of screw-based propulsion systems for versatile amphibious applications.
|
| |
| 15:00-16:30, Paper TuI2I.223 | Add to My Program |
| Hydrodynamic Optimization of a Spherical Amphibious Robot's Paddle-Wheel for Effective Water Surface Locomotion |
|
| Arif, Muhammad Affan | Xi'an Jiaotong University |
| Song, Jiyuan | Guangming Laboratory, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| Tu, Yao | Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| Duan, YiHui | Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| Shuangge, Yang | Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China |
| Li, Qingquan | Shenzhen University |
|
|
| |
| 15:00-16:30, Paper TuI2I.224 | Add to My Program |
| ShapeForce: Low-Cost Soft Robotic Wrist for Contact-Rich Manipulation |
|
| Zhu, Jinxuan | National University of Singapore |
| Yan, Zihao | National University of Singapore |
| Xiao, Yangyu | National University of Singapore |
| Guo, Jingxiang | National University of Singapore |
| Tie, Chenrui | National University of Singapore |
| Cao, Xinyi | National University of Singapore |
| Zheng, Yuhang | National University of Singapore |
| Shao, Lin | National University of Singapore |
Keywords: Perception for Grasping and Manipulation, Compliant Joints and Mechanisms, Soft Sensors and Actuators
Abstract: Contact feedback is essential for contact-rich robotic manipulation, as it allows the robot to detect subtle interaction changes and adjust its actions accordingly. Six-axis force-torque sensors are commonly used to obtain contact feedback, but their high cost and fragility have discouraged many researchers from adopting them in contact-rich tasks. To offer a more cost-efficient and easily accessible source of contact feedback, we present ShapeForce, a low-cost, plug-and-play soft wrist that provides force-like signals for contact-rich robotic manipulation. Inspired by how humans rely on relative force changes in contact rather than precise force magnitudes, ShapeForce converts external force and torque into measurable deformations of its compliant core, which are then estimated via marker-based pose tracking and converted into force-like signals. Our design eliminates the need for calibration or specialized electronics to obtain exact values, and instead focuses on capturing force and torque changes sufficient for enabling contact-rich manipulation. Extensive experiments across diverse contact-rich tasks and manipulation policies demonstrate that ShapeForce delivers performance comparable to six-axis force-torque sensors at an extremely low cost. More details of this project can be found at our project page: https://shapeforce.github.io/.
|
| |
| 15:00-16:30, Paper TuI2I.225 | Add to My Program |
| Energy-Based Injury Protection Database: Including Shearing Contact Thresholds for Hand and Finger Using Porcine Surrogates |
|
| Kirschner, Robin Jeanne | TU Munich, Institute for Robotics and Systems Intelligence |
| Huber, Anna | Technical University Munich |
| Micheler, Carina M. | Technical University of Munich, TUM School of Medicine, Klinikum rechts der Isar |
| Rajaei, Nader | Technical University of Munich |
| Burgkart, Rainer | Technische Universität München |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
|
|
| |
| 15:00-16:30, Paper TuI2I.226 | Add to My Program |
| Iterative Learning-Based Centre-Of-Mass Impedance Control for Articulated-Soft Humanoid Robots |
|
| Wang, Yibin | King's College London |
| Zhou, Lin | King’s College London |
| Morris, Sacha | King's College London |
| Luo, Shan | King's College London |
| Spyrakos-Papastavridis, Emmanouil | King's College London |
Keywords: Compliance and Impedance Control, Body Balancing, Humanoid Robot Systems
Abstract: Achieving safe and robust interaction in articulated-soft humanoid robots (ASRs) remains a major challenge due to their compliant joints, many degrees of freedom, and highly nonlinear coupled dynamics, which make them especially sensitive to external disturbances. This paper presents a novel contact-force-based iterative learning center-of-mass (CoM) impedance control framework (CF-IL-CIC) specifically designed to enhance disturbance robustness in floating-base ASRs. The key idea is to iteratively derive a time-series gross force compensation term from zero moment point (ZMP) tracking errors of previous trials, using a proportional-derivative (PD)-type update rule in simulation. This compensation is integrated with a contact-force-based CoM impedance controller to improve push recovery without requiring precise dynamic models or heavy online optimization. The approach is accompanied by mathematical proof of divergent component of motion (DCM) error convergence, ensuring theoretical stability guarantees. The proposed method is validated through both dynamic simulations and real-robot experiments on the compliant humanoid BRUCE, demonstrating significant improvements in external impact rejection and recovery stability compared to baseline controllers.
|
| |
| 15:00-16:30, Paper TuI2I.227 | Add to My Program |
| Explaining Failures of Cyber-Physical Systems with Actual Causality |
|
| Elimelech, Khen | King's College London |
| Yaacov, Tom | King's College London |
| Kelly, David A. | King's College London |
| Chockler, Hana | King's College London |
| Vardi, Moshe Y. | Rice University |
Keywords: Formal Methods in Robotics and Automation, Hybrid Logical/Dynamical Planning and Verification, Failure Detection and Recovery
Abstract: Modern autonomous Cyber-Physical Systems (CPSs), such as self-driving cars, face increasingly complex demands, and yet are expected to act reliably. The black-box nature often characterizing such systems, especially those relying on neural components, makes it impossible to fully verify the system behavior prior to deployment. Unfortunately, unexpected failures--cases when the system does not comply with its specification--are inevitable and may have catastrophic implications. To improve trust in the system and facilitate future mitigation after a failure occurs, it is important to try to derive an explanation for the unexpected system behavior. This paper introduces the novel concept of leveraging the framework of actual causality for CPS failure explanation. Until now, this framework was only used to derive explanations in the context of simple systems, such as image classifiers. This paper addresses the theoretical gaps and provides the guidance needed to allow for correct explanation derivation in the CPS domain. Beyond the theoretical contribution, the paper presents two novel, practical, system-agnostic explanation derivation algorithms, allowing users to prioritize either explanation optimality or derivation efficiency. The approach is demonstrated and evaluated in the context of a neural-network-controlled autonomous car, designed to avoid collisions.
|
| |
| 15:00-16:30, Paper TuI2I.228 | Add to My Program |
| CRAFT: Long-Horizon Cable Routing Algorithm and Low-Friction Caging Gripper |
|
| Chen, Ziyang | UC Berkeley |
| Chitambar, Yash | University |
| Lam, Theodore | University of California, Berkeley |
| Li, Hui | Autodesk Research |
| Chitta, Sachin | Autodesk Inc |
| Goldberg, Ken | UC Berkeley |
Keywords: Industrial Robots, Factory Automation, Bimanual Manipulation
Abstract: Cable routing is a common manipulation task in assembly and manufacturing, yet it remains challenging due to the deformable nature of cables and the constraints of cluttered routing environments. In this paper, we present CRAFT: Cable Routing Around Fixtures using Two grippers, a novel hardware-plus-software architecture that integrates unimanual and bimanual operations for long-horizon cable routing. To address jamming due to friction, we present a novel caging gripper with a roller mechanism. Physical experiments consisting of 160 trials on a modified NIST board with five types of fixtures and turning angles up to 930 degrees yield an average completion ratio of 84.5% across four routing difficulty tiers, representing a 54.2% improvement over an earlier baseline. The cable routing materials and benchmarks are available at https://manipulation-net.org/tasks/cable_routing.html.
|
| |
| 15:00-16:30, Paper TuI2I.229 | Add to My Program |
| Lifelong Localization in Dynamic Indoor Environments Combining Odometry with Sparse Distance Sampling |
|
| Bilevich, Michael M. | Tel Aviv University |
| Buber, Tomer | Tel Aviv University |
| Halperin, Dan | Tel Aviv University |
Keywords: Localization
Abstract: Localization is a key task in robot navigation, and many techniques exist for it. In many plausible scenarios, a robot might face unforeseen, dynamic obstacles, rendering any pre-determined map inaccurate for localization. In this work, we propose a robust lifelong localization framework in dynamic planar indoor environments, using the robot's odometry and sparse distance sampling. We demonstrate how distance samples can be used to provide a robust prior on the robot's location. This technique can solve the kidnapped robot problem in real time, up to symmetries. Based on insights from real-world recorded data, we also account for dynamic obstacles. We then fuse this prior, over time, with the odometry to converge to the robot's location. A central property of our method is that it provably converges to the robot's ground truth pose even in large indoor environments when the environment is static. We further show that this guarantee also holds in dynamic environments, as long as the nature of those changes has been correctly learned. We demonstrate the effectiveness of our approach in different real-world indoor environments. In particular, we achieve localization comparable to SLAM with merely a few (sixteen) distance samples, as opposed to the full LiDAR range. Relying on only sparse distance sampling is advantageous in terms of sensor cost, privacy, storage space, and transmission bandwidth.
|
| |
| 15:00-16:30, Paper TuI2I.230 | Add to My Program |
| Residual Off-Policy RL for Finetuning Behavior Cloning Policies |
|
| Ankile, Lars | Stanford University |
| Jiang, Zhenyu | The University of Texas at Austin |
| Duan, Yan | Amazon |
| Shi, Guanya | Carnegie Mellon University |
| Abbeel, Pieter | UC Berkeley |
| Nagabandi, Anusha | UC Berkeley |
Keywords: Bimanual Manipulation, Reinforcement Learning, Imitation Learning
Abstract: Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency and safety concerns. These challenges are compounded for high-degree-of-freedom (DoF) systems that must learn from sparse rewards over long horizons. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-DoF systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world.
|
| |
| 15:00-16:30, Paper TuI2I.231 | Add to My Program |
| Stair Climbing for Vehicles with Articulated Tracked Arms: Closed-Loop Flippers Control |
|
| Henriques da Silva, Thales | Federal University of Rio De Janeiro |
| Lizarralde, Fernando | Federal University of Rio De Janeiro |
Keywords: Autonomous Vehicle Navigation, Constrained Motion Planning, Wheeled Robots
Abstract: We present a new approach to the stair-climbing problem for robots that rely on actively articulated tracked arms. The robot in question is considered to have a main locomotion system, such as wheels or tracks, and arms that can be controlled to extend the robot's mobility when needed. Further, we also assume the robot is equipped with a depth sensor (for stair perception) and an IMU (solely for orientation estimates). This paper's proposed key feature is to analyze the robot's differential kinematics as a planar manipulator with a position constraint. For the proposed model, we then present a state feedback control law with stability and convergence properties to move the arms, guiding the robot towards the stairs autonomously. The controller fits the tracks to the floor, making the robot perform appropriate maneuvers, like a snake climbing an inclined plane, preventing sudden movements, improving traction, and avoiding collisions with the floor. The presented method offers a novel way of interpreting the problem. The proposed control scheme is validated on a real robot, and experimental results are presented.
|
| |
| 15:00-16:30, Paper TuI2I.232 | Add to My Program |
| Optimal Control Approach for Non-Prehensile Ball Juggling Using a 7-DoF Manipulator |
|
| Ramadani, Joel | Technical University of Munich |
| Rakcevic, Vasilije | Technical University of Munich |
| Laha, Riddhiman | Technical University of Munich |
| Sachtler, Arne | Technical University of Munich (TUM) |
| Le Mesle, Valentin | Technical University of Munich |
| Lilienthal, Achim J. | TU Munich |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Hybrid Logical/Dynamical Planning and Verification, Dynamics, Task and Motion Planning
Abstract: Non-prehensile object manipulation skills are important for real-world robot interactions, enabling highly dynamic tasks such as balancing a glass on a tray or the controlled sliding of items on a table. Among such tasks, those characterised by high-speed manipulation requirements and general sensitivity of the resulting hybrid dynamics are particularly hard to accomplish. Within these, juggling can be seen as a highly challenging maneuver to be solved. The key to robotic juggling is achieving dynamic stabilisation of an underactuated object. Since the object does not possess the ability of self-correction, its stability is entirely dependent on the forces applied to it. This creates a system that is sensitive to control inputs, where timing is critical to continuously counteract deviations and maintain the desired behavior. We develop a systematic method to control a 7-degree-of-freedom manipulator performing non-prehensile ball juggling with a tool. Our primary contribution is a model-based framework for generating juggling trajectories and stabilizing a periodic juggling motion for this hybrid system. The framework incorporates a two-stage optimal control approach to compute the underlying feasible motion patterns required for stable juggling. Offline-computed trajectories are then organised to enable real-time error correction without solving optimal control problems online. We demonstrate the effectiveness of the resulting controller by first evaluating its performance in a simulation environment and then performing an experiment on a Franka Emika Panda robot.
|
| |
| 15:00-16:30, Paper TuI2I.233 | Add to My Program |
| Amortized NeuralSDF-Mesh Collision Detection for Robotic Contact Simulation |
|
| Yun, Jinhee | Seoul National University |
| Lee, Jeongmin | Seoul National University, Holiday Robotics |
| Park, Sunkyung | Seoul National University |
| Lee, Dongjun | Seoul National University |
Keywords: Simulation and Animation, Contact Modeling, Optimization and Optimal Control
Abstract: Collision detection is a fundamental problem in robotics, but handling collisions between non-convex objects remains challenging. A common approach for representing non-convex geometry is a signed distance function (SDF). Voxel-based SDF (VoxelSDF) enables fast distance queries but suffers from discretization artifacts and high memory costs. Neural implicit SDF (NeuralSDF) provides a continuous and memory-efficient representation with generalization, yet its slow query speed has limited its use in collision detection. To overcome these limitations, this paper proposes a novel amortized NeuralSDF–mesh collision detection framework. NeuralSDF–mesh collisions are formulated as a constrained optimization problem at the triangle level, and the Karush–Kuhn–Tucker conditions are derived to enable the amortization. A learning-based amortized optimization directly predicts collisions in a single forward pass, eliminating iterative optimization procedures. The amortized model adopts an auto-decoder architecture, extending the advantages of NeuralSDF in memory efficiency and category-level generalization to collision detection. Experiments demonstrate substantial speedups over baseline methods while maintaining comparable contact quality and reduced memory usage. The proposed approach also exhibits category-level generalization to unseen objects and can be applied to various robotic simulation scenarios.
|
| |
| 15:00-16:30, Paper TuI2I.234 | Add to My Program |
| Volumetric Ergodic Control |
|
| Kwon, Jueun | Northwestern University |
| Sun, Max Muchen | Northwestern University |
| Murphey, Todd | Northwestern University |
Keywords: Motion and Path Planning, Planning under Uncertainty, Sensor-based Control
Abstract: Ergodic control synthesizes optimal coverage behaviors over spatial distributions for nonlinear systems. However, existing formulations model the robot as a non-volumetric point, whereas in practice a robot interacts with the environment through its body and sensors with physical volume. In this work, we introduce a new ergodic control formulation that optimizes spatial coverage using a volumetric state representation. Our method preserves the asymptotic coverage guarantees of ergodic control, adds minimal computational overhead for real-time control, and supports arbitrary sample-based volumetric models. We evaluate our method across search and manipulation tasks---with multiple robot dynamics and end-effector geometries or sensor models---and show that it improves coverage efficiency by more than a factor of two while maintaining a 100% task completion rate across all experiments, outperforming the standard ergodic control method. Finally, we demonstrate the effectiveness of our method on a robot arm performing mechanical erasing tasks. Project website: https://murpheylab.github.io/vec/.
|
| |
| 15:00-16:30, Paper TuI2I.235 | Add to My Program |
| Consensus Driven Dynamical Systems Control for Dual-Arm Handover |
|
| Das, Debojit | Indian Institute of Technology Gandhinagar |
| Jain, Siddhi | Addverb Technologies |
| Kumar, Rajesh | Addverb Technologies |
| Palanthandalam-Madapusi, Harish | Indian Institute of Technology Gandhinagar |
Keywords: Bimanual Manipulation, Dual Arm Manipulation, Manipulation Planning
Abstract: Robot bimanual handovers (transferring an object between two arms) require careful coordination of timing, motion, and obstacle avoidance. Efficient, human-like object transfer between cooperating robots demands both spatial and tight temporal coordination. Existing approaches treat these requirements in isolation or rely on pre-computed trajectories that fail when obstacles/disturbances appear, degrading performance, producing segmented behavior, and introducing desynchronization. This paper introduces a dynamical systems framework that transitions each arm from independent asynchronous motion to coupled synchronous coordination. In this context, coupling denotes both the spatial coordination of the arms and their temporal synchronization. The framework's coordination and synchrony are robust to obstacles/disturbances along its path. Experiments on an upper torso dual-arm platform and on traditional manipulators show seamless handovers that remain stable despite obstructions, always preserving spatial coordination and temporal synchrony.
|
| |
| 15:00-16:30, Paper TuI2I.236 | Add to My Program |
| Find Anything Like Humans: Online Semantic Mapping and Coarse-To-Fine Navigation in Dynamic Environments |
|
| Zhang, Yutian | State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School |
| Zhang, Jianyu | State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School |
| Liu, Mengyuan | State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School |
|
|
| |
| 15:00-16:30, Paper TuI2I.237 | Add to My Program |
| Dribble Master: Learning Agile Humanoid Dribbling through Legged Locomotion |
|
| Wang, Zhuoheng | Tsinghua University |
| Zhou, Jinyin | Tsinghua University |
| Wu, Qi | Cornell University |
Keywords: Humanoid and Bipedal Locomotion, Humanoid Robot Systems, Reinforcement Learning
Abstract: Humanoid soccer dribbling is a highly challenging task that demands dexterous ball manipulation while maintaining dynamic balance. Traditional rule-based methods often struggle to achieve accurate ball control due to their reliance on fixed walking patterns and limited adaptability to real-time ball dynamics. To address these challenges, we propose a two-stage curriculum learning framework that enables a humanoid robot to acquire dribbling skills without explicit dynamics or predefined trajectories. In the first stage, the robot learns basic locomotion skills; in the second stage, we fine-tune the policy for agile dribbling maneuvers. We further introduce a virtual camera model in simulation that simulates the field of view and perception constraints of the real robot, enabling realistic ball perception during training. We also design heuristic rewards to encourage active sensing, promoting a broader visual range for continuous ball perception. The policy is trained in simulation and successfully transferred to a physical humanoid robot. Experimental results demonstrate that our method enables effective ball manipulation, achieving flexible and visually appealing dribbling behaviors across multiple environments. This work highlights the potential of reinforcement learning in developing agile humanoid soccer robots. Additional details and videos are available at https://zhuoheng0910.github.io/dribble-master/.
|
| |
| 15:00-16:30, Paper TuI2I.238 | Add to My Program |
| Gaussian Process Implicit Surfaces As Control Barrier Functions for Safe Robot Navigation |
|
| Khan, Mouhyemen | Symbotic |
| Ibuki, Tatsuya | Meiji University |
| Chatterjee, Abhijit | Georgia Institute of Technology |
Keywords: Collision Avoidance, Optimization and Optimal Control, Aerial Systems: Mechanics and Control
Abstract: Level set methods underpin modern safety techniques such as control barrier functions (CBFs), while also serving as implicit surface representations for geometric shapes via distance fields. Inspired by these two paradigms, we propose a unified framework where the implicit surface itself acts as a CBF. We leverage Gaussian process (GP) implicit surface (GPIS) to represent the safety boundaries, using safety samples derived from sensor measurements to condition the GP. The GP posterior mean defines the implicit safety surface (safety belief), while the posterior variance provides a robust safety margin. Although GPs have favorable properties such as uncertainty estimation and analytical tractability, they scale cubically with data. To alleviate this issue, we develop a sparse solution called sparse Gaussian CBFs. To the best of our knowledge, GPIS has not been explicitly used to synthesize CBFs. We validate the approach on collision avoidance tasks in two settings: a simulated 7-DOF manipulator operating around the Stanford bunny, and a quadrotor navigating in 3D around a physical chair. In both cases, Gaussian CBFs (with and without sparsity) enable safe interaction and collision-free execution of trajectories that would otherwise intersect the objects.
|
| |
| 15:00-16:30, Paper TuI2I.239 | Add to My Program |
| Scalable Multi Agent Diffusion Policies for Coverage Control |
|
| Vatnsdal, Frederic | University of Pennsylvania |
| Garcia Camargo, Romina | University of Pennsylvania |
| Agarwal, Saurav | University of Pennsylvania |
| Ribeiro, Alejandro | University of Pennsylvania |
Keywords: Multi-Robot Systems, Swarm Robotics, Distributed Robot Systems
Abstract: We propose MADP, a novel diffusion-model-based approach for collaboration in decentralized robot swarms. MADP leverages diffusion models to generate samples from complex and high-dimensional action distributions that capture the interdependencies between agents' actions. Each robot conditions policy sampling on a fused representation of its own observations and perceptual embeddings received from peers. To evaluate this approach, we task a team of holonomic robots piloted by MADP to address coverage control---a canonical multi-agent navigation problem. The policy is trained via imitation learning from a clairvoyant expert on the coverage control problem, with the diffusion process parameterized by a spatial transformer architecture to enable decentralized inference. We evaluate the system under varying numbers, locations, and variances of importance density functions, capturing the robustness demands of real-world coverage tasks. Experiments demonstrate that our model inherits valuable properties from diffusion models, generalizing across agent densities and environments, and consistently outperforming state-of-the-art baselines.
|
| |
| 15:00-16:30, Paper TuI2I.240 | Add to My Program |
| A Bayesian Reasoning Framework for Robotic Systems in Autonomous Casualty Triage |
|
| Rusiecki, Szymon | AGH University of Krakow |
| Morales, Cecilia | Carnegie Mellon University |
| Störy, Pia | Osnabrück University |
| Elenberg, Kimberly | Carnegie Mellon University |
| Weiss, Leonard | University of Pittsburgh |
| Dubrawski, Artur | Carnegie Mellon University |
Keywords: Probabilistic Inference, Search and Rescue Robots, Sensor Fusion
Abstract: Autonomous robots deployed in mass casualty incidents (MCI) face the challenge of making critical decisions based on incomplete and noisy perceptual data. We present an autonomous robotic system for casualty assessment that fuses outputs from multiple vision-based algorithms, estimating signs of severe hemorrhage, visible trauma, or physical alertness, into a coherent triage assessment. At the core of our system is a Bayesian network, constructed from expert-defined rules, which enables probabilistic reasoning about a casualty's condition even with missing or conflicting sensory inputs. The system, evaluated during the DARPA Triage Challenge (DTC) in realistic MCI scenarios involving 11 and 9 casualties, demonstrated a nearly three-fold improvement in physiological assessment accuracy (from 15% to 42% and 19% to 46%) compared to a vision-only baseline. More importantly, overall triage accuracy increased from 14% to 53%, while the diagnostic coverage of the system expanded from 31% to 95% of cases. These results demonstrate that integrating expert-guided probabilistic reasoning with advanced vision-based sensing can significantly enhance the reliability and decision-making capabilities of autonomous systems in critical real-world applications.
|
| |
| 15:00-16:30, Paper TuI2I.241 | Add to My Program |
| Disaster-Aware Informative Path Planning in Emergency Response Scenarios |
|
| Cheng, Xinya | Beijing University of Posts and Telecommunications |
| Liu, Na | Beijing University of Posts and Telecommunications |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Vision-Based Navigation
Abstract: In emergency response scenarios, rapid acquisition of critical disaster information supports effective decision-making. Traditional geometric coverage-based path planning often struggles to balance efficiency and information value. To address this, we propose a Disaster-Aware Informative Path Planning (DAIPP) method, which integrates a Siamese UNet-based building damage recognition model and formulates a novel information value function that considers recognition results, model uncertainty, and flight cost. We design an improved Frontier-based path planning algorithm, named the Selective Frontier Algorithm (SFA), which enhances the selection of candidate points to achieve prioritized exploration of critical regions. To validate its effectiveness, the proposed method is compared with coverage path planning, random planning, and Monte Carlo tree search (MCTS). Experiments on the xView2 dataset demonstrate that the proposed method outperforms baselines in terms of information coverage, semantic target hit rate, and weighted information coverage, providing strong support for efficient disaster perception in emergency response.
|
| |
| 15:00-16:30, Paper TuI2I.242 | Add to My Program |
| CorrectManip: A Data-Driven Closed-Loop Framework for Autonomous Skill Learning with Failure Recovery |
|
| Li, Shiwen | Westlake University |
| Yang, Zhen | Beihang University |
| Wang, Zuofu | Peking University |
| Ma, Bin | The Hong Kong University of Science and Technology (Guangzhou) |
| Tengju, Ye | Udeer |
| Zhao, Shiyu | Westlake University |
| Chen, Junbo | Udeer AI |
| Yu, Kaicheng | Westlake University |
Keywords: Autonomous Agents, Deep Learning Methods, Task Planning
Abstract: Simulation-based training offers an efficient paradigm for robotic skill learning, providing scalable data generation while reducing reliance on costly hardware trials and manual data collection. However, existing methods that rely on handcrafted scenarios fail to fully cover the complexity of open-world variations and neglect the critical insights offered by inevitable failures in unseen environments. As a result, current policies struggle to achieve robust generalization, hindering deployment in open-world settings. This highlights the need for a continuous learning framework that enables robots to reflect on failures and iteratively refine policies in a targeted way. In this paper, we propose CorrectManip, a novel data-driven closed-loop framework that enables the policy to continuously improve performance in unseen environments by learning from failures. Whereas existing methods remain confined to single-loop adaptation, addressing policy errors in static environments or indiscriminately scaling data without targeting failure modes, CorrectManip closes the loop at both policy recovery and environment generation: EvoGen, a self-evolving generator, and TTO, a test-time optimization module. EvoGen adaptively generates training data to strengthen policy performance, while TTO analyzes execution failures to provide fine-grained optimization signals. Together, TTO exposes policy weaknesses and EvoGen converts them into task-relevant training data, forming a closed feedback loop that drives continual policy improvement and stronger generalization. Extensive experiments across diverse tasks demonstrate that CorrectManip improves the average success rate in unseen environments by 45.22% over baseline methods. These results validate the complementary roles of TTO and EvoGen in enhancing generalization. Furthermore, we showcase sim-to-real transfer ability on the Unitree H1 and Unitree G1. Demos are available here.
|
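The closed loop this abstract describes (TTO exposing weaknesses, EvoGen converting them into targeted training data) can be sketched in miniature. Everything below (ToyPolicy, the skill-set environments, the repair rule) is an invented stand-in for illustration, not the paper's system:

```python
class ToyPolicy:
    """Stand-in policy: succeeds in an environment iff it has every required skill."""
    def __init__(self, skills):
        self.skills = set(skills)

    def run(self, env):
        # env is the set of skills the environment demands
        return env <= self.skills


def closed_loop(policy, envs, rounds=3):
    """TTO-style failure analysis feeding EvoGen-style targeted data generation."""
    for _ in range(rounds):
        failures = [env for env in envs if not policy.run(env)]
        if not failures:
            break
        # "TTO": turn each failure into a fine-grained signal (the missing skills)
        signals = set().union(*(env - policy.skills for env in failures))
        # "EvoGen": convert weaknesses into targeted training data; here
        # "training" simply teaches the missing skills
        policy.skills |= signals
    return policy
```

The point of the sketch is the feedback structure: data generation is driven by analyzed failures rather than by indiscriminate scaling.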
| |
| 15:00-16:30, Paper TuI2I.243 | Add to My Program |
| PRRTC: GPU-Parallel RRT-Connect for Fast, Consistent, and Low-Cost Motion Planning |
|
| Huang, Chih H. | Columbia University |
| Jadhav, Pranav | Purdue University |
| Plancher, Brian | Dartmouth College |
| Kingston, Zachary | Purdue University |
Keywords: Motion and Path Planning, Collision Avoidance
Abstract: Sampling-based motion planning algorithms, like the Rapidly-Exploring Random Tree (RRT) and its widely used variant, RRT-Connect, provide efficient solutions for high-dimensional planning problems faced by real-world robots. However, these methods remain computationally intensive, particularly in complex environments that require many collision checks. To improve performance, recent efforts have explored parallelizing specific components of RRT, such as collision checking, or running multiple planners independently. However, little has been done to develop an integrated approach co-designed for large-scale parallelism. In this work, we present pRRTC, an RRT-Connect-based planner co-designed for GPU acceleration across the entire algorithm through parallel expansion and SIMT-optimized collision checking. We evaluate the effectiveness of pRRTC on the MotionBenchMaker dataset using robots with 7, 8, and 14 degrees-of-freedom (DoF). Compared to the state-of-the-art, pRRTC achieves as much as a 10× speedup on constrained reaching tasks with a 5.4× reduction in standard deviation. pRRTC also achieves a 1.4× reduction in average initial path cost. Finally, we deploy pRRTC on a 14-DoF dual Franka Panda arm setup and demonstrate real-time, collision-free motion planning with dynamic obstacles. We open-source our planner to support the wider community.
|
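pRRTC's GPU kernels are not reproduced here; for reference, the extend/connect structure the paper parallelizes looks as follows in a minimal single-threaded sketch, in a 2D unit square with one disc obstacle (the step size, obstacle, and segment-sampling collision checker are all illustrative choices, not the paper's):

```python
import math
import random

STEP = 0.2
OBSTACLE = ((0.5, 0.5), 0.2)  # one disc obstacle in the unit square

def steer(a, b):
    """Move from a toward b by at most STEP."""
    d = math.dist(a, b)
    if d <= STEP:
        return b
    t = STEP / d
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))

def collision_free(a, b):
    """Toy check: sample the segment a-b against the disc obstacle."""
    (cx, cy), r = OBSTACLE
    for i in range(11):
        t = i / 10
        p = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        if math.dist(p, (cx, cy)) < r:
            return False
    return True

def trace(tree, node):
    """Follow parent pointers back to the tree root."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node]
    return path

def rrt_connect(start, goal, iters=3000, seed=0):
    rng = random.Random(seed)
    ta, tb = {start: None}, {goal: None}  # node -> parent
    for _ in range(iters):
        q_rand = (rng.random(), rng.random())
        # EXTEND: grow tree A one step toward the random sample
        q_near = min(ta, key=lambda q: math.dist(q, q_rand))
        q_new = steer(q_near, q_rand)
        if not collision_free(q_near, q_new):
            continue
        ta[q_new] = q_near
        # CONNECT: greedily grow tree B all the way to q_new
        q = min(tb, key=lambda p: math.dist(p, q_new))
        if q == q_new:  # trees already share this node
            return trace(ta, q_new)[::-1] + trace(tb, tb[q_new])
        while True:
            q_next = steer(q, q_new)
            if not collision_free(q, q_next):
                break
            tb[q_next] = q
            q = q_next
            if q == q_new:  # the two trees have met
                return trace(ta, q_new)[::-1] + trace(tb, tb[q_new])
    return None
```

The per-iteration nearest-neighbor searches, expansions, and segment collision checks are exactly the pieces that the paper maps onto GPU threads.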
| |
| 15:00-16:30, Paper TuI2I.244 | Add to My Program |
| Analytical Stiffness Formulation and Interpretation for Six-DOF Tensegrity Joints Using Screw Theory |
|
| Monke, Robbie | University of Alabama |
| Vikas, Vishesh | University of Alabama |
Keywords: Kinematics, Compliant Joints and Mechanisms, Mechanism Design
Abstract: Compliant mechanisms, e.g., tensegrities, inherently exhibit nonlinear behavior, wherein the stiffness matrix, evaluated at a specific configuration, characterizes the instantaneous relationship between applied forces and resulting displacements. For traditional robot joints, the stiffness matrix is defined using Cartesian and Euler angle parameters. This representation is convenient when the joints display translation or single degree of rotation behavior. However, it faces parameterization issues in modeling higher degree of freedom joints due to singularities and lack of uniqueness. Lie groups and screw theory representations provide a minimal and intrinsic representation of the rigid body motion. This representation is well suited for tensegrity joints which combine tensile and compressive members and behave as six degree-of-freedom joints. A key challenge in this context is that computing the stiffness matrix necessitates differentiating the transformation matrix with respect to the screw, a task that is highly nontrivial. This work derives an analytical formulation of the stiffness matrix for six degree-of-freedom tensegrity joints using screw theory representation, including a closed-form expression for the derivative of the transformation matrix with respect to its exponential coordinates. The analytical results are validated against numerical differentiation while achieving approximately three times faster computation speeds. The paper further interprets the stiffness matrix through block-form, column-wise, and row-wise representations, providing additional physical insight into the translational, rotational, and coupled stiffness contributions. These contributions establish an efficient framework for the stiffness analysis and lay the foundation for future integration of screw theory methods into Euler-Lagrange dynamics for higher degree-of-freedom robot joints including tensegrity joints.
|
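The paper's closed-form derivative is not reproduced here, but the standard screw-theory quantities the abstract builds on can be written down. These are textbook SE(3) identities (the exponential parameterization, its directional derivative through the right Jacobian, and the block structure of a 6x6 stiffness matrix), sketched under assumed notation:

```latex
% A rigid-body pose is the exponential of exponential coordinates xi in R^6:
T(\xi) = \exp\!\big(\hat{\xi}\big) \in SE(3), \qquad \hat{\xi} \in \mathfrak{se}(3).
% A perturbation of the exponential coordinates maps to a body-frame twist
% through the right Jacobian J_r of SE(3):
\frac{d}{dt}\exp\!\big(\hat{\xi}(t)\big)
  = \exp\!\big(\hat{\xi}\big)\,\big(J_r(\xi)\,\dot{\xi}\big)^{\wedge}.
% The configuration-dependent stiffness relates a wrench perturbation
% \delta w to the coordinate perturbation \delta\xi at a configuration \xi_0:
\delta w = K(\xi_0)\,\delta\xi, \qquad
K(\xi_0) = \left.\frac{\partial w}{\partial \xi}\right|_{\xi_0} \in \mathbb{R}^{6\times 6},
% whose 3x3 blocks separate the translational, rotational, and coupled
% stiffness contributions the paper interprets:
K = \begin{bmatrix} K_{vv} & K_{v\omega} \\ K_{\omega v} & K_{\omega\omega} \end{bmatrix}.
```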
| |
| 15:00-16:30, Paper TuI2I.245 | Add to My Program |
| Monocular Visual Odometry Via Diffusion-Based Joint Learning of Optical Flow and Depth |
|
| Hu, Qingyuan | University of Chinese Academy of Sciences |
| Li, Wei | Institute of Computing Technology, Chinese Academy of Sciences |
| Meng, Xuebin | Chinese Academy of Sciences |
| Hu, Yu | Institute of Computing Technology Chinese Academy of Sciences |
Keywords: Localization, SLAM, Visual Learning
Abstract: Monocular visual odometry (VO) often suffers from scale ambiguity and interference from moving objects in real-world scenarios. Jointly learning optical flow and depth estimation provides a promising solution for these issues by leveraging their geometric correlation and task complementarity. In this paper, we propose JFD-VO, a novel monocular VO framework that integrates jointly learned optical flow and depth networks. We design a two-stage training process with recursive noise diffusion and a specialized loss function, which enables the model to predict dense and scale-aware depth and optical flow using only readily available sparse LiDAR data and pose ground truth, thereby eliminating the need for expensive and difficult-to-obtain dense annotations. Furthermore, a dedicated masking module is incorporated during joint training to enhance robustness in dynamic environments. Within the VO pipeline, we introduce a Keypoint-weighted Matching Selection module that prioritizes stable features based on forward-backward flow consistency, rather than treating all pixels equally as in conventional optical flow methods. Extensive experiments on public datasets demonstrate the effectiveness of our joint training approach. JFD-VO achieves state-of-the-art accuracy, reducing absolute trajectory error by 14.99% and 27.37% over KPDepth-VO and DF-VO, respectively. Code and our self-collected dataset are available at: https://github.com/huqingyuan-9952/JFD-VO.
|
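The forward-backward consistency idea behind the Keypoint-weighted Matching Selection module can be illustrated with a toy check: a pixel is stable if following its forward flow and then the backward flow returns (approximately) to where it started. The dict-based flow fields and the threshold below are hypothetical simplifications, not the paper's implementation:

```python
import math

def fb_consistency(forward, backward, thresh=1.0):
    """Keep pixels whose forward flow is undone by the backward flow.

    `forward` and `backward` map integer pixel coords (x, y) to a flow
    vector (dx, dy). Returns the set of pixels passing the check.
    """
    stable = set()
    for (x, y), (dx, dy) in forward.items():
        # nearest-neighbour lookup of the backward flow at the warped point
        q = (round(x + dx), round(y + dy))
        if q not in backward:
            continue
        bx, by = backward[q]
        # a consistent match warps back to (near) the start pixel
        err = math.hypot(dx + bx, dy + by)
        if err < thresh:
            stable.add((x, y))
    return stable
```

Pixels failing the check (occlusions, moving objects, bad matches) would be down-weighted rather than used for pose estimation.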
| |
| 15:00-16:30, Paper TuI2I.246 | Add to My Program |
| Failing Gracefully: Mitigating Impact of Inevitable Robot Failures |
|
| Nguyen, Duc | George Mason University |
| Ghani, Saad Abdul | George Mason University |
| Marshall, Andrew | Independent |
| Andreyev, Allison | Independent |
| Stein, Gregory | George Mason University |
| Xiao, Xuesu | George Mason University |
Keywords: Robot Safety, Safety in HRI, Service Robotics
Abstract: Service robots operate in household environments shared with humans, pets, and everyday objects, where they are highly susceptible to failures such as software crashes, hardware degradation, or unpredictable interactions. While roboticists strive to minimize failures, some remain inevitable, making it critical to mitigate their potential consequences for safe and reliable deployment. This paper introduces a novel safety formulation that evaluates both the probability of impactful interactions between robots and surrounding entities during failures, and the severity of their outcomes. By quantifying the impact of failures on different entities, our approach enables robots to make informed planning decisions that balance safety with task efficiency. To support systematic evaluation, we also present FailBench, a MuJoCo-based simulation framework for studying robot-environment interactions under diverse failure modes, including sensing issues and actuator malfunctions. Together, our safety formulation and FailBench provide a foundation for developing safer and more robust motion plans and learned policies in real-world household environments.
|
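The safety formulation described above (probability of impactful interaction during a failure, weighted by outcome severity) can be sketched as a plan scorer. The severity table, contact probabilities, and trade-off weight below are invented for illustration, not values from the paper:

```python
# invented severity scale: harming a person outweighs scuffing a wall
SEVERITY = {"human": 10.0, "pet": 8.0, "vase": 3.0, "wall": 1.0}

def expected_impact(contact_probs):
    """Sum over entities of P(impactful interaction during failure) x severity."""
    return sum(p * SEVERITY.get(entity, 1.0) for entity, p in contact_probs.items())

def pick_plan(plans, efficiency, lam=0.5):
    """Choose the plan minimizing expected failure impact minus an efficiency bonus.

    plans: name -> {entity: contact probability under failure}
    efficiency: name -> task-efficiency score; lam trades safety vs. efficiency.
    """
    return min(plans, key=lambda name: expected_impact(plans[name]) - lam * efficiency[name])
```

A fast direct route past a person can thus lose to a slower detour once failure consequences are priced in.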
| |
| 15:00-16:30, Paper TuI2I.247 | Add to My Program |
| LLM-Guided Task and Affordance-Level Exploration in Reinforcement Learning |
|
| Luijkx, Jelle Douwe | Delft University of Technology |
| Ma, Runyu | TU Delft |
| Ajanovic, Zlatan | RWTH Aachen University |
| Kober, Jens | University of Stuttgart |
Keywords: Reinforcement Learning
Abstract: Reinforcement learning (RL) is a promising approach for robotic manipulation, but it can suffer from low sample efficiency and requires extensive exploration of large state-action spaces. Recent methods leverage the commonsense knowledge and reasoning abilities of large language models (LLMs) to guide exploration toward more meaningful states. However, LLMs can produce plans that are semantically plausible yet physically infeasible, yielding unreliable behavior. We introduce LLM-TALE, a framework that uses LLMs' planning to directly steer RL exploration. LLM-TALE integrates planning at both the task level and the affordance level, improving learning efficiency by directing agents toward semantically meaningful actions. Unlike prior approaches that assume optimal LLM-generated plans or rewards, LLM-TALE corrects suboptimality online and explores multimodal affordance-level plans without human supervision. We evaluate LLM-TALE on pick-and-place tasks in standard RL benchmarks, observing improvements in both sample efficiency and success rates over strong baselines. Real-robot experiments indicate promising zero-shot sim-to-real transfer. Code and supplementary material are available at https://llm-tale.github.io.
|
| |
| 15:00-16:30, Paper TuI2I.248 | Add to My Program |
| TwinTrack: Bridging Vision and Contact Physics for Real-Time Tracking of Unknown Objects in Contact-Rich Scenes |
|
| Yang, Wen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
| Xie, Zhixian | Arizona State University |
| Wang, Yiting | Arizona State University |
| Tadepalli, Vamsi Sai Abhijit | Arizona State University |
| Ben Amor, Heni | Arizona State University |
| Lin, Shan | Arizona State University |
| Jin, Wanxin | Arizona State University |
Keywords: Perception for Grasping and Manipulation, Visual Tracking, Contact Modeling
Abstract: Real-time tracking of previously unseen, highly dynamic objects in contact-rich scenes, such as during dexterous in-hand manipulation, remains a major challenge. Pure vision-based approaches often fail under heavy occlusions due to frequent contact interactions and motion blur caused by abrupt impacts. We propose TwinTrack, a physics-aware perception system that enables robust, real-time 6-DoF pose tracking of unknown dynamic objects in contact-rich scenes by leveraging contact physics cues. At its core, TwinTrack integrates Real2Sim and Sim2Real. Real2Sim combines vision and contact physics to jointly estimate object geometry and physical properties: an initial reconstruction is obtained from vision, then refined by learning a geometry residual and simultaneously estimating physical parameters (e.g., mass, inertia, and friction) based on contact dynamics consistency. Sim2Real achieves robust pose estimation by adaptively fusing a visual tracker with predictions from the updated contact dynamics. TwinTrack is implemented on a GPU-accelerated, customized MJX engine to guarantee real-time performance. We evaluate our method on two contact-rich scenarios: object falling with environmental contacts and multi-fingered in-hand manipulation. Results show that, compared to baselines, TwinTrack delivers significantly more robust, accurate, and real-time tracking in these challenging settings, with tracking speeds above 20 Hz.
|
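The abstract does not specify how the visual tracker and the contact-dynamics prediction are fused; a generic inverse-variance (Kalman-style) blend conveys the idea of adaptively shifting trust toward physics when vision degrades. This is a simplification for illustration, not TwinTrack's estimator:

```python
def fuse(visual, var_v, dynamics, var_d):
    """Inverse-variance fusion of two 3D position estimates of the same object.

    var_v grows when the object is occluded or motion-blurred, shifting
    weight onto the contact-dynamics prediction, and vice versa.
    """
    w = var_d / (var_v + var_d)  # weight on the visual estimate
    est = tuple(w * v + (1.0 - w) * d for v, d in zip(visual, dynamics))
    fused_var = var_v * var_d / (var_v + var_d)
    return est, fused_var
```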
| |
| 15:00-16:30, Paper TuI2I.249 | Add to My Program |
| Trajectory Conditioned Cross-Embodiment Skill Transfer |
|
| Tang, Yuhang | Northwestern Polytechnical University |
| Lou, Yixuan | Northwestern Polytechnical University |
| Han, Pengfei | Shanghai AI Laboratory |
| Song, Haoming | Shanghai Jiao Tong University |
| Ye, Xinyi | Shanghai AI Laboratory |
| Wang, Dong | Shanghai AI Laboratory |
| Zhao, Bin | Northwestern Polytechnical University |
Keywords: Deep Learning Methods, Task Planning, Computer Vision for Automation
Abstract: Learning manipulation skills from human demonstration videos presents a promising yet challenging problem, primarily due to the significant embodiment gap between human body and robot manipulators. Existing methods rely on paired datasets or hand-crafted rewards, which limit scalability and generalization. We propose TrajSkill, a framework for Trajectory Conditioned Cross-embodiment Skill Transfer, enabling robots to acquire manipulation skills directly from human demonstration videos. Our key insight is to represent human motions as sparse optical flow trajectories, which serve as embodiment-agnostic motion cues by removing morphological variations while preserving essential dynamics. Conditioned on these trajectories together with visual and textual inputs, TrajSkill jointly synthesizes temporally consistent robot manipulation videos and translates them into executable actions, thereby achieving cross-embodiment skill transfer. Extensive experiments are conducted, and the results on simulation data (MetaWorld) show that TrajSkill reduces FVD by 39.6% and KVD by 36.6% compared with the state-of-the-art, and improves cross-embodiment success rate by up to 16.7%. Real-robot experiments in kitchen manipulation tasks further validate the effectiveness of our approach, demonstrating practical human-to-robot skill transfer across embodiments.
|
| |
| 15:00-16:30, Paper TuI2I.250 | Add to My Program |
| Category-Level Object Shape and Pose Estimation in Less Than a Millisecond |
|
| Shaikewitz, Lorenzo | Massachusetts Institute of Technology |
| Nguyen, Tim | Boston University |
| Carlone, Luca | Massachusetts Institute of Technology |
Keywords: Optimization and Optimal Control, Perception for Grasping and Manipulation, RGB-D Perception
Abstract: Object shape and pose estimation is a foundational robotics problem, supporting tasks from manipulation to scene understanding and navigation. We present a fast local solver for shape and pose estimation which requires only category-level object priors and admits an efficient certificate of global optimality. Given an RGB-D image of an object, we use a learned front-end to detect sparse, category-level semantic keypoints on the target object. We represent the target object's unknown shape using a linear active shape model and pose a maximum a posteriori optimization problem to solve for position, orientation, and shape simultaneously. Expressed in unit quaternions, this problem admits first-order optimality conditions in the form of an eigenvalue problem with eigenvector nonlinearities. Our primary contribution is to solve this problem efficiently with self-consistent field iteration, which only requires computing a 4-by-4 matrix and finding its minimum eigenvalue-vector pair at each iterate. Solving a linear system for the corresponding Lagrange multipliers gives a simple global optimality certificate. One iteration of our solver runs in about 100 microseconds, enabling fast outlier rejection. We test our method on synthetic data and a variety of real-world settings, including two public datasets and a drone tracking scenario.
|
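The solver's core loop (rebuild a small symmetric matrix from the current estimate, take its minimum eigenpair, repeat until self-consistent) can be sketched with a synthetic eigenvector nonlinearity. The matrix `build_M` and the stdlib power-iteration eigensolver below are illustrative stand-ins, not the paper's certified formulation:

```python
import math

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def min_eigpair(M, iters=500):
    """Minimum eigenpair of a small symmetric matrix via shifted power iteration
    (a stand-in for a proper eigensolver; adequate for a 4x4)."""
    n = len(M)
    shift = max(sum(abs(x) for x in row) for row in M)  # Gershgorin bound
    S = [[(shift if i == j else 0.0) - M[i][j] for j in range(n)] for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = matvec(S, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = matvec(M, v)
    return sum(v[i] * mv[i] for i in range(n)), v

def scf(build_M, q0, tol=1e-9, max_outer=50):
    """Self-consistent field iteration: q_{k+1} = min-eigvec of M(q_k)."""
    q = q0
    for _ in range(max_outer):
        _, q_new = min_eigpair(build_M(q))
        if sum(a * b for a, b in zip(q, q_new)) < 0:
            q_new = [-x for x in q_new]  # q and -q encode the same rotation
        if max(abs(a - b) for a, b in zip(q, q_new)) < tol:
            return q_new
        q = q_new
    return q

def build_M(q):
    """Synthetic eigenvector nonlinearity: M depends mildly on the current q."""
    D = [3.0, 2.0, 1.0, 0.5]
    return [[(D[i] if i == j else 0.0) + 0.1 * q[i] * q[j] for j in range(4)]
            for i in range(4)]
```

Each outer iterate only builds a 4-by-4 matrix and extracts its smallest eigenpair, which is why one solve can run in the microsecond regime.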
| |
| 15:00-16:30, Paper TuI2I.251 | Add to My Program |
| JuggleRL: Mastering Ball Juggling with a Quadrotor Via Deep Reinforcement Learning |
|
| Ji, Shilong | Tsinghua University |
| Chen, Yinuo | Tsinghua University |
| Wang, Chuqi | Tsinghua University |
| Chen, Jiayu | Tsinghua University |
| Zhang, Ruize | Tsinghua University |
| Gao, Feng | Tsinghua University |
| Tang, Wenhao | Tsinghua University |
| Yu, Shu'ang | Tsinghua University |
| Xiang, Sirui | Tsinghua University |
| Chen, Xinlei | Tsinghua University |
| Yu, Chao | Tsinghua University |
| Wang, Yu | Tsinghua University |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Aerial Systems: Applications
Abstract: Aerial robots interacting with objects must perform precise, contact-rich maneuvers under uncertainty. In this paper, we study the problem of aerial ball juggling using a quadrotor equipped with a racket, a task that demands accurate timing, stable control, and continuous adaptation. We propose JuggleRL, the first reinforcement learning–based system for aerial juggling. It learns closed-loop policies in large-scale simulation using systematic calibration of quadrotor and ball dynamics to reduce the sim-to-real gap. The training incorporates reward shaping to encourage racket-centered hits and sustained juggling, as well as domain randomization over ball position and coefficient of restitution to enhance robustness and transferability. The learned policy outputs mid-level commands executed by a low-level controller and is deployed zero-shot on real hardware, where an enhanced perception module with a lightweight communication protocol reduces delays in high-frequency state estimation and ensures real-time control. Experiments show that JuggleRL achieves an average of 311 hits over 10 consecutive trials in the real world, with a maximum of 462 hits observed, far exceeding a model-based baseline that reaches at most 14 hits with an average of 3.1. Moreover, the policy generalizes to unseen conditions, successfully juggling a lighter 5 g ball with an average of 145.9 hits. This work demonstrates that reinforcement learning can empower aerial robots with robust and stable control in dynamic interaction tasks.
|
| |
| 15:00-16:30, Paper TuI2I.252 | Add to My Program |
| DropClick: Semi-Automated One-Click Segmentation for Agricultural Robotic Data |
|
| Zimmer, Patrick | University of Bonn |
| Halstead, Michael Allan | Bonn University |
| McCool, Christopher Steven | CSIRO |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Computer Vision for Automation
Abstract: Labelling vision datasets, especially for segmentation tasks, is a laborious and costly process that stymies novel developments in agricultural robotics. In this paper, we present DropClick, a click-guided segmentation tool that simplifies the annotation process. Our system utilises single-click inputs on objects to generate pseudo-labels, which can replace manual annotations. DropClick stands out as it is a semi-automated approach and does not require a click for every object in the scene. It can therefore drastically reduce the amount of user input required. We evaluate our method on two challenging agricultural robotic datasets, SB20 and BUP20, for plant and fruit segmentation, respectively. DropClick is first trained on a small subset of just 5 images from the original training data. This DropClick model can then be deployed as a one-click segmentation system and achieves comparable or higher performance than other one-click methods, reaching an mIoU of 70.0 and 72.6 points for SB20 and BUP20, respectively. DropClick then excels at maintaining high performance when clicks are not given (e.g. dropped); when 50% of the clicks are missing, it still maintains an mIoU of 68.9 and 71.3 points for SB20 and BUP20, respectively. We validate DropClick as a pseudo-labelling approach by taking its outputs to train a Mask2Former instance-based segmentation model in a semi-supervised manner. In this process, partially removing user input from DropClick yields similarly high performance compared to providing all clicks, at 70.1 vs 70.7 points AP50 for SB20 and no difference for BUP20 at 77.0 for both models, while saving 46.3% of total input for SB20 and 31.9% for BUP20.
|
| |
| 15:00-16:30, Paper TuI2I.253 | Add to My Program |
| Influence of Gripper Design on Human Demonstration Quality for Robot Learning |
|
| Georgadarellis, Gina | University of Massachusetts Amherst |
| Beslic, Natalija | University of Massachusetts Amherst |
| Lee, Seonhun | University of Massachusetts Amherst |
| Sup IV, Frank | University of Massachusetts Amherst |
| Huber, Meghan | University of Massachusetts Amherst |
Keywords: Design and Human Factors, Telerobotics and Teleoperation, Haptics and Haptic Interfaces
Abstract: Opening sterile medical packaging is routine for healthcare workers but remains challenging for robots. Learning from demonstration enables robots to acquire manipulation skills directly from humans, and handheld gripper tools such as the Universal Manipulation Interface (UMI) offer a pathway for efficient data collection. However, the effectiveness of these tools depends heavily on their usability. We evaluated UMI in demonstrating a bandage opening task, a common manipulation task in hospital settings, by testing three conditions: distributed load grippers, concentrated load grippers, and bare hands. Eight participants performed timed trials, with task performance assessed by success rate, completion time, and damage, alongside perceived workload using the NASA-TLX questionnaire. Concentrated load grippers improved performance relative to distributed load grippers but remained substantially slower and less effective than hands. These results underscore the importance of ergonomic and mechanical refinements in handheld grippers to reduce user burden and improve demonstration quality, especially for applications in healthcare robotics.
|
| |
| 15:00-16:30, Paper TuI2I.254 | Add to My Program |
| LLM-HBT: Dynamic Behavior Tree Construction for Adaptive Coordination in Heterogeneous Robots |
|
| Chaoran, Wang | Zhejiang University |
| Sun, Jingyuan | National University of Singapore |
| Zhang, Yanhui | Zhejiang University |
| Zhang, Mingyu | Sun Yat-Sen University |
| Changju, Wu | Zhejiang University |
Keywords: Behavior-Based Systems, Autonomous Agents, Cognitive Control Architectures
Abstract: We introduce a novel framework for automatic behavior tree (BT) construction in heterogeneous multi-robot systems, designed to address the challenges of adaptability and robustness in dynamic environments. Traditional robots are limited by fixed functional attributes and cannot efficiently reconfigure their strategies in response to task failures or environmental changes. To overcome this limitation, we leverage large language models (LLMs) to generate and extend BTs dynamically, combining the reasoning and generalization power of LLMs with the modularity and recovery capability of BTs. The proposed framework consists of four interconnected modules—task initialization, task assignment, BT update, and failure node detection—which operate in a closed loop. Robots tick their BTs during execution, and upon encountering a failure node, they can either extend the tree locally or invoke a centralized virtual coordinator (Alex) to reassign subtasks and synchronize BTs across peers. This design enables long-term cooperative execution in heterogeneous teams. We validate the framework on 60 tasks across three simulated scenarios and in a real-world café environment with a robotic arm and a wheeled-legged robot. Results show that our method consistently outperforms baseline approaches in task success rate, robustness, and scalability, demonstrating its effectiveness for multi-robot collaboration in complex scenarios.
|
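The tick-and-repair pattern the framework relies on can be shown with a minimal behavior-tree implementation. RUNNING states and the LLM call are omitted for brevity; the recovery subtree is hand-written here where LLM-HBT would generate it:

```python
SUCCESS, FAILURE = "success", "failure"

class Action:
    """Leaf node wrapping an executable skill."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def tick(self):
        return self.fn()

class Sequence:
    """Succeeds only if every child succeeds, in order."""
    def __init__(self, *children):
        self.children = list(children)
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != SUCCESS:
                return status
        return SUCCESS

class Fallback:
    """Tries children in order until one succeeds."""
    def __init__(self, *children):
        self.children = list(children)
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != FAILURE:
                return status
        return FAILURE

def extend_on_failure(seq, failed_idx, recovery):
    """Local BT repair: wrap the failed node in a Fallback whose second branch
    is the recovery subtree (which LLM-HBT would obtain from the LLM)."""
    seq.children[failed_idx] = Fallback(seq.children[failed_idx], recovery)
```

Because the repair is a local subtree substitution, the rest of the tree, and the trees of peer robots, remain valid.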
| |
| 15:00-16:30, Paper TuI2I.255 | Add to My Program |
| HAVEN: Hierarchical Adversary-Aware Visibility-Enabled Navigation with Cover Utilization Using Deep Transformer Q-Networks |
|
| Chauhan, Mihir | Purdue University |
| Conover, Damon | DEVCOM Army Research Laboratory |
| Bera, Aniket | Purdue University |
Keywords: Robotics in Hazardous Fields, Surveillance Robotic Systems
Abstract: Autonomous navigation in partially observable environments requires agents to reason beyond immediate sensor input, exploit occlusion, and ensure safety while progressing toward a goal. These challenges arise in many robotics domains, from urban driving and warehouse automation to defense and surveillance. Classical path planning approaches and memory-less reinforcement learning often fail under limited fields-of-view (FoVs) and occlusions, committing to unsafe or inefficient maneuvers. We propose a hierarchical navigation framework that integrates a Deep Transformer Q-Network (DTQN) as a high-level subgoal selector with a modular low-level controller for waypoint execution. The DTQN consumes short histories of task-aware features, encoding odometry, goal direction, obstacle proximity, and visibility cues, and outputs Q-values to rank candidate subgoals. Visibility-aware candidate generation introduces masking and exposure penalties, rewarding the use of cover and anticipatory safety. A low-level potential field controller then tracks the selected subgoal, ensuring smooth short-horizon obstacle avoidance. We validate our approach in 2D simulation and extend it directly to a 3D Unity–ROS environment by projecting point-cloud perception into the same feature schema, enabling transfer without architectural changes. Results show consistent improvements over classical planners and RL baselines in success rate, safety margins, and time-to-goal, with ablations confirming the value of temporal memory and visibility-aware candidate design. These findings highlight a generalizable framework for safe navigation under uncertainty, with broad relevance across robotic platforms.
|
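The abstract names a potential field controller as the low-level subgoal tracker; a standard attractive-plus-repulsive formulation looks like the sketch below, where all gains and the influence radius are illustrative values, not the paper's tuning:

```python
import math

def potential_step(pos, goal, obstacles, k_att=1.0, k_rep=0.5, d0=0.5, step=0.1):
    """One step of a potential-field controller: attractive pull toward the
    subgoal plus a repulsive push from each obstacle inside radius d0."""
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        d = math.dist(pos, (ox, oy))
        if 1e-9 < d < d0:
            # repulsion grows sharply as the robot nears the obstacle
            mag = k_rep * (1.0 / d - 1.0 / d0) / d ** 2
            fx += mag * (pos[0] - ox) / d
            fy += mag * (pos[1] - oy) / d
    n = math.hypot(fx, fy) or 1.0
    return (pos[0] + step * fx / n, pos[1] + step * fy / n)
```

In the hierarchy, the DTQN picks the subgoal and this controller only handles smooth short-horizon obstacle avoidance toward it.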
| |
| 15:00-16:30, Paper TuI2I.256 | Add to My Program |
| PIMBS: Efficient Body Schema Learning for Musculoskeletal Humanoids with Physics-Informed Neural Networks |
|
| Kawaharazuka, Kento | The University of Tokyo |
| Hattori, Takahiro | The University of Tokyo |
| Yoneda, Keita | The University of Tokyo |
| Okada, Kei | The University of Tokyo |
Keywords: Tendon/Wire Mechanism, Learning from Experience, Human and Humanoid Motion Analysis and Synthesis
Abstract: Musculoskeletal humanoids are robots that closely mimic the human musculoskeletal system, offering various advantages such as variable stiffness control, redundancy, and flexibility. However, their body structure is complex, and muscle paths often significantly deviate from geometric models. To address this, numerous studies have been conducted to learn body schema, particularly the relationships among joint angles, muscle tension, and muscle length. These studies typically rely solely on data collected from the actual robot, but this data collection process is labor-intensive, and learning becomes difficult when the amount of data is limited. Therefore, in this study, we propose a method that applies the concept of Physics-Informed Neural Networks (PINNs) to the learning of body schema in musculoskeletal humanoids, enabling high-accuracy learning even with a small amount of data. By utilizing not only data obtained from the actual robot but also the physical laws governing the relationship between torque and muscle tension under the assumption of correct joint structure, more efficient learning becomes possible. We apply the proposed method to both simulation and an actual musculoskeletal humanoid and discuss its effectiveness and characteristics.
|
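The PINN idea here is to add a physics residual, the torque-tension relationship through the moment arm, to the usual data-fit loss, so that sparse robot data is supplemented by the physical law. The sketch below uses a deliberately tiny linear length model and the standard scalar identity tau = -(dl/dtheta) * f; it illustrates the loss structure only, not the paper's network:

```python
def total_loss(params, data, lam=1.0):
    """PINN-style loss: data fit on muscle length plus a physics residual
    tying joint torque to muscle tension via the moment arm.

    `params` is a tiny linear stand-in for the length network;
    `data` is a list of (theta, tension, measured_length, measured_torque).
    """
    a, b = params

    def length(theta):
        return a * theta + b

    def moment_arm(theta, h=1e-4):
        # physics: moment arm is minus the derivative of length w.r.t. angle
        return -(length(theta + h) - length(theta - h)) / (2 * h)

    l_data = l_phys = 0.0
    for theta, tension, l_meas, tau_meas in data:
        l_data += (length(theta) - l_meas) ** 2          # supervised term
        tau_pred = moment_arm(theta) * tension           # tau = -dl/dtheta * f
        l_phys += (tau_pred - tau_meas) ** 2             # physics residual
    return l_data + lam * l_phys
```

The physics term constrains the model even at angles where no length measurement exists, which is what allows learning from small datasets.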
| |
| 15:00-16:30, Paper TuI2I.257 | Add to My Program |
| NaviGait: Navigating Dynamically Feasible Gait Libraries Using Deep Reinforcement Learning |
|
| Janwani, Neil | Georgia Institute of Technology |
| Madabushi, Varun | Georgia Institute of Technology |
| Tucker, Maegan | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Reinforcement learning (RL) has emerged as a powerful method to learn robust control policies for bipedal locomotion. Yet, it can be difficult to tune desired robot behaviors due to unintuitive and complex reward design. In comparison, trajectory optimization-based methods offer more tuneable, interpretable, and mathematically grounded motion plans for high-dimensional legged systems. However, these methods often remain brittle to real-world disturbances like external perturbations. In this work, we present NaviGait, a hierarchical framework that combines the structure of trajectory optimization with the adaptability of RL for robust and intuitive locomotion control. NaviGait leverages RL to synthesize new motions by selecting, minimally morphing, and stabilizing gaits taken from an offline-generated gait library. NaviGait results in walking policies that match the reference motion well while maintaining robustness comparable to other locomotion controllers. Additionally, the structure imposed by NaviGait drastically simplifies the RL reward composition. Our experimental results demonstrate that NaviGait enables faster training compared to conventional and imitation-based RL, and produces motions that remain closest to the original reference. Overall, by decoupling high-level motion generation from low-level correction, NaviGait offers a more scalable and generalizable approach for achieving dynamic and robust locomotion. Videos and the full framework are publicly available at https://dynamicmobility.github.io/navigait/.
|
| |
| 15:00-16:30, Paper TuI2I.258 | Add to My Program |
| Searching in Space and Time: Unified Memory-Action Loops for Open-World Object Retrieval |
|
| Chen, Taijing | The University of Texas at Austin |
| Kumar, Sateesh | The University of Texas at Austin |
| Xu, Junhong | The University of Texas at Austin |
| Pavlakos, Georgios | The University of Texas at Austin |
| Biswas, Joydeep | The University of Texas at Austin |
| Martín-Martín, Roberto | The University of Texas at Austin |
Keywords: Mobile Manipulation, AI-Enabled Robotics, Autonomous Agents
Abstract: Service robots must retrieve objects in dynamic, open-world settings where requests may reference attributes (“the red mug”), spatial context (“the mug on the table”), or past states (“the mug that was here yesterday”). Existing approaches capture only parts of this problem: scene graphs capture spatial relations but ignore temporal grounding, temporal reasoning methods model dynamics but do not support embodied interaction, and dynamic scene graphs handle both but remain closed-world with fixed vocabularies. We present STAR (SpatioTemporal Active Retrieval), a framework that unifies memory queries and embodied actions within a single decision loop. STAR leverages a non-parametric long-term memory and a working memory to support efficient recall, and uses a vision-language model to select either temporal or spatial actions at each step. We introduce STARBench, a benchmark of spatiotemporal object search tasks across simulated and real environments. Experiments on STARBench and on a Tiago robot show that STAR consistently outperforms scene-graph and memory-only baselines, demonstrating the benefits of treating search in time and search in space as a unified problem.
|
| |
| 15:00-16:30, Paper TuI2I.259 | Add to My Program |
| EgoMI: Learning Active Vision and Whole-Body Manipulation from Egocentric Human Demonstrations |
|
| Yu, Justin | University of California Berkeley |
| Shentu, Yide | University of California, Berkeley |
| Wu, Di | Xdof.ai |
| Abbeel, Pieter | UC Berkeley |
| Goldberg, Ken | UC Berkeley |
| Wu, Shiyao | University of California, Berkeley |
Keywords: Imitation Learning, Whole-Body Motion Planning and Control, Deep Learning in Grasping and Manipulation
Abstract: Imitation learning from human demonstrations offers a promising approach for robot skill acquisition, but egocentric human data introduces fundamental challenges due to the embodiment gap. During manipulation, humans actively coordinate head and hand movements, continuously reposition their viewpoint and use pre-action visual search strategies to locate task-relevant objects. These behaviors create dynamic, task-driven head motions that static robot sensing systems cannot replicate, leading to a significant distribution shift that degrades policy performance. We present EgoMI (Egocentric Manipulation Interface), a framework that captures synchronized end-effector and active head trajectories during manipulation tasks, resulting in data that can be retargeted to compatible semi-humanoid robot embodiments. To handle rapid and wide-spanning head viewpoint changes, we introduce a memory-augmented policy that selectively incorporates context from historical observations. We evaluate our approach on a bimanual robot equipped with an actuated camera head and find that policies with explicit head-motion modeling consistently outperform baseline methods. Results suggest that coordinated hand–eye learning with EgoMI effectively bridges the human-robot embodiment gap for robust imitation learning on semi-humanoid embodiments. Project page: https://egocentric-manipulation-interface.github.io
|
| |
| 15:00-16:30, Paper TuI2I.260 | Add to My Program |
| Feedback Matters: Augmenting Autonomous Dissection with Visual and Topological Feedback |
|
| Wang, Chung-Pang | University of California, San Diego |
| Chen, Changwei | University of California, San Diego |
| Liang, Xiao | University of California San Diego |
| Atar, Soofiyan | University of California San Diego |
| Richter, Florian | University of California, San Diego |
| Yip, Michael C. | University of California, San Diego |
Keywords: Surgical Robotics: Laparoscopy, Surgical Robotics: Planning, Medical Robots and Systems
Abstract: Autonomous surgical systems must adapt to highly dynamic environments where tissue properties and visual cues evolve rapidly. Central to such adaptability is feedback: the ability to sense, interpret, and respond to changes during execution. While feedback mechanisms have been explored in surgical robotics, ranging from tool and tissue tracking to error detection, existing methods remain limited in handling the topological and perceptual challenges of tissue dissection. In this work, we propose a feedback-enabled framework for autonomous tissue dissection that explicitly reasons about topological changes from endoscopic images after each dissection action. This structured feedback guides subsequent actions, enabling the system to localize dissection progress and adapt policies online. To improve the reliability of such feedback, we introduce visibility metrics that quantify tissue exposure and formulate optimal controller designs that actively manipulate tissue to maximize visibility. Finally, we integrate these feedback mechanisms with both control-based and learning-based dissection methods, and demonstrate experimentally that they significantly enhance autonomy, reduce errors, and improve robustness in complex surgical scenarios.
|
| |
| 15:00-16:30, Paper TuI2I.261 | Add to My Program |
| Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning |
|
| Narendra, Aditya | MBZUAI - Mohamed Bin Zayed University of AI |
| Maribjonov, Mukhammadrizo | Innopolis University |
| Makarov, Dmitry | Moscow Independent Research Institute of Artificial Intelligence |
| Yudin, Dmitry | Moscow Independent Research Institute of Artificial Intelligence |
| Panov, Aleksandr | Moscow Independent Research Institute of Artificial Intelligence |
Keywords: Reinforcement Learning, Semantic Scene Understanding, Machine Learning for Robot Control
Abstract: This paper introduces Knowledge-Guided Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. KG-M3PO couples a model-based policy optimization control backbone with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.
|
| |
| 15:00-16:30, Paper TuI2I.262 | Add to My Program |
| COLSON: Controllable Learning-Based Social Navigation Via Diffusion-Based Reinforcement Learning |
|
| Matsumoto, Kohei | Kyushu University |
| Tomita, Yuki | Kyushu University |
| Hyodo, Yuki | Kyushu University |
| Kurazume, Ryo | Kyushu University |
Keywords: Human-Aware Motion Planning, Reinforcement Learning, Motion and Path Planning
Abstract: Navigation of mobile robots in dynamic environments with pedestrian traffic poses a significant challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches, owing to their optimization capabilities. Among these methods, those assuming continuous action spaces typically use Gaussian distributions, limiting the flexibility of action generation. By contrast, the application of diffusion models to reinforcement learning has advanced, allowing more flexible action distributions than Gaussian policy-based approaches. In this study, we apply a diffusion-based reinforcement learning approach to social navigation and validate its effectiveness. Furthermore, using the characteristics of diffusion models, we propose extensions that allow adaptation to previously unseen scenarios without additional training. As concrete scenario examples, we show adaptability to scenarios in which static obstacles exist in an environment that was not present during training, as well as scenarios in which the objective differs from training, such as accompanying a target pedestrian while avoiding other pedestrians to reach a destination.
|
| |
| 15:00-16:30, Paper TuI2I.263 | Add to My Program |
| Visual Grounding Via Heterogeneous Representation Learning and Hierarchical Reasoning of Human-To-Vehicle Commands |
|
| Wang, Hao | University of Connecticut |
| He, Suining | University of Connecticut |
| Shin, Kang G. | University of Michigan |
Keywords: AI-Based Methods, Human-Centered Automation, Intelligent Transportation Systems
Abstract: With the proliferation of autonomous vehicles (AVs) and their increasing interaction and communication with the riders, how to ground or locate the visual objects of interest (OoIs), such as the concerned pedestrians and other traffic participants, based on the human riders’ natural language and communication (e.g., vocal commands), is essential for increasing the efficiency, effectiveness, and reliability/safety of AVs in following the riders’ reasonable commands and preferences. There are several technical challenges to achieve visual grounding for such human-to-vehicle commanding (HVC) scenes, including (1) how to fuse heterogeneous sensor modalities — i.e., visual object information, textual contexts, and situation awareness (say, obtained from light detection and ranging); (2) how to discern the opaque commands in the human natural language; and (3) how to reason about the relative positions of the OoIs within the visual modality. To meet these challenges, we propose VIGOR, a VIsual Grounding approach based on heterogeneous mOdality learning and hierarchical Reasoning for HVC scenes. First, we design a heterogeneous modality learning approach in order to incorporate the visual, textual, and situational modalities, and learn their cross-modality representations to identify important information for visual grounding. Then, VIGOR performs hierarchical reasoning at the object and context levels, and differentiates the OoIs in the complex traffic environments that relate to the natural language commands. Finally, we conduct extensive experimental studies on a total of 12,037 HVC scenes, demonstrating VIGOR to achieve higher accuracy than the state-of-the-art approaches (by 14.81% on average) in terms of the Intersection over Union (IoU) in grounding the OoIs in the complex (including low-visibility) HVC scenes.
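The abstract reports grounding accuracy in terms of Intersection over Union (IoU). For reference, a minimal sketch of the standard axis-aligned box-IoU computation (my own illustration, not VIGOR's implementation):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (clamped to zero width/height when boxes are disjoint).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

IoU ranges from 0.0 (disjoint boxes) to 1.0 (identical boxes); the paper's reported average gain is measured on this scale.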
|
| |
| 15:00-16:30, Paper TuI2I.264 | Add to My Program |
| Source-Only Cross-Weather LiDAR Via Geometry-Aware Point Drop |
|
| Cheong, YoungJae | Gachon University |
| An, Jhonghyun | Gachon University |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning Methods, Reinforcement Learning
Abstract: Adverse weather conditions, such as rain, snow, and fog, severely degrade LiDAR semantic segmentation by introducing refraction, scattering, and point dropouts that compromise geometric integrity. While prior approaches, ranging from weather simulation and mixing-based augmentation to domain randomization and regularization, enhance robustness, they frequently overlook structural vulnerabilities inherent to object boundaries, corners, and highly sparse regions. To address this limitation, we propose a Light Geometry-Aware Adapter. This module aligns azimuths and applies horizontal circular padding to preserve neighbor continuity across the 0°–360° wrap-around boundary. Using a local-window K-Nearest Neighbors (KNN) search, it aggregates nearby points and computes lightweight local statistics, compressing them into compact geometry-aware cues. During training, these cues facilitate region-aware regularization, which effectively stabilizes predictions in structurally fragile areas. The proposed adapter is designed to be plug-and-play, complements existing augmentation techniques, and operates exclusively during training, incurring negligible inference overhead. Operating under a rigorous source-only cross-weather paradigm wherein models are trained on SemanticKITTI and evaluated on SemanticSTF without target-domain labels or fine-tuning, our adapter achieves a +3.4 mIoU improvement over strong data-centric augmentation baselines. Furthermore, it demonstrates performance comparable to advanced class-centric regularization methods. These findings highlight that geometry-driven regularization constitutes a critical pathway toward achieving highly robust, all-weather LiDAR segmentation.
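To illustrate the kind of lightweight local statistics a KNN-based geometry module can extract, here is a minimal, self-contained sketch (the function name and the choice of mean/standard-deviation statistics are my assumptions for illustration, not the paper's design):

```python
import math

def knn_local_stats(points, k=3):
    """For each point, find its k nearest neighbors (brute force) and
    summarize local geometry with two cheap statistics: mean neighbor
    distance (local sparsity) and its standard deviation (irregularity)."""
    stats = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )[:k]
        mean = sum(dists) / len(dists)
        var = sum((d - mean) ** 2 for d in dists) / len(dists)
        stats.append((mean, math.sqrt(var)))
    return stats
```

In a real pipeline the brute-force search would be replaced by a windowed lookup over the range image, but the per-point cue (a small tuple of statistics) is the same idea.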
|
| |
| 15:00-16:30, Paper TuI2I.265 | Add to My Program |
| 3DGS-Holo-Inspector: A Mixed Reality UAV Controller with 3D Gaussian Splatting Localization for Infrastructure Inspection |
|
| Rizvi, Syed Muhammad Raza | University of Waterloo |
| Weng, Huaiyuan | University of Waterloo |
| Yeum, Chul Min | University of Waterloo |
Keywords: Aerial Systems: Perception and Autonomy, Localization, Virtual Reality and Interfaces
Abstract: Unmanned aerial vehicles (UAVs) are increasingly used for infrastructure inspection, but conventional joystick and first-person-view (FPV) controllers remain unintuitive, error-prone, and cognitively demanding, particularly in cluttered or safety-critical environments. We present 3DGS-Holo-Inspector, a Mixed Reality (MR) UAV controller that combines holographic goal-setting with autonomous UAV navigation. Using natural hand gestures, operators can define and preview navigation goals directly in MR before flight, ensuring precise and safe data capture at inspection viewpoints. The system complements existing inspection pipelines by leveraging pre-built 3D maps (e.g., photogrammetry or LiDAR reconstructions) to enable refinement of regions of interest (ROIs) where coverage is incomplete or the detail is insufficient. Robust headset–UAV alignment is achieved through a LiDAR–RGB 3D Gaussian Splatting (3DGS) localization backbone, which provides dense, markerless, and persistent spatial registration in both indoor and outdoor settings. Once goals are placed, the UAV autonomously navigates to the specified pose, with real-time telemetry and live video overlaid in MR to enhance situational awareness. Experimental validation using a ModalAI Starling UAV and Microsoft HoloLens 2 demonstrated accurate UAV-goal alignment, achieving a positional Root Mean Square Error (RMSE) of 0.090 m (median = 0.084 m) indoors and 0.119 m (median = 0.118 m) outdoors, with orientation (yaw) RMSEs of 1.491° (median = 1.400°) and 2.233° (median = 2.268°), respectively. These results confirm that 3DGS-Holo-Inspector provides reliable MR-based UAV control, augmenting inspection workflows by enabling safe, intuitive, and high-precision UAV operations in real-world environments.
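The alignment errors are summarized as RMSE and median. A minimal sketch of these two summary statistics (the error values below are hypothetical stand-ins, not the paper's raw trial data):

```python
import math
from statistics import median

def rmse(errors):
    """Root mean square error of a list of per-trial errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical per-trial indoor position errors in metres (illustrative only).
errors_m = [0.08, 0.09, 0.10]
print(f"RMSE = {rmse(errors_m):.4f} m, median = {median(errors_m):.2f} m")
```

RMSE weights large errors more heavily than the median, which is why the paper reports both: a gap between them hints at outlier trials.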
|
| |
| 15:00-16:30, Paper TuI2I.266 | Add to My Program |
| EB-MBD: Emerging-Barrier Model-Based Diffusion for Safe Trajectory Optimization in Highly Constrained Environments |
|
| Mishra, Raghav | University of Sydney |
| Manchester, Ian | University of Sydney |
Keywords: Constrained Motion Planning, Optimization and Optimal Control, Probabilistic Inference
Abstract: We propose enforcing constraints on Model-Based Diffusion by introducing emerging barrier functions inspired by interior point methods. We show that constraints on Model-Based Diffusion can lead to catastrophic performance degradation, even on simple 2D systems, due to sample inefficiency in the Monte Carlo approximation of the score function. We introduce Emerging-Barrier Model-Based Diffusion (EB-MBD), which uses progressively introduced barrier constraints to avoid these problems, significantly improving solution quality without expensive operations such as projections. We analyze the liveliness of samples at each iteration to inform the barrier parameter scheduling choice. We demonstrate results for 2D collision avoidance and a 3D underwater manipulator system and show that our method achieves lower-cost solutions than Model-Based Diffusion, and requires orders of magnitude less computation time than projection-based methods.
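For intuition, a minimal sketch of an interior-point-style log barrier whose weight is progressively tightened over iterations, in the spirit of the "emerging barrier" idea (the function names, geometric schedule, and parameters are illustrative assumptions, not the paper's formulation):

```python
import math

def barrier_cost(x, g, mu):
    """Log-barrier penalty for a constraint g(x) <= 0 with weight mu.
    Infeasible points get infinite cost; as mu -> 0 the penalty
    concentrates near the constraint boundary."""
    v = g(x)
    return float("inf") if v >= 0 else -mu * math.log(-v)

def mu_schedule(iteration, mu0=1.0, decay=0.5):
    """Emerging schedule: start with a weak barrier, tighten each iteration."""
    return mu0 * decay ** iteration
```

Starting with a large mu keeps early diffusion samples alive far from the boundary; shrinking it later sharpens feasibility, which mirrors the progressive introduction of constraints described above.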
|
| |
| 15:00-16:30, Paper TuI2I.267 | Add to My Program |
| Uncertainty Guided Exploratory Trajectory Optimization for Sampling-Based Model Predictive Control |
|
| Poyrazoglu, Oguzhan Goktug | University of Minnesota |
| Cao, Yukang | University of Minnesota |
| Mahesh, Rahul Moorthy | University of Minnesota - Twin Cities |
| Isler, Volkan | The University of Texas at Austin |
Keywords: Motion and Path Planning, Integrated Planning and Control, Constrained Motion Planning
Abstract: Trajectory optimization depends heavily on initialization. In particular, sampling-based approaches are highly sensitive to initial solutions, and limited exploration frequently leads them to converge to local minima in complex environments. We present Uncertainty Guided Exploratory Trajectory Optimization (UGE-TO), a trajectory optimization algorithm that generates well-separated samples to achieve a better coverage of the configuration space. UGE-TO represents trajectories as probability distributions induced by uncertainty ellipsoids. Unlike sampling-based approaches that explore only in the action space, this representation captures the effects of both system dynamics and action selection. By incorporating the impact of dynamics, in addition to the action space, into our distributions, our method enhances trajectory diversity by enforcing distributional separation via the Hellinger distance between them. It enables a systematic exploration of the configuration space and improves robustness against local minima. Further, we present UGE-MPC, which integrates UGE-TO into sampling-based model predictive control methods. Experiments demonstrate that UGE-MPC achieves higher exploration and faster convergence in trajectory optimization compared to baselines under the same sampling budget, achieving 72.1% faster convergence in obstacle-free environments and 66% faster convergence with a 6.7% higher success rate in the cluttered environment compared to the best-performing baseline. Additionally, we validate the approach through a range of simulation scenarios and real-world experiments. Our results indicate that UGE-MPC has higher success rates and faster convergence, especially in environments that demand significant deviations from nominal trajectories to avoid failures. The project and code are available at https://ogpoyrazoglu.github.io/cuniform_sampling/.
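The Hellinger distance used to enforce distributional separation has a closed form for Gaussians; a one-dimensional sketch (illustrative only; UGE-TO works with uncertainty ellipsoids, i.e., multivariate distributions):

```python
import math

def hellinger_gaussian(mu1, s1, mu2, s2):
    """Closed-form Hellinger distance between two 1-D Gaussians
    N(mu1, s1^2) and N(mu2, s2^2). Returns a value in [0, 1]."""
    h2 = 1.0 - math.sqrt(2.0 * s1 * s2 / (s1 ** 2 + s2 ** 2)) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * (s1 ** 2 + s2 ** 2))
    )
    return math.sqrt(h2)
```

Because the distance is bounded and symmetric, maximizing the pairwise Hellinger distance between candidate trajectory distributions is a well-behaved way to push samples apart.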
|
| |
| 15:00-16:30, Paper TuI2I.268 | Add to My Program |
| Embracing Bulky Objects with Humanoid Robots: Whole-Body Manipulation with Reinforcement Learning |
|
| Zheng, Chunxin | The Hong Kong University of Science and Technology(Guangzhou) |
| Chen, Kai | The Hong Kong University of Science and Technology |
| Bi, Zhihai | Hong Kong University of Science and Technology (Guangzhou) |
| Li, Yulin | Hong Kong University of Science and Technology(HKUST) |
| Pan, Liang | The University of Hong Kong |
| Zhou, Jinni | Hong Kong University of Science and Technology (Guangzhou) |
| Li, Haoang | Hong Kong University of Science and Technology (Guangzhou) |
| Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Multi-Contact Whole-Body Motion Planning and Control, Humanoid Robot Systems, Whole-Body Motion Planning and Control
Abstract: Whole-body manipulation (WBM) for humanoid robots presents a promising approach for executing embracing tasks involving bulky objects, where traditional grasping relying on end-effectors only remains limited in such scenarios due to inherent stability and payload constraints. This paper introduces a reinforcement learning framework that integrates a pre-trained human motion prior with a neural signed distance field (NSDF) representation to achieve robust whole-body embracing. Our method leverages a teacher-student architecture to distill large-scale human motion data, generating kinematically natural and physically feasible whole-body motion patterns. This facilitates coordinated control across the arms and torso, enabling stable multi-contact interactions that enhance the robustness in manipulation and also the load capacity. The embedded NSDF further provides accurate and continuous geometric perception, improving contact awareness throughout long-horizon tasks. We thoroughly evaluate the approach through comprehensive simulations and real-world experiments. The results demonstrate improved adaptability to diverse shapes and sizes of objects and also successful sim-to-real transfer. These indicate that the proposed framework offers an effective and practical solution for multi-contact and long-horizon WBM tasks of humanoid robots. The open-source project can be found at https://github.com/Chunx1nZHENG/Embracing-Bulky-Objects-with-Humanoid-Robots.
|
| |
| 15:00-16:30, Paper TuI2I.269 | Add to My Program |
| Retargeting Matters: General Motion Retargeting for Humanoid Motion Tracking |
|
| Araújo, João | Stanford University |
| Ze, Yanjie | Stanford University |
| Xu, Pei | Stanford University |
| Wu, Jiajun | Stanford University |
| Liu, Karen | Stanford University |
Keywords: Humanoid and Bipedal Locomotion, Simulation and Animation, Reinforcement Learning
Abstract: Humanoid motion tracking policies are central to building teleoperation pipelines and hierarchical controllers, yet they face a fundamental challenge: the embodiment gap between humans and humanoid robots. Current approaches address this gap by retargeting human motion data to humanoid embodiments and then training reinforcement learning (RL) policies to imitate these reference trajectories. However, artifacts introduced during retargeting, such as foot sliding, self-penetration, and physically infeasible motion, are often left in the reference trajectories for the RL policy to correct. While prior work has demonstrated motion tracking abilities, it often requires extensive reward engineering and domain randomization to succeed. In this paper, we systematically evaluate how retargeting quality affects policy performance when excessive reward tuning is suppressed. To address issues that we identify with existing retargeting methods, we propose a new retargeting method, General Motion Retargeting (GMR). We evaluate GMR alongside two open-source retargeters, PHC and ProtoMotions2, as well as with a high-quality closed-source dataset from Unitree. Using BeyondMimic for policy training, we isolate retargeting effects without reward tuning. Our experiments on a diverse subset of the LAFAN1 dataset reveal that while most motions can be tracked, artifacts in retargeted data significantly reduce policy robustness, particularly for dynamic or long sequences. GMR consistently outperforms existing open-source methods in both tracking performance and faithfulness to the source motion, achieving perceptual fidelity and policy success rates close to the closed-source baseline.
|
| |
| 15:00-16:30, Paper TuI2I.270 | Add to My Program |
| Unlocking the Potential of Soft Actor-Critic for Imitation Learning |
|
| Lessa, Nayari Marie | University of Bremen |
| Boukheddimi, Melya | DFKI GmbH |
| Kirchner, Frank | University of Bremen |
Keywords: Imitation Learning, Bioinspired Robot Learning, Reinforcement Learning
Abstract: Learning-based methods have enabled robots to acquire bio-inspired movements with increasing levels of naturalness and adaptability. Among these, Imitation Learning (IL) has proven effective in transferring complex motion patterns from animals to robotic systems. However, current state-of-the-art frameworks predominantly rely on Proximal Policy Optimization (PPO), an on-policy algorithm that prioritizes stability over sample efficiency and policy generalization. This paper proposes a novel IL framework that combines Adversarial Motion Priors (AMP) with the off-policy Soft Actor-Critic (SAC) algorithm to overcome these limitations. This integration leverages replay-driven learning and entropy-regularized exploration, enabling naturalistic behavior and task execution while improving data efficiency and robustness. We evaluate the proposed approach (AMP+SAC) on quadruped gaits involving multiple reference motions and diverse terrains. Experimental results demonstrate that the proposed framework not only maintains stable task execution but also achieves higher imitation rewards compared to the widely used AMP+PPO method. These findings highlight the potential of off-policy IL formulations for advancing motion generation in robotics. Code and supplementary material are available at: https://github.com/nayariml/AMP_SAC.git
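For context, the discriminator-to-reward mapping commonly used in AMP follows the least-squares form from the original AMP work; whether this exact form is used in this paper is an assumption:

```python
def amp_style_reward(d_score):
    """AMP-style imitation reward from a least-squares-GAN discriminator
    score d: r = max(0, 1 - 0.25 * (d - 1)^2). Scores near +1 (motion
    looks like the reference data) give reward near 1; scores near -1
    (clearly non-reference motion) give reward 0."""
    return max(0.0, 1.0 - 0.25 * (d_score - 1.0) ** 2)
```

In an AMP+SAC setup this reward would simply be stored in the replay buffer alongside the transition, letting the off-policy critic learn from it like any other reward signal.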
|
| |
| 15:00-16:30, Paper TuI2I.271 | Add to My Program |
| Towards Global Sparse and Partial Point Set Registration with Pose-Robust Completion for Computer-Assisted Orthopedic Surgery |
|
| Du, Xinzhe | Shandong University |
| Zhai, Yuxin | Shandong University |
| Ma, Shixing | Shandong University |
| Liu, Mingyang | Shandong University |
| Liu, Yi | The Second Hospital of Shandong University |
| Yin, Qingfeng | The Second Hospital of Shandong University |
| Song, Rui | Shandong University |
| Li, Yibin | Shandong University |
| Meng, Max Q.-H. | The Chinese University of Hong Kong |
| Min, Zhe | University College London |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: In computer-assisted orthopedic surgery (CAOS), accurately registering sparse and partial intraoperative point sets with a complete preoperative model remains highly challenging due to limited overlap, extreme sparsity, and point localisation noise. In this paper, we propose a novel end-to-end completion–registration framework to accurately register partial and sparse point sets in CAOS. First, we develop a three-branch network that separately encodes intraoperative pose and geometry, while extracting rotation-invariant geometric priors from the preoperative model in a canonical space. This structure-aware design provides strong and beneficial cues for completing missing regions using sparse and partial data. Second, to address the sensitivity of the completion to random input poses, the completion is specifically conducted in a canonical frame and a learned SE(3) transform maps the output back to the observed intraoperative space. Third, we introduce a probabilistic registration module based on a bidirectional hybrid mixture model that aligns the completed intraoperative and preoperative point sets in distribution space by jointly optimizing the source-to-target and target-to-source objectives, addressing density mismatch and geometric inconsistencies that may arise from completion. Finally, we present the individual loss formulations for both supervised and unsupervised learning paradigms, enabling robust end-to-end optimization of the entire pipeline. We systematically validate our approach on 1,757 femur, 1,301 hip, and 397 tibia models, as well as real-world phantom experiments. Our method achieves state-of-the-art performance under low overlap (15–30%), sparse observations (64–128 points), and large initial misalignments (up to [-180°, 180°] rotation and [-100, 100] mm translation), demonstrating strong robustness and generalization.
|
| |
| 15:00-16:30, Paper TuI2I.272 | Add to My Program |
| Run-Time Optimization of Overall Energy Consumption in Lightweight Collaborative Arms for Repetitive Tasks |
|
| Zarei, Ahmadreza | University of Turku |
| Shahsavari, Sajad | University of Turku |
| Plosila, Juha | University of Turku |
| Haghbayan, Hashem | University of Turku |
Keywords: Optimization and Optimal Control, Energy and Environment-Aware Automation, Manipulation Planning
Abstract: Lightweight industrial robots are increasingly deployed alongside humans to perform diverse and intelligent industrial tasks. A major concern with these robots is energy efficiency, driven by rising operational costs and environmental impacts. A growing contributor to energy use is the heavy computational workload of their electronic components. Although motion configurations and computational load are often interdependent, current state-of-the-art energy optimization methods tend to address them separately, focusing on individual consumption. In this work, we demonstrate that computational energy is comparable to mechanical energy and show how their dependency affects overall consumption in a Franka Emika Panda robot equipped with a multi-core processing system and two depth cameras. Building on this understanding, we propose a Bayesian approach for the joint optimization of mechanical motion and computational frequency in a robotic arm. Experiments show that the proposed method enables the Franka arm to reduce energy use by 3.7% in pick-and-place tasks and 6.2% in sorting tasks, compared to methods that optimize locomotion and computation separately.
|
| |
| 15:00-16:30, Paper TuI2I.273 | Add to My Program |
| LEGO: Latent-Space Exploration for Geometry-Aware Optimization of Humanoid Kinematic Design |
|
| Yoon, Jihwan | Korea University |
| Jeong, Taemoon | Korea University |
| Park, Jeongeun | Korea University |
| Kim, Chanwoo | Korea University |
| Kwon, Jaewoon | NAVER LABS |
| Lee, Yonghyeon | MIT |
| Lee, Kyungjae | Korea University |
| Choi, Sungjoon | Korea University |
Keywords: Mechanism Design, Methods and Tools for Robot System Design, Optimization and Optimal Control
Abstract: Designing robot morphologies and kinematics has traditionally relied on human intuition, with little systematic foundation. Motion–design co-optimization offers a promising path toward automation, but two major challenges remain: (i) the vast, unstructured design space and (ii) the difficulty of constructing task-specific loss functions. We propose a new paradigm that minimizes human involvement by (i) learning the design search space from existing mechanical designs, rather than hand-crafting it, and (ii) defining the loss directly from human motion data via motion retargeting and Procrustes analysis. Using screw-theory-based joint axis representation and isometric manifold learning, we construct a compact, geometry-preserving latent space of robot designs in which optimization is tractable. We then solve design optimization in this latent space using gradient-free optimization. Our approach establishes a principled framework for data-driven robot design and demonstrates that leveraging existing designs and human motion can effectively guide the automated discovery of novel robot designs.
|
| |
| 15:00-16:30, Paper TuI2I.274 | Add to My Program |
| AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis |
|
| Ye, Junjie | Toyota Research Institute; University of Southern California |
| Xue, Rong | University of Southern California |
| Van Hoorick, Basile | Toyota Research Institute |
| Tokmakov, Pavel | Toyota Research Institute |
| Irshad, Muhammad Zubair | Toyota Research Institute |
| Wang, Yue | University of Southern California |
| Guizilini, Vitor | Toyota Research Institute |
Keywords: Deep Learning in Grasping and Manipulation
Abstract: The collection of large-scale and diverse robot demonstrations remains a major bottleneck for imitation learning, as real-world data acquisition is costly and simulators offer limited diversity and fidelity with pronounced sim-to-real gaps. While generative models present an attractive solution, existing methods often alter only visual appearances without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on robot motion renderings, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot's kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling. Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real-world studies. These results suggest that grounding generative world models in robot motion provides a practical path toward scaling imitation learning.
|
| |
| 15:00-16:30, Paper TuI2I.275 | Add to My Program |
| Hybrid Rigid-Soft Robotic Gripper with Shape Adaptation, Uniform Force Distribution, and Self-Locking Capabilities |
|
| Chen, Xi | China Agricultural University |
| Wang, Yun | Beijing Academy of Agriculture and Forestry Sciences, Shanxi Agricultural University |
| Yang, Lichao | China Agriculture University |
| Li, Haitao | China Agricultural University |
| Xiong, Ya | Beijing Academy of Agriculture and Forestry Sciences |
|
|
| |
| 15:00-16:30, Paper TuI2I.276 | Add to My Program |
| Simplified Fabrication and Open-Loop Control of an Electromagnetic Insect-Scale Terrestrial Robot |
|
| Villamil, Julie | Cornell University |
| Li, Yaochen | Cornell University |
| Long, Jack | Cornell University |
| Karpelson, Michael | Harvard University |
| Helbling, E. Farrell | Cornell University |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Biologically-Inspired Robots
Abstract: Small-scale terrestrial robots have a number of applications where operation in confined spaces is required. Because of their low mass (less than five grams) and small size (less than five centimeters), their mechanical design requires careful analysis of multiple subsystems (e.g., actuation, power, fabrication, and assembly). Planar electromagnetic actuators show linear force-displacement behavior, large displacements, and low-voltage operation. Here, we integrate these actuators into the Cornell Micro Terrestrial Robot (COMT), a 1.9-g quadrupedal robot that uses a simplified fabrication strategy for the transmissions that takes advantage of the large displacement. Each leg is fabricated using laminate-based techniques, but requires only a single manual fold-and-lock step. The robot (BL = 3 cm) achieves speeds up to 4.36 BL/s and consumes approximately 300 mA during operation. These results provide a path towards untethered terrestrial robots that can navigate in confined spaces and enable future collectives through simplified manufacturing strategies.
|
| |
| 15:00-16:30, Paper TuI2I.277 |
| A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking |
|
| Zhang, Yuelin | The Chinese University of Hong Kong |
| Ding, Qingpeng | The Chinese University of Hong Kong |
| Tang, Longxiang | Tsinghua University |
| Fang, Chengyu | Tsinghua University |
| Cheng, Shing Shin | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Visual Servoing, Visual Tracking
Abstract: Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the large-scale vision backbone. To adapt the pretrained vision backbone for tracking tasks, a Tracking-Conditioning (TraCon) register is introduced for parameter-efficient feature conditioning. After needle tracking, an uncertainty-aware control policy and an asynchronous VLA pipeline are presented for adaptive needle insertion control, ensuring timely decision-making for improved safety and outcomes. Extensive experiments on both needle tracking and insertion show that our method consistently outperforms state-of-the-art trackers and manual operation, achieving higher tracking accuracy, improved insertion success rates, and reduced procedure time, highlighting promising directions for RUS-based intelligent intervention.
|
| |
| 15:00-16:30, Paper TuI2I.278 |
| OCT-DeformNet: Optical Coherence Tomography-Guided Biological Tissue Shape Prediction for Robot Palpation in Microsurgery |
|
| Ma, Guangshen | Duke University |
| Qin, Tianhao | Worcester Polytechnic Institute |
| Liu, Jiawei | University of Michigan |
| Draelos, Mark | University of Michigan |
Keywords: Computer Vision for Medical Robotics, Deep Learning for Visual Perception, Surgical Robotics: Planning
Abstract: In medical robotics, biological shape deformation resulting from arbitrary tool-tissue interaction commonly occurs and motivates the need in microsurgery to predict the new geometry of tissue structures. However, handling deformation is challenging due to the lack of a general prediction model for varied surgical scenarios, complex tissue properties, and myriad surgical tool geometries. Limited intraoperative sensing for observing micro-level deformations further compounds this difficulty. To solve this problem, this paper proposes the first geometric data-driven framework that uses only the robot palpation tooltip movement and a pre-deformed surface, sensed by optical coherence tomography (OCT), to predict tissue deformation. A neural network is trained to learn tool-tissue physics and predict the shape from given robot-tool configurations represented as orientations and displacements. We conducted realistic experiments to verify the model using phantoms of various stiffness and three ex vivo tissue types, with average prediction errors of approximately 0.15 mm and 0.52 mm, respectively. This framework provides a general platform for collecting micro-scale palpation data under OCT and can be generalized to soft-tissue studies in biomedical engineering and surgical robotics research.
|
| |
| 15:00-16:30, Paper TuI2I.279 |
| Automated Nerve Suturing Using Dual Arm Nanorobotic System Considering Needle Insertion Depth |
|
| Qin, Chao | ShanghaiTech University |
| Jiang, Yujie | ShanghaiTech University |
| Fu, Xiang | ShanghaiTech University |
| Zhong, Chengxi | ShanghaiTech University |
| Liu, Song | ShanghaiTech University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Nanomanufacturing
Abstract: Peripheral nerve injuries represent a significant clinical challenge in reconstructive surgery, traumatology, and neurosurgery, often leading to permanent sensorimotor deficits and diminished quality of life. Achieving precise epineurial suturing without nerve fascicle damage or tension thus remains a long-term aspiration for nerve repair. Yet current techniques, which rely mostly on direct suturing by surgeons, exhibit unavoidable tension and limited functional outcomes. To address these issues, this work proposes a dual-arm nanorobotic system for highly automated, precise, and repeatable nerve suturing. An optimized path-planning algorithm leverages epineurial thickness estimation to control needle insertion depth and the suturing trajectory. Owing to the natural advantages of nanorobotics and microscopy, the developed system can suture nerves of micron-scale diameter within a confined space. Ex-vivo experiments on three types of rabbit sciatic nerves demonstrated the effectiveness of the approach, with motion accuracies of 48 microns and 39 microns for the two arms. In-vivo experiments with anatomical and functional analyses further validated functional recovery, showing the potential for clinical translation.
|
| |
| 15:00-16:30, Paper TuI2I.280 |
| RPG: Robust Policy Gating for Smooth Multi-Skill Transitions in Humanoid Fighting |
|
| Xin, Yucheng | Tsinghua University |
| Bao, Jiacheng | Northwestern Polytechnical University; Shanghai Artificial Intelligence Laboratory |
| Dong, Yubo | Shanghai Jiao Tong University |
| Wang, Dong | Shanghai Artificial Intelligence Laboratory |
| Tan, Junbo | Tsinghua University |
| Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School at Shenzhen, Tsinghua University, 518055 Shenzhen, China |
| Zhao, Bin | Northwestern Polytechnical University |
| Li, Xuelong | China Telecom |
|
|
| |
| 15:00-16:30, Paper TuI2I.281 |
| TinyVPR: Distilling Correct and Confusing Knowledge for Lightweight Visual Place Recognition |
|
| Yang, Zhuochen | University of Chinese Academy of Sciences |
| Zuo, Runheng | Shanghai Jiao Tong University |
| Yang, Xu | Institute of Semiconductors |
| Dou, Runjiang | Institute of Semiconductors |
| Wang, Zhe | Institute of Semiconductors |
| Liu, Liyuan | Chinese Academy of Sciences |
| Yu, Shuangming | Institute of Semiconductors |
Keywords: Visual Learning, Vision-Based Navigation, Intelligent Transportation Systems
Abstract: Visual Place Recognition (VPR) is a key technology in autonomous driving, robotics, and augmented reality, requiring efficient and robust localization in large-scale environments. However, most existing methods rely on heavy deep models that are computationally expensive and difficult to deploy on edge devices, limiting their practical use. While model compression techniques such as compact model fine-tuning and traditional knowledge distillation have shown some promise, they often fall short in visual retrieval tasks. Inspired by the teaching principle that emphasizes both reinforcing correct knowledge and correcting errors, we propose an online positive-negative sample contrastive distillation framework. This approach enables the student model to learn more discriminative features by simultaneously distilling the relationships among positive and negative samples. We also design a cross-attention based feature alignment operator to better align intermediate feature representations between teacher and student models after feature extraction, improving feature consistency and distillation efficiency. Experimental results demonstrate that our method achieves a favorable trade-off between accuracy and efficiency on multiple visual localization benchmarks, significantly outperforming existing lightweight approaches in several scenarios. These advantages make it well-suited for deployment on resource-constrained edge devices.
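The positive-negative contrastive distillation described in the abstract above can be sketched as a relational distillation loss: the student's similarity distribution over a bank of positive and negative reference descriptors is pulled toward the teacher's. The abstract does not give the exact loss, so the softmax-KL form, the temperature `tau`, and the function names below are illustrative assumptions, not TinyVPR's published formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def pn_distill_loss(t_anchor, t_refs, s_anchor, s_refs, tau=0.1):
    """Relational positive-negative distillation (assumed form):
    KL divergence from the teacher's similarity distribution over
    the reference bank to the student's. Rows of *_refs are the
    positive and negative sample descriptors."""
    def sims(anchor, refs):
        a = anchor / np.linalg.norm(anchor)
        r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
        return softmax((r @ a) / tau)  # cosine similarities -> distribution
    p_t = sims(t_anchor, t_refs)
    p_s = sims(s_anchor, s_refs)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s + 1e-12))))
```

The loss is zero when the student reproduces the teacher's similarity relations exactly, and positive otherwise, which is the property the framework exploits to transfer both "correct" and "confusing" relational knowledge.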
|
| |
| 15:00-16:30, Paper TuI2I.282 |
| Impact of Different Failures on a Robot’s Perceived Reliability |
|
| Violette, Andrew | Cornell University |
| Wu, Zhanxin | Cornell University |
| Nishimura, Haruki | Toyota Research Institute |
| Itkina, Masha | Stanford University |
| Priebe Rocha, Leticia | Toyota Research Institute |
| Zolotas, Mark | Toyota Research Institute |
| Hoffman, Guy | Cornell University |
| Kress-Gazit, Hadas | Cornell University |
Keywords: Acceptability and Trust
Abstract: Robots fail, potentially leading to a loss in the robot’s perceived reliability (PR), a measure correlated with trustworthiness. In this study we examine how various kinds of failures affect the PR of the robot differently, and how this measure recovers without explicit social repair actions by the robot. In a preregistered and controlled online video study, participants were asked to predict a robot’s success in a pick-and-place task. We examined manipulation failures (slips), freezing (lapses), and three types of incorrect picked objects or place goals (mistakes). Participants were shown one of 11 videos—one of five types of failure, one of five types of failure followed by a successful execution in the same video, or a successful execution video. This was followed by two additional successful execution videos. Participants bet money either on the robot or on a coin toss after each video. People’s betting patterns along with a qualitative analysis of their survey responses highlight that mistakes are less damaging to PR than slips or lapses, and some mistakes are even perceived as successes. We also see that successes immediately following a failure have the same effect on PR as successes without a preceding failure. Finally, we show that successful executions recover PR after a failure. Our findings highlight which robot failures are in higher need of repair in a human-robot interaction, and how trust could be recovered by robot successes.
|
| |
| 15:00-16:30, Paper TuI2I.283 |
| Stroke-Based Variable-Damping with Force Attenuation for Capturing Large-Momentum Objects under Non-Zero Contact Velocity |
|
| Chen, Yang | Tsinghua University |
| Cao, Junda | Tsinghua University |
| Gong, Kai | Qiyuan Lab |
| Deng, Yang | Tsinghua University |
| Chen, Zhang | Tsinghua University |
| Zheng, Xudong | Qiyuan Lab |
| Hou, Zhili | Qiyuan Lab |
| Liang, Bin | Tsinghua University |
Keywords: Compliance and Impedance Control, Robust/Adaptive Control, Robot Safety
Abstract: Basketball players catch fast passes, and porters unload goods with apparent ease. These actions demonstrate how humans rely on intelligent regulation strategies to drive muscle activity. Replicating similar dynamic responses and strong impact absorption in robotics, however, remains a major challenge. Classical impedance control requires a trade-off between compliance and stability in high-impact interactions, which limits dynamic performance. To address this issue, this paper proposes a Stroke-based Variable Damping Model (SVDM), which adjusts the damping coefficient adaptively according to the position error relative to the contact point. In addition, a Force Attenuation (FA) strategy is applied to the external forces injected into SVDM, resulting in the SVDM with Force Attenuation (FA-SVDM). Based on human biomechanical principles, we fabricated a 4-DOF robotic manipulator using 3D printing technology. Using FA-SVDM, the manipulator successfully captured a 1 kg rigid sphere falling freely from 0.8 m. Under identical conditions, it exhibits superior performance compared to various fixed-damping configurations. We further developed a 6-DOF robotic manipulator equipped with a dexterous hand in the widely used MuJoCo engine, employing quadratic programming (QP) for pre-contact trajectory tracking and FA-SVDM for post-contact energy dissipation, ultimately achieving human-like compliant capture of high-momentum flying objects using a single arm with a half-prehensile strategy.
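The mechanism in the abstract above — damping that rises with the position error from the contact point, plus attenuated external forces fed into the interaction model — can be sketched as a one-dimensional admittance loop. The quadratic damping profile, the attenuation gain `alpha`, and every numeric constant here are invented for illustration; the paper's actual SVDM and FA equations are not given in the abstract.

```python
import numpy as np

def stroke_damping(x_err, d_min=5.0, d_max=80.0, stroke=0.15):
    """Illustrative stroke-based damping: low damping at first
    contact, rising smoothly as the position error approaches the
    available stroke (all constants are assumptions)."""
    ratio = np.clip(abs(x_err) / stroke, 0.0, 1.0)
    return d_min + (d_max - d_min) * ratio ** 2

def attenuated_force(f_ext, alpha=0.6):
    """Force attenuation: only a fraction of the measured external
    force is injected into the admittance model (alpha is assumed)."""
    return alpha * f_ext

def admittance_step(x, v, f_ext, m=1.0, dt=0.001):
    """One explicit-Euler step of m*a = f_att - d(x)*v, where x is
    the displacement from the contact point."""
    d = stroke_damping(x)
    a = (attenuated_force(f_ext) - d * v) / m
    v_new = v + a * dt
    return x + v_new * dt, v_new
```

Starting from an impact velocity and stepping this loop, the velocity is dissipated gently at first and ever more strongly as the stroke is used up, which is the qualitative behavior the variable-damping design targets.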
|
| |
| 15:00-16:30, Paper TuI2I.284 |
| AURA: Autonomous Upskilling with Retrieval-Augmented Agents |
|
| Zhu, Alvin | University of California Los Angeles |
| Tanaka, Yusuke | University of California, Los Angeles |
| Goldberg, Andrew | University of California Berkeley |
| Hong, Dennis | UCLA |
Keywords: Reinforcement Learning, Humanoid and Bipedal Locomotion, Machine Learning for Robot Control
Abstract: Designing reinforcement learning curricula for agile robots traditionally requires extensive manual tuning of reward functions, environment randomizations, and training configurations. We introduce AURA (Autonomous Upskilling with Retrieval-Augmented Agents), a schema-centric curriculum RL framework that leverages Large Language Models (LLMs) as autonomous designers of multi-stage curricula. AURA transforms user prompts into YAML workflows that encode full reward functions, domain randomization strategies, and training configurations. All files are statically validated before any GPU runtime, ensuring reliable and efficient execution with minimal human intervention. A retrieval-augmented feedback loop allows specialized LLM agents to design, execute, and refine curriculum stages based on prior training results stored in a vector database, enabling continual improvement over time. Quantitative experiments show that AURA consistently outperforms LLM-guided baselines in generation success rate, humanoid locomotion, and manipulation tasks. Ablation studies highlight the importance of retrieval for curriculum quality and convergence stability. AURA successfully trains end-to-end policies directly from user prompts and deploys them zero-shot on a custom humanoid robot across multiple environments, enabling robust locomotion on varied terrain and recovery from strong perturbations—capabilities that did not exist previously with manually designed controllers. By abstracting the complexity of curriculum design, AURA enables scalable and adaptive policy learning pipelines that would be complex to construct by hand.
|
| |
| 15:00-16:30, Paper TuI2I.285 |
| VSL-Skin: Individually Addressable Phase-Change Voxel Skin for Variable-Stiffness and Virtual Joints Bridging Soft and Rigid Robots |
|
| Zeng, Zihan Oliver | Purdue University |
| An, Jiajun | Purdue University |
| Luk, Preston | Purdue University |
| Kaur, Upinder | Purdue University |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: Soft robots exhibit compliance but lack load support and pose retention, while rigid robots provide structural capacity but sacrifice adaptability. Existing variable-stiffness approaches operate at segment or patch scales, preventing precise spatial control over stiffness distribution and virtual joint placement. This paper presents the Variable Stiffness Lattice Skin (VSL-Skin), the first system enabling individually addressable voxel-level morphological control with millimeter-scale precision. The system achieves three unprecedented capabilities: nearly two orders of magnitude stiffness modulation across axial (15-1200 N/mm), shear (45-850 N/mm), bending (8×10^2-3×10^4 N/deg), and torsional modes with millimeter-scale spatial control; the first demonstrated 30% axial compression in phase-change systems while maintaining structural integrity; and autonomous component-level self-repair through thermal cycling that eliminates fatigue accumulation and enables programmable sacrificial joints for predictable failure management. Selective voxel activation creates six canonical virtual joint types with programmable compliance while preserving structural integrity in non-activated regions. The platform incorporates closed-form design models and finite element analysis for predictive synthesis of stiffness patterns and joint placement. Experimental validation demonstrates 30% axial contraction, thermal switching in 30-45 second cycles, and cut-to-fit integration that preserves addressability after trimming. The row-column architecture enables platform-agnostic deployment across diverse robotic systems without specialized infrastructure. This framework establishes morphological intelligence as an engineerable system property, fundamentally advancing autonomous reconfigurable robotics.
|
| |
| 15:00-16:30, Paper TuI2I.286 |
| AMPLIFY: Actionless Motion Priors for Robot Learning from Videos |
|
| Collins, Jeremy | Georgia Institute of Technology |
| Cheng, Lorand | Georgia Institute of Technology |
| Aneja, Kunal | Georgia Institute of Technology |
| Wilcox, Albert | Georgia Institute of Technology |
| Joffe, Benjamin | Georgia Institute of Technology |
| Garg, Animesh | Georgia Institute of Technology |
Keywords: Representation Learning, Imitation Learning, Machine Learning for Robot Control
Abstract: Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate our dynamics model achieves over 2× better point track prediction accuracy compared to the prior state-of-the-art. In downstream policy learning, our dynamics predictions enable a 1.2-2.2× success rate improvement in low-data regimes, a 1.4× average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks with zero in-distribution action data. Beyond robotic control, we find the latent dynamics learned by AMPLIFY to enhance video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models.
|
| |
| 15:00-16:30, Paper TuI2I.287 |
| NeuroVLA: Surgical Scenario-Aware Learning of Debulking Skills in Endoscopic Robotic Neurosurgery Via Vision-Language-Action Model |
|
| Fang, Zhiwei | The Chinese University of Hong Kong |
| Ng, Chi Kit | The Chinese University of Hong Kong |
| Gao, Huxin | The Chinese University of Hong Kong |
| Zhang, Tao | Chinese University of Hong Kong |
| Tang, Zhiqing | The Chinese University of Hong Kong |
| Chan, Tat-Ming | Prince of Wales Hospital |
| Liu, Hongbin | Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences |
| Wang, Renzhi | The Chinese University of Hong Kong, Shenzhen |
| Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore (NUS) |
|
|
| |
| 15:00-16:30, Paper TuI2I.288 |
| Efficient Active Search Via Amortized Path-Integral Policies |
|
| Gupta, Tejus | Carnegie Mellon University |
| Verma, Arsh | Carnegie Mellon University |
| Song, Raymond | Carnegie Mellon University |
| Guttendorf, David | Carnegie Mellon University |
| Igoe, Conor | Carnegie Mellon University |
| Navarro-Serment, Luis E. | Carnegie Mellon University |
| Schneider, Jeff | Carnegie Mellon University |
Keywords: Search and Rescue Robots, Field Robots, Planning under Uncertainty
Abstract: This work presents amortized path-integral policies that enable efficient and real-time active search for robotic systems. We model search as an active sensing problem where agents select actions to maximize information about target locations. Unlike previous approaches that only consider information gain at final waypoints, our method accounts for observations along entire paths. To address the computational expense of path-integral policies, we amortize costs through Graph Neural Network (GNN) policies trained via behavior cloning. GNNs provide equivariance to spatial transformations and generalize across diverse maps. We validate our approach through field experiments in a 75,000 m² forested environment using an autonomous ground vehicle, along with simulated testing. Our experiments demonstrate successful policy amortization, cross-map transfer, and improved search efficiency.
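The contrast the abstract above draws — scoring observations along the entire path rather than only at the final waypoint — can be illustrated on a grid belief: the path objective sums the entropy of every cell seen anywhere along the route. The perfect-sensor assumption, the square field of view, and the binary-entropy belief are simplifications for this sketch, not the paper's model.

```python
import numpy as np

def entropy(p):
    """Binary entropy (nats) of per-cell target-presence beliefs."""
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def path_information_gain(belief, path, fov=1):
    """Score a candidate path by the current entropy of every cell
    observed anywhere along it (a perfect-sensor proxy for expected
    information gain). An endpoint-only objective would score just
    the final waypoint's footprint."""
    seen = set()
    rows, cols = belief.shape
    for r, c in path:
        for dr in range(-fov, fov + 1):
            for dc in range(-fov, fov + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    seen.add((rr, cc))
    h = entropy(belief)
    return sum(h[rc] for rc in seen)
```

Under this objective a long sweep through unexplored cells strictly dominates teleport-style scoring of its endpoint, which is the gap the path-integral policy closes; the GNN in the paper amortizes the cost of evaluating such path scores.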
|
| |
| 15:00-16:30, Paper TuI2I.289 |
| Efficient and Versatile Quadrupedal Skating: Optimal Co-Design Via Reinforcement Learning and Bayesian Optimization |
|
| Wang, Hanwen | University of Wisconsin, Madison |
| Fang, Zhenlong | University of Minnesota, Twin Cities |
| Hanna, Josiah | University of Wisconsin -- Madison |
| Xiong, Xiaobin | Shanghai Innovation Institute |
Keywords: Legged Robots, Reinforcement Learning, Optimization and Optimal Control
Abstract: In this paper, we present a hardware-control co-design approach that enables efficient and versatile roller skating on quadrupedal robots equipped with passive wheels. Passive-wheel skating reduces leg inertia and improves energy efficiency, particularly at high speeds. However, the absence of direct wheel actuation tightly couples mechanical design and control. To unlock the full potential of this modality, we formulate a bilevel optimization framework: an upper-level Bayesian Optimization searches the mechanical design space, while a lower-level Reinforcement Learning trains a motor control policy for each candidate design. The resulting design-policy pairs not only outperform human-engineered baselines, but also exhibit versatile behaviors such as hockey stop (rapid braking by turning sideways to maximize friction) and self-aligning motion (automatic reorientation to improve energy efficiency in the direction of travel), offering the first system-level study of dynamic skating motion on quadrupedal robots.
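The bilevel structure described in the abstract above — an outer search over mechanical designs, each evaluated by training a controller from scratch — can be sketched in a dependency-free toy form. Random search stands in for Bayesian Optimization, hill climbing stands in for RL, and the one-parameter reward landscape is invented purely for illustration; none of this is the paper's actual setup.

```python
import random

def train_policy(design, iters=200):
    """Inner level: hill climbing stands in for RL training,
    returning the best reward reachable with the given design
    parameter. The landscape (optimum near design = 0.3) is an
    invented stand-in."""
    theta, best = 0.0, float("-inf")
    for _ in range(iters):
        cand = theta + random.uniform(-0.5, 0.5)
        reward = -(cand - design) ** 2 - 0.1 * (design - 0.3) ** 2
        if reward > best:
            best, theta = reward, cand
    return best

def co_design(n_designs=30, seed=0):
    """Outer level: search the design space, keeping the design whose
    trained policy scores highest. Random search stands in for
    Bayesian Optimization to keep the sketch self-contained."""
    random.seed(seed)
    best_design, best_reward = None, float("-inf")
    for _ in range(n_designs):
        design = random.uniform(0.0, 1.0)
        reward = train_policy(design)
        if reward > best_reward:
            best_design, best_reward = design, reward
    return best_design, best_reward
```

The key coupling is that the outer objective is only observable after the inner optimization finishes, which is exactly why the paper treats each design-policy pair as one expensive black-box evaluation for the upper-level optimizer.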
|
| |
| 15:00-16:30, Paper TuI2I.290 |
| SToRM: Supervised Token Reduction for Multi-Modal LLMs Toward Efficient End-To-End Autonomous Driving |
|
| Kim, Seo Hyun | Sungkyunkwan University |
| Park, Jin Bok | Sungkyunkwan University |
| Koo, Do Yeon | Sungkyunkwan University |
| Park, Hogun | Sungkyunkwan University |
| Chun, Il Yong | Sungkyunkwan University |
Keywords: Deep Learning Methods, AI-Based Methods, Machine Learning for Robot Control
Abstract: In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe autonomous driving in unexpected scenarios, one may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) in autonomous driving facilitates human-vehicle interactions and may improve driving performance in unexpected scenarios. However, this approach requires substantial computational resources due to its reliance on an LLM and on many visual tokens from sensor inputs, and such resources are inherently limited in autonomous vehicles. Many MLLM studies have explored reducing the number of visual tokens, but most approaches exhibit some end-task performance degradation compared to using all tokens. For efficient E2E driving while maintaining driving performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed SToRM framework consists of three key elements. First, we propose a lightweight importance predictor with short-term sliding windows that predicts the importance scores of visual tokens. Second, we propose a supervised learning approach for the importance predictor that uses an auxiliary path to obtain pseudo-supervision signals from an all-token pass through the LLM. Third, guided by the predicted importance scores, we propose an anchor-context merging module that partitions tokens into "anchor" and "context" tokens, then merges the latter into their most relevant anchors to reduce redundancy while minimizing information loss. Experiments with the LangAuto benchmark dataset show that the proposed SToRM outperforms state-of-the-art E2E driving MLLMs under an equal reduced-token budget and maintains all-token performance while substantially reducing computational cost, by up to 30×.
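The anchor-context merging step described in the abstract above can be sketched as follows: importance scores split the visual tokens into anchors and context tokens, and each context token is folded into its most similar anchor. The running-average merge rule and the choice to measure cosine similarity against the original (pre-merge) anchors are assumptions of this sketch, not SToRM's published module.

```python
import numpy as np

def anchor_context_merge(tokens, scores, n_anchors):
    """Partition tokens (rows) into high-importance anchors and
    remaining context tokens, then fold each context token into its
    most similar anchor via a running average. Similarity is taken
    against the original anchors (an assumption of this sketch)."""
    order = np.argsort(scores)[::-1]            # descending importance
    anchor_idx, context_idx = order[:n_anchors], order[n_anchors:]
    anchors = tokens[anchor_idx].astype(float).copy()
    counts = np.ones(n_anchors)
    normed = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    for i in context_idx:
        t = tokens[i] / np.linalg.norm(tokens[i])
        j = int(np.argmax(normed @ t))          # nearest anchor (cosine)
        anchors[j] = (anchors[j] * counts[j] + tokens[i]) / (counts[j] + 1)
        counts[j] += 1
    return anchors
```

The output has `n_anchors` rows regardless of the input length, which is what gives the LLM a fixed, reduced token budget while each surviving token still summarizes the context tokens merged into it.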
|
| |
| 15:00-16:30, Paper TuI2I.291 |
| Safe and Optimal Variable Impedance Control Via Certified Reinforcement Learning |
|
| Kumar, Shreyas | Indian Institute of Science Bengaluru |
| Prakash, Ravi | Indian Institute of Science |
Keywords: Reinforcement Learning, Compliance and Impedance Control, Safety in HRI
Abstract: Reinforcement learning (RL) offers a powerful approach for robots to learn complex, collaborative skills by combining Dynamic Movement Primitives (DMPs) for motion and Variable Impedance Control (VIC) for compliant interaction. However, this model-free paradigm often risks instability and unsafe exploration due to the time-varying nature of impedance gains. This work introduces Certified Gaussian-Manifold Sampling (C-GMS), a novel trajectory-centric RL framework that learns combined DMP and VIC policies while guaranteeing Lyapunov stability and actuator feasibility by construction. Our approach reframes policy exploration as sampling from a mathematically defined manifold of stable gain schedules. This ensures every policy rollout is guaranteed to be stable and physically realizable, thereby eliminating the need for reward penalties or post-hoc validation. Furthermore, we provide a theoretical guarantee that our approach ensures bounded tracking error even in the presence of bounded model errors and deployment-time uncertainties. We demonstrate the effectiveness of C-GMS in simulation and verify its efficacy on a real robot, paving the way for reliable autonomous interaction in complex environments.
|
| |
| 15:00-16:30, Paper TuI2I.292 |
| Path Planning for Four-Wheel Steering Forklifts in Constrained Spaces: A State-Embedded Hamiltonian Fast Marching Approach |
|
| Pascal, Julien | Institut Pascal, UMR 6602 CNRS, Université Clermont Auvergne |
| Mirebeau, Jean-Marie | Centre Borelli, ENS Paris-Saclay, UMR 9010 CNRS, University Paris-Saclay |
| Thuilot, Benoit | Clermont-Ferrand University |
| Checchin, Paul | Institut Pascal - UMR 6602 CNRS |
|
|
| |
| 15:00-16:30, Paper TuI2I.293 |
| Reinforcement Learning for Active Perception in Autonomous Navigation |
|
| Malczyk, Grzegorz | NTNU - Norwegian University of Science and Technology |
| Kulkarni, Mihir | NTNU: Norwegian University of Science and Technology |
| Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Aerial Systems: Perception and Autonomy, Reinforcement Learning
Abstract: This paper addresses the challenge of active perception within autonomous navigation in complex, unknown environments. Revisiting the foundational principles of active perception, we introduce an end-to-end reinforcement learning framework in which a robot must not only reach a goal while avoiding obstacles, but also actively control its onboard camera to enhance situational awareness. The policy receives observations comprising the robot state, the current depth frame, and a local geometry representation built from a short history of depth readings. To couple collision-free motion planning with information-driven active camera control, we augment the navigation reward with a voxel-based information metric. This enables an aerial robot to learn a robust policy that balances goal-directed motion with exploratory sensing. Extensive evaluation demonstrates that our strategy achieves safer flight than fixed, non-actuated camera baselines while also inducing intrinsic exploratory behaviors.
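The reward augmentation described in the abstract above can be sketched as a per-step sum of goal progress and an information bonus for voxels observed for the first time. The specific weights, the first-visit counting rule, and the function name are assumptions for illustration; the paper's exact metric is not given in the abstract.

```python
def step_reward(dist_prev, dist_now, seen, observed, w_goal=1.0, bonus=0.05):
    """Combined navigation reward: progress toward the goal plus a
    voxel-based information bonus for voxels observed for the first
    time this step. `seen` is the persistent set of already-observed
    voxel indices and is updated in place (weights are assumptions)."""
    new = set(observed) - seen
    seen |= new
    return w_goal * (dist_prev - dist_now) + bonus * len(new)
```

Because the bonus vanishes for re-observed voxels, the camera is incentivized to point at unseen geometry even when the body is flying a goal-directed path, which is the coupling the framework exploits.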
|
| |
| 15:00-16:30, Paper TuI2I.294 |
| AMG-SLAM: Adaptive Monocular Gaussian SLAM for Efficient Surface Reconstruction |
|
| Pan, Youqi | Peking University |
| Zhou, Wugen | Peking University |
| Zha, Hongbin | Peking University |
Keywords: SLAM, Localization, Mapping
Abstract: Dense SLAM with a monocular camera remains a highly challenging task. In this paper, we present AMG-SLAM, a novel dense monocular SLAM system that tightly couples sparse tracking with dense Gaussian mapping to achieve fully online and high-quality surface reconstruction. In the frontend, learning-based modules enable efficient pose tracking and Gaussian proposal with sparse depth initialization. Specifically, we propose a fidelity-aware Gaussian proposal strategy that adaptively adds new Gaussians based on reconstruction completeness, effectively avoiding redundancy. In the backend, we propose a focus-and-balance online refinement strategy, which adaptively selects under-optimized Gaussians for focused refinement while ensuring globally balanced optimization by maximizing scene view coverage. We evaluated our method on synthetic and real-world datasets, including Replica, ScanNet, and EuRoC. Thanks to efficient system coupling and adaptive Gaussian proposal and refinement, our system achieves trajectory accuracy, rendering precision, and geometric accuracy comparable to or exceeding current state-of-the-art methods, while also demonstrating high efficiency.
|
| |
| 15:00-16:30, Paper TuI2I.295 |
| Scalable Vision-Language-Action Model Pretraining for Robotic Dexterous Manipulation with Real-Life Human Activity Videos |
|
| Li, Qixiu | Tsinghua University |
| Deng, Yu | Microsoft Research Asia |
| Liang, Yaobo | Microsoft Research Asia |
| Luo, Lin | Microsoft Research Asia |
| Zhou, Lei | Microsoft Research Asia |
| Yao, Chengtang | Microsoft Research Asia |
| Zeng, Lingqi | Microsoft Research Asia |
| Feng, ZhiYuan | Tsinghua University |
| Liang, Huizhi | Tsinghua University |
| Xu, Sicheng | Microsoft Research Asia |
| Zhang, Yizhong | Microsoft Research Asia |
| Chen, Xi | Microsoft Research Asia |
| Chen, Hao | Microsoft Research Asia |
| Sun, Lily | Microsoft Research Asia |
| Chen, Dong | Microsoft Research Asia |
| Yang, Jiaolong | Microsoft Research Asia |
| Guo, Baining | Microsoft Research Asia |
Keywords: Imitation Learning, Big Data in Robotics and Automation, Deep Learning in Grasping and Manipulation
Abstract: This paper presents an approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic VLA training data in terms of task granularity and labels. This is achieved through a fully automated, holistic human activity analysis approach for arbitrary human hand videos. This approach generates atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We also design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We believe this work establishes a promising pathway for scaling up VLA pretraining.
|
| |
| 15:00-16:30, Paper TuI2I.296 | Add to My Program |
| Collaborative Learning of Local 3D Occupancy Prediction and Versatile Global Occupancy Mapping |
|
| Yuan, Shanshuai | Fudan University |
| Wei, Julong | Fudan University |
| Tie, Muer | Fudan University |
| Ren, Xiangyun | Chongqing Changan Automobile CO., Ltd |
| Gan, Zhongxue | Fudan University |
| Ding, Wenchao | Fudan University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: Vision-based 3D semantic occupancy prediction is vital for autonomous driving, enabling unified modeling of static infrastructure and dynamic agents. Global occupancy maps serve as long-term memory priors, providing valuable historical context that enhances local perception. This is particularly important in challenging scenarios such as occlusion or poor illumination, where current and nearby observations may be unreliable or incomplete. Priors aggregated from previous traversals under better conditions help fill gaps and enhance the robustness of local 3D occupancy prediction. In this paper, we propose Long-term Memory Prior Occupancy (LMPOcc), a plug-and-play framework that incorporates global occupancy priors to boost local prediction and simultaneously updates global maps with new observations. To realize the information gain from global priors, we design an efficient and lightweight Current-Prior Fusion module that adaptively integrates prior and current features. Meanwhile, we introduce a model-agnostic prior format to enable continual updating of global occupancy and ensure compatibility across diverse prediction baselines. LMPOcc achieves state-of-the-art local occupancy prediction performance validated on the Occ3D-nuScenes benchmark, especially on static semantic categories. Furthermore, we verify LMPOcc's capability to build large-scale global occupancy maps through multi-vehicle crowdsourcing, and utilize occupancy-derived dense depth to support the construction of 3D open-vocabulary maps. Our method opens up a new paradigm for continuous global information updating and storage, paving the way towards more comprehensive and scalable scene understanding in large outdoor environments.
|
| |
| 15:00-16:30, Paper TuI2I.297 | Add to My Program |
| Efficient Trajectory-Conditioned Text-To-4D Gaussian Splatting |
|
| Shao, Lin | Tongji University |
| Lu, Fan | Tongji University |
| Wei, Haiyun | Tongji University |
| Qu, Sanqing | Tongji University |
| Knoll, Alois | Tech. Univ. Muenchen TUM |
| Chen, Guang | Tongji University |
Keywords: Simulation and Animation
Abstract: Recent text-to-4D generation methods have achieved remarkable progress thanks to advances in text-to-video models. Existing approaches typically reconstruct 4D scenes from generated videos or distill them from pre-trained text-to-video models. However, these methods often restrict the scene to a local region or lack spatial controllability. TC4D pioneered trajectory-controllable 4D asset generation by decomposing motion into global transformation and local deformation. While it achieves high visual quality, TC4D suffers from extremely low generation efficiency due to its NeRF-based framework. To overcome this limitation, we propose Efficient TC4DGS, which replaces NeRF with 4D Gaussian Splatting (4DGS) to significantly improve efficiency. Nevertheless, the discrete representation of 4DGS makes optimization challenging, leading to noticeable degradation in visual and motion quality. Thus, we propose a HexPlane-based 4D representation combined with a key-node control scheme. By computing the deformation only for the control nodes and obtaining the overall deformation through interpolation, we greatly improve generation efficiency while maintaining quality. Compared with TC4D, the previous SOTA, we improve generation efficiency by 13× (reducing the generation time from 26 hours to 2 hours), while also achieving superior performance in terms of the dynamic quality of the generated objects.
|
| |
| 15:00-16:30, Paper TuI2I.298 | Add to My Program |
| MOGS: Monocular Object-Guided Gaussian Splatting in Large Scenes |
|
| Zhang, Shengkai | Wuhan University of Technology |
| Liu, Yuhe | Wuhan University of Technology |
| He, Jianhua | University of Essex |
| Xiao, Xuedou | Wuhan University of Technology |
| Chen, Mozi | Wuhan University of Technology |
| Liu, Kezhong | Wuhan University of Technology |
Keywords: Computer Vision for Automation, Mapping, Visual-Inertial SLAM
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) deliver striking photorealism, and extending it to large scenes opens new opportunities for semantic reasoning and prediction in applications such as autonomous driving. Today's state-of-the-art systems for large scenes primarily originate from LiDAR-based pipelines that utilize long-range depth sensing. However, they require costly high-channel sensors whose dense point clouds strain memory and computation, limiting scalability, fleet deployment, and optimization speed. We present MOGS, a monocular 3DGS framework that replaces active LiDAR depth with object-anchored, metrized dense depth derived from sparse visual-inertial (VI) structure-from-motion (SfM) cues. Our key idea is to exploit image semantics to hypothesize per-object shape priors, anchor them with sparse but metrically reliable SfM points, and propagate the resulting metric constraints across each object to produce dense depth. To address two key challenges, i.e., insufficient SfM coverage within objects and cross-object geometric inconsistency, MOGS introduces 1) a multi-scale shape consensus module that adaptively merges small segments into coarse objects best supported by SfM and fits them with parametric shape models, and 2) a cross-object depth refinement module that optimizes per-pixel depth under a combinatorial objective combining geometric consistency, prior anchoring, and edge-aware smoothness. Experiments on public datasets show that, with a low-cost VI sensor suite, MOGS reduces training time by up to 30.4% and memory consumption by 19.8%, while achieving high-quality rendering competitive with costly LiDAR-based approaches in large scenes. The source code is publicly available at https://github.com/ClarenceZSK/MOGS/.
|
| |
| 15:00-16:30, Paper TuI2I.299 | Add to My Program |
| UAV-SAR: Simultaneous Radar-Based Odometry and Synthetic-Array Sensing for Unmanned Aerial Vehicles |
|
| Hunt, David | Duke University |
| Luo, Shaocheng | Duke University |
| Rivera, Samuel | Duke University |
| Prakash, Aarav | Durham Academy |
| Morris, Cameron | Durham Academy |
| Chen, Tingjun | Duke University |
| Pajic, Miroslav | Duke University |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Applications, Range Sensing
Abstract: Unmanned aerial vehicles (UAVs) require accurate odometry—i.e., estimating the position and velocity of the vehicle over time—as well as high-resolution sensing to safely and effectively operate in complex environments. Traditionally, GPS, cameras, and/or lidar sensors have been used to perform these functions. However, GPS can be jammed in contested environments while cameras and lidars fail in visually degraded conditions, limiting UAV operations in these scenarios. In this work, we present UAV-SAR, a unified architecture that utilizes mmWave radars to simultaneously achieve precise odometry measurements and perform high-resolution synthetic-array sensing. Here, UAV-SAR measures a UAV’s altitude and velocity from downward- and outward-facing radars and fuses these measurements within a commercially available flight controller to produce accurate odometry estimates. These odometry estimates are then used to dynamically construct synthetic arrays by coherently integrating multiple radar frames together over a duration of 0.5 s, improving the angular resolution by an order of magnitude compared to the physical array alone. Finally, a lightweight deep learning model is utilized to convert high-resolution range-angle responses into 2D point clouds suitable for downstream perception tasks. UAV-SAR is validated on a custom UAV prototype where it is integrated with ROS2 and the PX4 autopilot to demonstrate stable flight, reliable odometry, and high-resolution radar sensing in indoor environments.
|
| |
| 15:00-16:30, Paper TuI2I.300 | Add to My Program |
| A Multi-String Traversing Violin-Playing Robot for Carnatic Music |
|
| Sankaranarayanan, Raghavasimhan | Georgia Institute of Technology |
| Weinberg, Gil | Georgia Institute of Technology |
Keywords: Art and Entertainment Robotics, Human-Robot Collaboration, Human Performance Augmentation
Abstract: Over the past several decades, robotic musicianship researchers have mainly focused on Western music with only limited efforts addressing musical styles from other regions, such as South Indian Classical music (a.k.a. Carnatic music) - a music form popular in the southern part of India. In this work, we present Hathaani v2, a robotic system capable of performing Carnatic music on the violin. The robot is designed to translate pitch information into left-hand finger placement and amplitude information into bowing changes and dynamics, based on any monophonic audio recording. The left-hand mechanism is capable of reaching arbitrary finger positions along the strings, allowing the robot to play gamakas - continuous pitch ornamentations that are fundamental to Carnatic music. The differential bowing mechanism provides both pressure and angle modulation, while maintaining mechanical rigidity and allowing visual engagement for the audience. We assessed the system’s ability to perform gamakas through expert listening studies involving ten professional musicians on intonation, timbre, quality of bowing, hand coordination, gamaka authenticity, and clarity. The proposed robot outperforms the baseline on all of the evaluated parameters, achieving average scores exceeding 4 on a 5-point Likert scale (0.5 increments). This work has the potential to transform education and production of Carnatic music by offering programmatic solutions that support complex gamakas. Compared to software-based emulations, this physical violin-playing robot offers an accurate and expressive medium for conveying the nuances of Carnatic music performance.
|
| |
| 15:00-16:30, Paper TuI2I.301 | Add to My Program |
| Implicit Maximum Likelihood Estimation for Real-Time Generative Model Predictive Control |
|
| Lee, Grayson | Simon Fraser University |
| Bui, Minh | Simon Fraser University |
| Zhou, Shuzi | Simon Fraser University |
| Li, Yankai | Simon Fraser University |
| Chen, Mo | Simon Fraser University |
| Li, Ke | Simon Fraser University |
Keywords: Machine Learning for Robot Control, Integrated Planning and Learning, Motion and Path Planning
Abstract: Diffusion-based models have recently shown strong performance in trajectory planning, as they are capable of capturing the diverse, multi-modal distributions of complex behaviors. A key limitation of these models, however, is their slow inference speed due to the iterative denoising process. This makes them less suitable for real-time applications such as closed-loop model predictive control (MPC), where plans must be generated quickly and adapted continuously to a changing environment. In this paper, we investigate Implicit Maximum Likelihood Estimation (IMLE) as an alternative generative modeling approach for planning. IMLE offers strong mode coverage while enabling inference that is two orders of magnitude faster, making it particularly well-suited for real-time MPC tasks. Our results demonstrate that IMLE achieves competitive performance on standard offline reinforcement learning benchmarks compared to the standard diffusion-based planner, while substantially improving planning speed in both open-loop and closed-loop settings. We further validate IMLE in a real-time closed-loop human navigation scenario, demonstrating how it enables rapid and adaptive plan generation in dynamic environments.
|
| |
| 15:00-16:30, Paper TuI2I.302 | Add to My Program |
| Dynamic UGV-UAV Cooperative Path Planning in Uncertain Environments |
|
| Nguyen, Ninh | University of North Carolina at Charlotte |
| Akella, Srinivas | University of North Carolina at Charlotte |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Cooperating Robots, Aerial Systems: Applications
Abstract: This paper addresses the Dynamic UGV-UAV Cooperative Path Planning (DUCPP) problem involving one unmanned ground vehicle (UGV) assisted by one or more unmanned aerial vehicles (UAVs) operating on an uncertain road network with potentially impassable edges. DUCPP is particularly relevant for scenarios such as disaster response, emergency supply transport, and rescue operations, where a UGV must reach a specified destination in the presence of partially unknown road conditions. To enable the UGV to travel safely and efficiently to its destination, the UAV(s) dynamically inspect edges in the environment to identify and prune damaged or impassable edges from consideration. We present multiple strategies, including a bidirectional approach, to optimize UGV-UAV cooperation for finding a safe path in an uncertain road network. Furthermore, we explore the impact of using multiple UAVs on reducing the UGV’s travel time, and evaluate the associated computation time. The proposed strategies are implemented and evaluated on 100 urban road networks. The results demonstrate that the bidirectional strategy achieves the best performance in most instances, and using multiple UAVs further reduces UGV travel time at the expense of increased computation time. This paper presents a robust framework for DUCPP to achieve efficient UGV-UAV cooperation for path planning and inspection, offering practical solutions for navigation in challenging and uncertain conditions.
|
| |
| 15:00-16:30, Paper TuI2I.303 | Add to My Program |
| Learning to Drive by Imitating Surrounding Vehicles |
|
| Sonmez, Yasin | University of California, Berkeley |
| Krasowski, Hanna | University of California, Berkeley |
| Arcak, Murat | University of California, Berkeley |
Keywords: Imitation Learning, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: Imitation learning is a promising approach for training autonomous vehicles (AV) to navigate complex traffic environments by mimicking expert driver behaviors. While existing imitation learning frameworks focus on leveraging expert demonstrations, they often overlook the potential of additional complex driving data from surrounding traffic participants. In this paper, we study a data augmentation strategy that leverages the observed trajectories of nearby vehicles, captured by the AV’s sensors, as additional demonstrations. We introduce a simple vehicle-selection sampling and filtering strategy that prioritizes informative and diverse driving behaviors, contributing to a richer dataset for training. We evaluate this idea with a representative learning-based planner on a large real-world dataset and demonstrate improved performance in complex driving scenarios. Specifically, the approach reduces collision rates and improves safety metrics compared to the baseline. Notably, even when using only 10 percent of the original dataset, the method matches or exceeds the performance of the full dataset. Through ablations, we analyze selection criteria and show that naive random selection can degrade performance. Our findings highlight the value of leveraging diverse real-world trajectory data in imitation learning and provide insights into data augmentation strategies for autonomous driving.
|
| |
| 15:00-16:30, Paper TuI2I.304 | Add to My Program |
| Reinforcement Learning-Based Robust Wall Climbing Locomotion Controller in Ferromagnetic Environment |
|
| Um, Yong | Korea Advanced Institute of Science and Technology |
| Shin, Young-Ha | KAIST |
| Kim, Joon-Ha | Diden Robotics Co., Ltd |
| Kwon, Soonpyo | Korea Advanced Institute of Science and Technology, KAIST |
| Park, Hae-Won | Korea Advanced Institute of Science and Technology |
Keywords: Legged Robots, Climbing Robots, Reinforcement Learning
Abstract: We present a reinforcement learning framework for quadrupedal wall-climbing locomotion that explicitly addresses uncertainty in magnetic foot adhesion. A physics-based adhesion model of a quadrupedal magnetic climbing robot is incorporated into simulation to capture partial contact, air-gap sensitivity, and probabilistic attachment failures. To stabilize learning and enable reliable transfer, we design a three-phase curriculum: (1) acquire a crawl gait on flat ground without adhesion, (2) gradually rotate the gravity vector to vertical while activating the adhesion model, and (3) inject stochastic adhesion failures to encourage slip recovery. The learned policy achieves a high success rate, strong adhesion retention, and rapid recovery from detachment in simulation under degraded adhesion. Compared with a model predictive control (MPC) baseline that assumes perfect adhesion, our controller maintains locomotion when attachment is intermittently lost. Hardware experiments with the untethered robot further confirm robust vertical crawling on steel surfaces, maintaining stability despite transient misalignment and incomplete attachment. These results show that combining curriculum learning with realistic adhesion modeling provides a resilient sim-to-real framework for magnetic climbing robots in complex environments.
|
| |
| 15:00-16:30, Paper TuI2I.305 | Add to My Program |
| Viability-Preserving Passive Torque Control |
|
| Zhang, Zizhe | University of Pennsylvania |
| Wang, Yicong | University of Pennsylvania |
| Zhang, Zhiquan | University of Illinois-Urbana Champaign |
| Li, Tianyu | University of Pennsylvania |
| Figueroa, Nadia | University of Pennsylvania |
Keywords: Robot Safety, Collision Avoidance, Physical Human-Robot Interaction
Abstract: Conventional passivity-based torque controllers for manipulators are typically unconstrained, which can lead to safety violations under external perturbations. In this paper, we employ viability theory to pre-compute safe sets in the state-space of joint positions and velocities. These viable sets, constructed via data-driven and analytical methods for self-collision avoidance, external object collision avoidance and joint-position and joint-velocity limits, provide constraints on joint accelerations and thus joint torques via the robot dynamics. A quadratic programming-based control framework enforces these constraints on a passive controller tracking a dynamical system, ensuring the robot states remain within the safe set in an infinite time horizon. We validate the proposed approach through simulations and hardware experiments on a 7-DoF Franka Emika manipulator. In comparison to a baseline constrained passive controller, our method operates at higher control-loop rates and yields smoother trajectories.
|
| |
| 15:00-16:30, Paper TuI2I.306 | Add to My Program |
| Beyond Pairwise Costs: Hyper Graph of Convex Sets for Smoothness-Aware Trajectory Planning |
|
| Zhang, Yunyi | Peking University |
| Guo, Meng | Peking University |
| Li, Zhongkui | Peking University |
Keywords: Motion and Path Planning, Optimization and Optimal Control, Constrained Motion Planning
Abstract: Classical graph of convex sets (GCS) formulations rely on pairwise edge costs, which are insufficient to capture higher-order geometric interactions relevant to trajectory refinement. This paper proposes a hyper graph of convex sets (HGCS), which extends GCS by introducing hyperedges over multiple vertices. Using a 3-uniform construction, a second-order smoothness cost is incorporated to favor path sequences that are more suitable for dynamically feasible trajectory generation. To preserve tractability, the HGCS is converted into an equivalent classical GCS, so the resulting shortest-path problem can still be solved with existing GCS methods. The discrete path is then refined by trajectory optimization within the corresponding safe corridor. Numerical simulations and quadrotor experiments show that the proposed method provides better initialization for downstream optimization, achieves shorter trajectory duration than hierarchical GCS baselines, and is faster than joint spatio-temporal optimization.
|
| |
| 15:00-16:30, Paper TuI2I.307 | Add to My Program |
| MM-TRELLIS: Point-Cloud Guided Multi-Modal 3D Vehicle Generation in Autonomous Driving |
|
| Xiao, Hongli | Shanghai Jiao Tong University |
| Zhang, Youjian | University of Sydney |
| Bai, Yucai | Bosch |
| Jin, Yaohui | Shanghai Jiao Tong University |
| Wang, Chaoyue | University of Sydney |
| Ren, Xiaoguang | Academy of Military Sciences |
| Yang, Wenjing | State Key Laboratory of High Performance Computing (HPCL), School of Computer, National University of Defense Technology |
| Lan, Long | National University of Defense Technology |
|
|
| |
| 15:00-16:30, Paper TuI2I.308 | Add to My Program |
| Bandwidth-Efficient Multi-Agent Communication through Information Bottleneck and Vector Quantization |
|
| Farooq, Ahmad | University of Arkansas at Little Rock |
| Iqbal, Kamran | University of Arkansas at Little Rock |
Keywords: Multi-Robot Systems, Reinforcement Learning, Intelligent Transportation Systems
Abstract: Multi-agent reinforcement learning systems deployed in real-world robotics applications face severe communication constraints that significantly impact coordination effectiveness. We present a framework that combines information bottleneck theory with vector quantization to enable selective, bandwidth-efficient communication in multi-agent environments. Our approach learns to compress and discretize communication messages while preserving task-critical information through principled information-theoretic optimization. We introduce a gated communication mechanism that dynamically determines when communication is necessary based on environmental context and agent states. Experimental evaluation on challenging coordination tasks demonstrates that our method achieves 181.8% performance improvement over no-communication baselines while reducing bandwidth usage by 41.4%. Pareto frontier analysis shows dominance across the entire success-bandwidth spectrum, with an area under the curve of 0.198 vs 0.142 for next-best methods. Our approach significantly outperforms existing communication strategies and establishes a theoretically grounded framework for deploying multi-agent systems in bandwidth-constrained environments such as robotic swarms, autonomous vehicle fleets, and distributed sensor networks.
|
| |
| 15:00-16:30, Paper TuI2I.309 | Add to My Program |
| Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation |
|
| Zhao, Zhenyu | University of Southern California |
| Jing, Hongyi | University of Southern California |
| Liu, Xiawei | University of Southern California |
| Mao, Jiageng | University of Southern California |
| Jha, Abha | University of Southern California |
| Yang, Hanwen | University of Southern California |
| Xue, Rong | University of Southern California |
| Zakharov, Sergey | Toyota Research Institute |
| Guizilini, Vitor | Toyota Research Institute |
| Wang, Yue | University of Southern California |
Keywords: Data Sets for Robot Learning, Humanoid Robot Systems, Dexterous Manipulation
Abstract: From locomotion to dexterous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dexterous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data—including RGB, depth, LiDAR, and tactile inputs—together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data spanning 260 tasks in 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website: https://github.com/anonymouse5202077/Humanoid-Everyday
|
| |
| 15:00-16:30, Paper TuI2I.310 | Add to My Program |
| GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, and Continuous Environments |
|
| Zang, Chuanlong | Robert Bosch GmbH, Corporate Research |
| Mannucci, Anna | Robert Bosch GmbH, Corporate Research |
| Barz, Isabelle | Robert Bosch GmbH, Corporate Research |
| Schillinger, Philipp | Robert Bosch GmbH, Corporate Research |
| Lier, Florian | Robert Bosch GmbH, Corporate Research |
| Hoenig, Wolfgang | Technical University of Berlin |
Keywords: Software Tools for Benchmarking and Reproducibility, Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning
Abstract: Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices. Existing tools either scale under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity with less comparable instrumentation. We present GRACE, a unified 2D simulator+benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol. Our empirical results on public maps and representative planners enable commensurate comparisons on a shared instance set. Furthermore, we quantify the expected representation–fidelity trade-offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale farther). By consolidating representation, execution, and evaluation, GRACE thereby aims to make cross-representation studies more comparable and provides a means to advance multi-robot planning research and its translation to practice.
|
| |
| 15:00-16:30, Paper TuI2I.311 | Add to My Program |
| Closed-Loop Multimodal Sensory Training Enhances the Proprioceptive-Motor Pathway: Low-Load Automaticity and Fine Motor Control |
|
| Cheng, Qian | Tianjin University |
| Yin, Zhouhaotian | Tianjin University |
| Liu, Yuan | Tianjin University |
Keywords: Haptics and Haptic Interfaces, Human Performance Augmentation, Human-Centered Robotics
Abstract: Designing reliable upper-limb human-machine interfaces (HMIs) with low attentional demand requires strengthening the proprioceptive-motor pathway (PMP). We propose a closed-loop multimodal sensory training paradigm that maps three robot joint angles to six bidirectional electrotactile channels and combines visual fading with degrees-of-freedom (DoF) progression to shift reliance from vision to tactile-proprioceptive guidance. The objective is low-load automaticity for supplemental cues and improved native-limb fine motor control. Twenty right-handed adults completed a six-day protocol. Using synchronized kinematics and EEG, we evaluated electrotactile-driven tasks: eyes-closed continuous tracking and static posture reproduction, dual-task posture reproduction with serial subtraction, reversed-mapping generalization, and a proprioceptively constrained maze. Training produced robust gains under tactile-proprioceptive dominance: errors decreased (~30%) and response time shortened. Under dual-task load, posture error and response time decreased while correct subtractions increased and mistakes decreased, supporting low-load automaticity of electrotactile decoding. Although group-level β-event-related desynchronization (ERD) changes were not significant, contralateral ERD reductions and post-movement beta rebound (PMBR) enhancements during tactile decoding were consistent with reduced cortical effort and emerging automatic control. Performance generalized to reversed mapping, and maze completion time decreased significantly, evidencing improved fine motor control. These findings show that closed-loop vision-tactile-proprioceptive integration offers a compact, reproducible route to PMP enhancement, enabling low-load automaticity and finer control, with actionable design targets for prosthetics, exoskeleton rehabilitation, and vision-limited teleoperation.
|
| |
| 15:00-16:30, Paper TuI2I.312 | Add to My Program |
| SINGER: An Onboard Generalist Vision-Language Navigation Policy for Drones |
|
| Adang, Maximilian | Stanford University |
| Low, JunEn | Stanford University |
| Shorinwa, Ola | Stanford University |
| Schwager, Mac | Stanford University |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Aerial Systems: Perception and Autonomy
Abstract: Large vision-language models have driven remarkable progress in open-vocabulary robot policies, e.g., generalist robot manipulation policies, that enable robots to complete complex tasks specified in natural language. Despite these successes, open-vocabulary autonomous drone navigation remains an unsolved challenge due to the scarcity of large-scale demonstrations, the real-time control demands of drones for stabilization, and the lack of reliable external pose estimation modules. In this work, we present SINGER for language-guided autonomous drone navigation in the open world using only onboard sensing and compute. To train robust, open-vocabulary navigation policies, SINGER leverages three central components: (i) a photorealistic language-embedded flight simulator with minimal sim-to-real gap using Gaussian Splatting for efficient data generation, (ii) an RRT-inspired multi-trajectory generation expert for collision-free navigation demonstrations, and these are used to train (iii) a lightweight end-to-end visuomotor policy for real-time closed-loop control. Through extensive hardware flight experiments, we demonstrate superior zero-shot sim-to-real transfer of our policy to unseen environments and unseen language-conditioned goal objects. When trained on ~700k-1M observation-action pairs of language-conditioned visuomotor data and deployed on hardware, SINGER outperforms a velocity-controlled semantic guidance baseline, reaching the query 23.33% more often on average and keeping the query in the field of view 16.67% more often on average, with 10% fewer collisions.
|
| |
| 15:00-16:30, Paper TuI2I.313 | Add to My Program |
| Differentiable Particle Optimization for Fast Sequential Manipulation |
|
| Chen, Lucas | Purdue University |
| Iyer, Shrutheesh Raman | Purdue University |
| Kingston, Zachary | Purdue University |
Keywords: Motion and Path Planning, Task and Motion Planning
Abstract: Sequential robot manipulation tasks require finding collision-free trajectories that satisfy geometric constraints across multiple object interactions in potentially high-dimensional configuration spaces. Solving these problems in real-time and at large scales has remained out of reach due to computational requirements. Recently, GPU-based acceleration has shown promising results, but prior methods achieve limited performance due to CPU-GPU data transfer overhead and complex logic that prevents full hardware utilization. To this end, we present SPaSM (Sampling Particle optimization for Sequential Manipulation), a fully GPU-parallelized framework that compiles constraint evaluation, sampling, and gradient-based optimization into optimized CUDA kernels for end-to-end trajectory optimization without CPU coordination. The method consists of a two-stage particle optimization strategy: first solving placement constraints through massively parallel sampling, then lifting solutions to full trajectory optimization in joint space. Unlike hierarchical approaches, SPaSM jointly optimizes object placements and robot trajectories to handle scenarios where motion feasibility constrains placement options. Experimental evaluation on challenging benchmarks demonstrates solution times in the realm of milliseconds with a 100% success rate; a 4000x speedup compared to existing approaches. Code and examples are available at commalab.org/papers/spasm.
|
| |
| 15:00-16:30, Paper TuI2I.314 | Add to My Program |
| APEX-Glove: An Actuated, Open-Source, Hand-Exoskeleton Glove for Finger Motion Tracking and Kinesthetic 3D Force Feedback |
|
| Kosanovic, Nicolas | University of Louisville |
| Chagas Vaz, Jean | University of Louisville |
Keywords: Haptics and Haptic Interfaces, Wearable Robotics, Telerobotics and Teleoperation
Abstract: Accurate motion tracking and haptics are pivotal to building platforms for immersive Virtual Reality, dexterous robotic hand teleoperation, or embodied AI data collection. Existing technologies fail to provide accurate finger motion tracking and multidimensional force feedback simultaneously, complicating robotic hand control. This work develops the APEX-Glove: the world’s first dorsal-mounted wearable hand exoskeleton yielding both accurate finger motion tracking and active kinesthetic 3D force feedback. Data-driven modeling of the exoskeleton and its Dynamixel XL330 actuators compensates for gravity, Coriolis, and friction forces to improve transparency and comfort. Biomechanically-informed analytical inverse kinematics estimates human finger joint angles at 300 Hz with an average Root Mean Squared Error of 18.5° when compared to industrial-grade datagloves (MANUS Quantum Metagloves). Stationary testing finds that the APEX-Glove can generate up to 0.8 N, 0.7 N, and 1.4 N of force feedback in the x, y, and z directions, on average. Motion retargeting to humanoid robot hands is also detailed, with hardware experimentation demonstrating haptic hand teleoperation. Lastly, we open-source the APEX-Glove’s cost-effective (<700 USD) design to disseminate its motion capture and force feedback capabilities to the community.
|
| |
| 15:00-16:30, Paper TuI2I.315 | Add to My Program |
| Probabilistically-Safe Bipedal Navigation Over Uncertain Terrain Via Conformal Prediction and Contraction Analysis |
|
| Muenprasitivej, Kasidit | Northeastern University |
| Zhao, Ye | Georgia Institute of Technology |
| Chou, Glen | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Planning under Uncertainty, Motion and Path Planning
Abstract: We address the challenge of enabling bipedal robots to traverse rough terrain by developing probabilistically safe planning and control strategies that ensure dynamic feasibility and centroidal robustness under terrain uncertainty. Specifically, we propose a high-level Model Predictive Control (MPC) navigation framework for a bipedal robot with a specified confidence level of safety that (i) enables safe traversal toward a desired goal location across a terrain map with uncertain elevations, and (ii) formally incorporates uncertainty bounds into the centroidal dynamics of locomotion control. To model the rough terrain, we employ Gaussian Process (GP) regression to estimate elevation maps and leverage Conformal Prediction (CP) to construct calibrated confidence intervals that capture the true terrain elevation. Building on this, we formulate contraction-based reachable tubes that explicitly account for terrain uncertainty, ensuring state convergence and tube invariance. In addition, we introduce a contraction-based flywheel torque control law for the reduced-order Linear Inverted Pendulum Model (LIPM), which stabilizes the angular momentum about the center-of-mass (CoM). This formulation provides both probabilistic safety and goal reachability guarantees. For a given confidence level, we establish the forward invariance of the proposed torque control law by demonstrating exponential stabilization of the actual CoM phase-space trajectory to the desired trajectory prescribed by the high-level planner. Finally, we evaluate the effectiveness of our planning framework through physics-based simulations of the Digit bipedal robot in MuJoCo.
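The conformal prediction step described above has a simple core. The sketch below is a split-conformal toy with synthetic elevations and an invented `conformal_interval` helper (not the paper's GP pipeline): held-out residuals are turned into intervals with a target coverage level.

```python
import numpy as np

def conformal_interval(pred_cal, y_cal, pred_test, alpha=0.1):
    """Split conformal prediction: wrap point predictions in intervals that
    cover the truth with probability >= 1 - alpha (assuming exchangeability)."""
    scores = np.abs(y_cal - pred_cal)          # nonconformity on calibration set
    n = len(scores)
    # Finite-sample corrected quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return pred_test - q, pred_test + q

rng = np.random.default_rng(0)
truth = rng.uniform(0.0, 1.0, 500)            # true terrain heights (toy)
pred = truth + rng.normal(0.0, 0.05, 500)     # imperfect elevation estimates
lo, hi = conformal_interval(pred[:250], truth[:250], pred[250:], alpha=0.1)
coverage = np.mean((truth[250:] >= lo) & (truth[250:] <= hi))
print(f"empirical coverage: {coverage:.2f}")  # near the 90% target
```

The calibrated half-width `q` plays the role of the terrain-uncertainty bound that the paper feeds into its reachable tubes.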
|
| |
| 15:00-16:30, Paper TuI2I.316 | Add to My Program |
| MGS-Track: Monocular 6DoF Pose Tracking Via Masked 3D Prior and Online Gaussian Splatting |
|
| Chen, Zhiyuan | Tongji University |
| Lu, Fan | Tongji University |
| Yu, Guo | Tongji University |
| Qu, Sanqing | Tongji University |
| Wu, Ya | Tongji University |
| Yuan, Huang | Beijing Institute of Control Engineering |
| Knoll, Alois | Tech. Univ. Muenchen TUM |
| Chen, Guang | Tongji University |
Keywords: Perception for Grasping and Manipulation, SLAM, Visual-Inertial SLAM
Abstract: Tracking the 6DoF pose of previously unseen objects from monocular RGB videos is crucial for robotic manipulation, yet remains challenging due to depth ambiguity and limited object-centric visual context. Existing trackers often rely on accurate depth sensors, which constrains deployment in low-cost settings, while substituting monocular pseudo-depth frequently introduces geometric errors that reduce tracking robustness. To this end, we propose MGS-Track, an object-centric online tracking and reconstruction framework that combines learning-based geometric priors with differentiable 3D Gaussian Splatting (3DGS). Specifically, we first introduce a mask-augmented DUSt3R network (DUSt3R-M) to establish pairwise correspondences and predict point maps, which serve as geometric priors for initializing and guiding an online 3DGS representation. We then jointly optimize Gaussian parameters and 6DoF object poses in a coarse-to-fine manner, enabling robust tracking and high-fidelity reconstruction. To control model growth and maintain efficiency over time, we further introduce adaptive Gaussian management with capacity-aware selection and mask-consistent pruning. Experiments on YCBInEOAT and HO3D show that MGS-Track consistently outperforms competitive monocular baselines on both pose tracking and object reconstruction in challenging object-centric scenarios.
|
| |
| 15:00-16:30, Paper TuI2I.317 | Add to My Program |
| Cooperative Informed Tree (CoIT*): Cooperative Bi-Directional Multi-Resolution Motion Planning with Adaptive Edge Screening |
|
| Tan, Xiao | Hunan University |
| Wang, Yaonan | Hunan University |
| Ding, Renjie | Hunan University |
| Liu, Min | Hunan University |
| Zhang, Zhe | Hunan University |
| Yu, Xiaoqian | Hunan University |
Keywords: Motion and Path Planning, Collision Avoidance, Task and Motion Planning
Abstract: In informed search-based path planning, heuristic functions that incorporate problem knowledge are essential for guiding the search and improving efficiency. The accuracy and computational cost of these heuristics are therefore critical to performance. However, accuracy and computational efficiency are often contradictory, making it difficult to select an appropriate heuristic for a given problem. In this paper, we present CoIT* (Cooperative Informed Tree*), an almost-asymptotically optimal asymmetric bi-directional planning algorithm designed to address these challenges. CoIT* introduces a multi-resolution and multi-heuristic queue cooperation mechanism between forward and reverse searches: the forward search interacts with the reverse search to provide cooperative information exchange, which enhances both local and global edge screening. This cooperation improves the accuracy of the reverse search, while multi-resolution exploration enables lazy edge validation in the forward search, thereby reducing planning time. We validate CoIT* on high-dimensional benchmark problems as well as simulated and real surgical robot planning tasks. Experimental results demonstrate that CoIT* achieves higher accuracy and significantly lower planning time compared with state-of-the-art planners.
|
| |
| 15:00-16:30, Paper TuI2I.318 | Add to My Program |
| REACT: Real-Time Entanglement-Aware Coverage Path Planning for Tethered Underwater Vehicles |
|
| Amer, Abdelhakim | Aarhus University |
| Mehndiratta, Mohit | NestAI |
| Brodskiy, Yury | EIVA |
| Wehbe, Bilal | German Research Center for Artificial Intelligence |
| Kayacan, Erdal | Paderborn University |
Keywords: Marine Robotics, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: Inspection of underwater structures with tethered underwater vehicles is often hindered by the risk of tether entanglement. We propose REACT (real-time entanglement-aware coverage path planning for tethered underwater vehicles), a framework designed to overcome this limitation. REACT comprises a computationally efficient geometry-based tether model using the signed distance field (SDF) map for accurate, real-time simulation of taut tether configurations around arbitrary structures in 3D. This model enables an efficient online replanning strategy by enforcing a maximum tether length constraint, thereby actively preventing entanglement. By integrating REACT into a coverage path planning framework, we achieve safe and entanglement-free inspection paths, previously challenging due to tether constraints. The complete REACT framework’s efficacy is validated in a pipe inspection scenario, demonstrating safe navigation and full coverage inspection. Simulation results show that REACT achieves complete coverage while maintaining tether constraints and completing the total mission 20% faster than conventional planners, despite a longer inspection time due to proactive avoidance of entanglement that eliminates extensive post-mission disentanglement. Real-world experiments confirm these benefits, where REACT completes the full mission, while the baseline planner fails due to physical tether entanglement.
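A minimal sketch of the maximum-tether-length constraint, assuming the taut tether can be approximated as a polyline from the anchor, through the points where it wraps on structure, to the vehicle (the anchor, wrap point, and `l_max` values here are invented; the paper's SDF-based model is more elaborate):

```python
import math

def taut_length(anchor, wrap_points, vehicle):
    """Length of a taut tether modeled as a polyline: anchor -> wraps -> vehicle."""
    pts = [anchor, *wrap_points, vehicle]
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))

def feasible(anchor, wrap_points, vehicle, l_max):
    # REACT-style constraint: a candidate pose is rejected (triggering replanning)
    # whenever the taut tether would exceed the maximum allowed length.
    return taut_length(anchor, wrap_points, vehicle) <= l_max

anchor = (0.0, 0.0, 0.0)
wraps = [(2.0, 0.0, 1.0)]  # tether wraps around one pipe edge (toy geometry)
print(feasible(anchor, wraps, (4.0, 0.0, 1.0), l_max=6.0))  # True
print(feasible(anchor, wraps, (8.0, 0.0, 1.0), l_max=6.0))  # False
```

In the actual framework the wrap points would come from the SDF-based tether simulation rather than being given, but the length budget check at each candidate waypoint has this shape.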
|
| |
| 15:00-16:30, Paper TuI2I.319 | Add to My Program |
| TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking |
|
| Liu, Jiahang | Harbin Institute of Technology(ShenZhen) |
| Qi, Yunpeng | University of Science and Technology of China |
| Zhang, Jiazhao | Peking University |
| Li, Minghan | Galbot |
| Wang, Shaoan | Peking University |
| Kui, Wu | Beihang University |
| Ye, Hanjing | Southern University of Science and Technology |
| Zhang, Hong | SUSTech |
| Chen, Zhibo | University of Science and Technology of China |
| Zhong, Fangwei | Peking University |
| Zhang, Zhizheng | Beijing Galbot Co., Ltd |
| Wang, He | Peking University |
Keywords: Visual Tracking, Vision-Based Navigation, Learning from Demonstration
Abstract: Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision–Language–Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target’s relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1% and 12% respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
|
| |
| 15:00-16:30, Paper TuI2I.320 | Add to My Program |
| Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind |
|
| Ying, Lance | Harvard University |
| Li, Xinyi | Johns Hopkins University |
| Aarya, Shivam | Johns Hopkins University |
| Fang, Yizirui | Johns Hopkins University |
| Liu, Jason Xinyu | Brown University |
| Yin, Yifan | Johns Hopkins University |
| Tellex, Stefanie | Brown University |
| Tenenbaum, Joshua | Massachusetts Institute of Technology |
| Shu, Tianmin | Johns Hopkins University |
Keywords: Human-Robot Collaboration, Embodied Cognitive Science, Human-Robot Teaming
Abstract: Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings. Results show that SIFToM can significantly improve the performance of a lightweight base VLM (Gemini 2.5 Flash), outperforming state-of-the-art VLMs (Gemini 2.5 Pro) and approaching human-level accuracy on challenging spoken instruction following tasks.
|
| |
| 15:00-16:30, Paper TuI2I.321 | Add to My Program |
| SRCF-UAV: Sparse Radar-Camera Fusion for 3D UAV Detection |
|
| Zhao, Yiming | The Hong Kong University of Science and Technology (Guangzhou) |
| Gong, Zijun | The Hong Kong University of Science and Technology (Guangzhou) |
| Yang, Yang | HKUST Shanghai Center |
| Cui, Ying | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Object Detection, Segmentation and Categorization, Aerial Systems: Perception and Autonomy, Deep Learning for Visual Perception
Abstract: With the rapid development of the low-altitude economy, accurate detection and localization of UAVs have become increasingly important. Conventional radar and visual detection methods have low accuracy, whereas current radar-camera fusion methods are computationally intensive. To overcome these issues, we propose a novel 3D UAV detection approach based on sparse radar-camera fusion, called SRCF-UAV, to achieve high-precision, low-complexity UAV detection in diverse scenarios. Specifically, we first propose an improved query initialization method that incorporates locations from 2D image proposals and radar point clouds. Then, we propose a query update method that sparsely fuses radar and image queries based on features, velocity, and spatial distance. Furthermore, we develop a radar-camera multimodal data collection platform based on real-time kinematic positioning (RTK) and collect a dataset of centimeter-level precision, comprising over 20,000 UAV instances that cover various scenarios, UAV models, and lighting conditions. Finally, extensive experiments on this dataset demonstrate that the proposed approach can achieve an average precision of up to 91.65% and an inference latency as low as 17 ms, validating its effectiveness and efficiency. The dataset and code will be publicly available to support further research.
|
| |
| 15:00-16:30, Paper TuI2I.322 | Add to My Program |
| Shape Control for Modular Continuum Soft Arms: A Distributed Approach to Address Redundancy |
|
| Bordini, Samuele | University of Pisa |
| Caradonna, Daniele | School of Advanced Studies Sant'Anna |
| Bicchi, Antonio | Fondazione Istituto Italiano Di Tecnologia |
| Falotico, Egidio | Scuola Superiore Sant'Anna |
Keywords: Modeling, Control, and Learning for Soft Robots, Distributed Robot Systems
Abstract: Modular continuum soft arms represent an emerging class of robotic systems characterized by flexible, highly deformable structures. Designing shape controllers for these arms poses significant challenges due to their modeling complexity and hyper-redundant nature. Our goal is to develop a scalable control framework for modular arms, where each module is self-contained. Starting from distributed control theory, we assign a collaborative controller to each soft module. Through collaboration among modules, the framework enables the system to achieve the desired tip position and shape. Each controller relies on a minimal model of its self-contained module, such as the Constant Curvature model, and on the local transformation shared by adjacent modules. We present three kinematic control strategies for a modular continuum soft arm - Consensus, Bipartite Consensus, and Formation Control - that progressively relax constraints to achieve more complex, adaptable shapes. In addition, we develop a decentralized curvature-based dynamic controller to manage dynamic coupling among modules. The validation is carried out through numerical analysis and dynamic simulations of soft arms with varying numbers of modules.
|
| |
| 15:00-16:30, Paper TuI2I.323 | Add to My Program |
| 3D Dynamics-Aware Manipulation: Endowing Manipulation Policies with 3D Foresight |
|
| He, Yuxin | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhang, Ruihao | The Hong Kong University of Science and Technology (Guangzhou) |
| Wu, Xianzu | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhang, Zhiyuan | The Hong Kong University of Science and Technology (Guangzhou) |
| Ding, Cheng | JAKA Robotics Co., Ltd |
| Nie, Qiang | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Deep Learning in Grasping and Manipulation, RGB-D Perception, Imitation Learning
Abstract: The incorporation of world modeling into manipulation policy learning has pushed the boundary of manipulation performance. However, existing efforts simply model the 2D visual dynamics, which is insufficient for robust manipulation when target tasks involve prominent depth-wise movement. To address this, we present a 3D dynamics-aware manipulation framework that seamlessly integrates 3D world modeling and policy learning. Three self-supervised learning tasks (current depth estimation, future RGB-D prediction, 3D flow prediction) are introduced within our framework, which complement each other and endow the policy model with 3D foresight. Extensive experiments on simulation and the real world show that 3D foresight can greatly boost the performance of manipulation policies without sacrificing inference speed.
|
| |
| 15:00-16:30, Paper TuI2I.324 | Add to My Program |
| MipSLAM: Alias-Free Gaussian Splatting SLAM |
|
| Li, Yingzhao | Harbin Institute of Technology |
| Li, Yan | National University of Singapore |
| Tian, Shixiong | Harbin Institute of Technology |
| Liu, Yanjie | Harbin Institute of Technology |
| Zhao, Lijun | Harbin Institute of Technology |
| Lee, Gim Hee | National University of Singapore |
Keywords: SLAM
Abstract: This paper introduces MipSLAM, a frequency-aware 3D Gaussian Splatting (3DGS) SLAM framework capable of high-fidelity anti-aliased novel view synthesis and robust pose estimation under varying camera configurations. Existing 3DGS-based SLAM systems often suffer from aliasing artifacts and trajectory drift due to inadequate filtering and purely spatial optimization. To overcome these limitations, we propose an Elliptical Adaptive Anti-aliasing (EAA) algorithm that approximates Gaussian contributions via geometry-aware numerical integration, avoiding costly analytic computation. Furthermore, we present a Spectral-Aware Pose Graph Optimization (SA-PGO) module that reformulates trajectory estimation in the frequency domain, effectively suppressing high-frequency noise and drift through graph Laplacian analysis. Extensive evaluations on Replica and TUM datasets demonstrate that MipSLAM achieves state-of-the-art rendering quality and localization accuracy across multiple resolutions.
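As a toy illustration of the frequency-domain smoothing idea behind SA-PGO, the snippet below applies generic graph-Laplacian regularization to a noisy 1D pose signal on a path graph (this is a standard spectral filter, not the paper's pose-graph solver; the signal, noise level, and weight are invented):

```python
import numpy as np

n = 64
# Laplacian of a path graph: degree 2 inside, 1 at the endpoints.
lap = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
lap[0, 0] = lap[-1, -1] = 1.0

t = np.linspace(0.0, 1.0, n)
clean = np.sin(2 * np.pi * t)                              # smooth "trajectory"
noisy = clean + np.random.default_rng(4).normal(0.0, 0.2, n)

# Minimize ||x - noisy||^2 + lam * x^T L x: the quadratic penalty attenuates
# components with large Laplacian eigenvalues, i.e. high graph frequencies.
lam = 5.0
smoothed = np.linalg.solve(np.eye(n) + lam * lap, noisy)
print(bool(np.abs(smoothed - clean).mean() < np.abs(noisy - clean).mean()))
```

The graph Laplacian's eigenvectors act as a Fourier basis on the graph, which is the sense in which such smoothing suppresses high-frequency drift and noise.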
|
| |
| 15:00-16:30, Paper TuI2I.325 | Add to My Program |
| EgoTraj-Bench: Towards Robust Trajectory Prediction under Ego-View Noisy Observations |
|
| Liu, Jiayi | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhou, Jiaming | Hong Kong University of Science and Technology (Guangzhou) |
| Ye, Ke | The Hong Kong University of Science and Technology (Guangzhou) |
| Lin, Kun-Yu | The University of Hong Kong |
| Wang, Allan | Miraikan |
| Liang, Junwei | HKUST (Guangzhou) |
Keywords: Data Sets for Robotic Vision, Human-Aware Motion Planning
Abstract: Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume noiseless observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, built upon the TBD dataset, which is the first real-world benchmark that aligns noisy, first-person visual histories with clean, bird’s-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10–15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for robust real-world ego-centric trajectory prediction. The benchmark library is available at: https://github.com/zoeyliu1999/EgoTraj-Bench.
|
| |
| 15:00-16:30, Paper TuI2I.326 | Add to My Program |
| Eventually Optimal and Scalable Multi-Agent Planning for Block Cave Mining |
|
| Leet, Christopher | University of Southern California |
| Forte, Paolo | Örebro University |
| Köckemann, Uwe | Orebro Universitet |
| Andreasson, Henrik | Örebro University |
| Koenig, Sven | University of California, Irvine |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Planning, Scheduling and Coordination, Mining Robotics
Abstract: Automation in underground mining has the potential to significantly enhance safety, operational efficiency, and sustainability. However, effectively coordinating fleets of autonomous vehicles in dynamic mine environments introduces substantial challenges in both optimization and motion planning. To address these challenges, we introduce and formalize the Block Cave Mining (BCM) problem, which focuses on computing a transport plan that maximizes ore throughput while satisfying draw ratio constraints. To solve this problem, we propose SAMM, an eventually optimal anytime solver that jointly integrates task assignment, scheduling, and path planning via a mixed-integer linear programming formulation. To improve scalability, we also introduce SAMMS, a variant of SAMM that trades optimality guarantees for efficiency by decomposing the problem into shorter planning subcycles. Experimental evaluations using realistic industrial mine scenarios demonstrate that SAMMS achieves near-optimal throughput and scales effectively to larger fleets and mine layouts.
|
| |
| 15:00-16:30, Paper TuI2I.327 | Add to My Program |
| CableSense: MuJoCo Simulation-Guided Neural Networks for Force Estimation in Cable-Driven Manipulators |
|
| Yang, Chunru | Tsinghua University |
| Xu, Xinruo | Tsinghua University |
| Cui, Zhongrui | Tsinghua |
| Li, Yanan | Tsinghua University |
| Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School at Shenzhen, Tsinghua University, 518055 Shenzhen, China |
Keywords: Tendon/Wire Mechanism, Force and Tactile Sensing, Contact Modeling
Abstract: The cable-driven serial manipulator (CDSM) offers a lightweight structure, high flexibility, and inherent safety, making it suitable for operation in constrained spaces. However, interaction with the environment is inevitable, and sensing such contact typically requires dedicated sensors. To address this limitation, we propose CableSense, a novel force-sensing approach that leverages actuation cable tension information exclusively, thereby eliminating the requirement for additional contact sensors. We first develop a high-fidelity MuJoCo simulation model based on the physical system, reducing the sim-to-real gap through careful calibration of physical and mechanical parameters. Leveraging this simulation model, we generate a comprehensive dataset encompassing diverse external force scenarios. We then implement CableSense as a multi-task deep learning framework for both single-point and multi-point force identification. Experiments demonstrate that CableSense achieves over 98% accuracy in contact location estimation, maintaining a mean absolute direction error of 5.96°.
|
| |
| 15:00-16:30, Paper TuI2I.328 | Add to My Program |
| Quasimetric Decision Transformers: Enhancing Goal-Conditioned Reinforcement Learning with Structured Distance Guidance |
|
| Goyani, Madhav | Ontario Tech University |
| Davoudi, Heidar | Ontario Tech University |
| Ebrahimi, Mehran | Ontario Tech University |
Keywords: Reinforcement Learning, Representation Learning, Imitation Learning
Abstract: Recent work has shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results; Decision Transformers (DT), in particular, achieve this by leveraging sequence modeling. However, standard DT methods rely on return-to-go (RTG) tokens, which are heuristically defined and often suboptimal for goal-conditioned tasks. In this work, we introduce Quasimetric Decision Transformer (QuaD), a novel approach that replaces RTG with learned quasimetric distances, providing a more structured and theoretically grounded guidance signal for long-horizon decision-making. We explore two quasimetric formulations: interval quasimetric embeddings (IQE) and metric residual networks (MRN), and integrate them into DTs. Extensive evaluations on the AntMaze benchmark demonstrate that QuaD outperforms standard Decision Transformers, achieving state-of-the-art success rates and improved generalization to unseen goals. Our results suggest that quasimetric guidance is a viable alternative to RTG, opening new directions for learning structured distance representations in offline RL.
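To make the quasimetric idea concrete, the toy distance below (an invented stand-in, not the paper's IQE or MRN parameterization) is asymmetric yet still satisfies the identity and triangle-inequality properties that make quasimetrics a principled guidance signal:

```python
import numpy as np

def toy_quasimetric(x, y):
    """d(x, y) = sum_i max(0, y_i - x_i): zero on the diagonal, nonnegative,
    satisfies d(x, z) <= d(x, y) + d(y, z), but d(x, y) != d(y, x) in general."""
    return float(np.maximum(y - x, 0.0).sum())

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 8))
assert toy_quasimetric(x, x) == 0.0                   # identity of indiscernibles
assert toy_quasimetric(x, z) <= toy_quasimetric(x, y) + toy_quasimetric(y, z)
print(toy_quasimetric(x, y), toy_quasimetric(y, x))   # asymmetric in general
```

Asymmetry is exactly what goal-conditioned RL needs: reaching a goal downhill can be cheaper than the reverse, which a symmetric metric cannot express.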
|
| |
| 15:00-16:30, Paper TuI2I.329 | Add to My Program |
| Crowd-FM: Learned Optimal Selection of Conditional Flow Matching-Generated Trajectories for Crowd Navigation |
|
| Singha, Antareep | Nanyang Technological University, Singapore |
| Nanwani, Laksh | Robotics Research Center, IIIT Hyderabad, India |
| Pulicken, Mathai Mathew | IIIT Hyderabad |
| Jain, Samkit | International Institute of Information Technology, Hyderabad |
| Singamaneni, Phani Teja | LAAS-CNRS |
| Singh, Arun Kumar | University of Tartu |
| Krishna, Madhava | IIIT Hyderabad |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Collision Avoidance
Abstract: Safe and computationally efficient local planning for mobile robots in dense, unstructured human crowds remains a fundamental challenge. Moreover, ensuring that robot trajectories are similar to how a human moves will increase the acceptance of the robot in human environments. In this paper, we present Crowd-FM, a learning-based approach to address both the safety and human-likeness challenges. Our approach has two novel components. First, we train a Conditional Flow-Matching (CFM) policy over a dataset of optimally controlled trajectories to learn a set of collision-free primitives that a robot can choose in any given scenario. The chosen optimal control solver can generate multi-modal collision-free trajectories, allowing the CFM policy to learn a diverse set of maneuvers. Second, we learn a score function over a dataset of human demonstration trajectories that provides a human-likeness score for the flow primitives. At inference time, computing the optimal trajectory requires selecting the one with the highest score. Our approach improves the state-of-the-art by showing that our CFM policy alone can produce collision-free navigation with a higher success rate than existing learning-based baselines. Furthermore, when augmented with inference-time refinement, our approach can outperform even expensive optimization-based planning approaches. Finally, we validate that our scoring network can select trajectories closer to the expert data than a manually designed cost function.
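The training signal underlying a flow-matching policy can be illustrated with the standard linear interpolant (a generic sketch with synthetic data and an invented `cfm_training_pair` helper, not the paper's trajectory parameterization or conditioning):

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Linear-interpolant flow matching: return the sample x_t and the target
    velocity a network would regress at time t (toy, unconditioned version)."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0          # constant along the straight-line probability path
    return x_t, v_target

rng = np.random.default_rng(2)
x0 = rng.normal(size=(4, 2))    # noise samples
x1 = rng.normal(size=(4, 2))    # expert trajectory waypoints (synthetic)
t = rng.uniform(size=(4, 1))
x_t, v = cfm_training_pair(x0, x1, t)
# A perfect velocity field integrates back to the data: x_t + (1 - t) * v == x1.
print(bool(np.allclose(x_t + (1.0 - t) * v, x1)))  # True
```

Conditioning (on the crowd scenario, in Crowd-FM's case) would enter as an extra input to the learned velocity field; the regression target itself keeps this form.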
|
| |
| 15:00-16:30, Paper TuI2I.330 | Add to My Program |
| SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection |
|
| Wolters, Philipp | Technical University of Munich |
| Gilg, Johannes | Technical University of Munich |
| Teepe, Torben | Technical University of Munich |
| Herzog, Fabian | Technical University of Munich |
| Fent, Felix | TU Munich |
| Rigoll, Gerhard | Technische Universität München |
Keywords: Sensor Fusion, Deep Learning for Visual Perception
Abstract: In this work, we present SpaRC, a novel sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird's Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance of 67.1 NDS and 63.1 AMOTA. The code is available at https://phi-wol.github.io/sparc/.
|
| |
| 15:00-16:30, Paper TuI2I.331 | Add to My Program |
| Failure-Aware RL: Reliable Offline-To-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation |
|
| Li, Huanyu | Shanghai Jiao Tong University |
| Lei, Kun | Shanghai Qizhi Institute |
| Zang, Sheng | Nanyang Technological University |
| Hu, Kaizhe | Tsinghua University |
| Liang, Yongyuan | University of Maryland |
| An, Bo | Nanyang Technological University |
| Li, Xiaoli | Institute for Infocomm Research |
| Xu, Huazhe | Tsinghua University |
Keywords: Reinforcement Learning, Deep Learning in Grasping and Manipulation, Robot Safety
Abstract: Post-training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, Intervention-requiring Failures (IR Failures) (e.g., a robot spilling water or breaking fragile glass) during real-world exploration happen inevitably, hindering the practical deployment of such a paradigm. To tackle this, we introduce Failure-Aware Offline-to-Online Reinforcement Learning (FARL), a framework for minimizing failures during real-world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real-world experiments demonstrate the effectiveness of FARL in significantly reducing IR Failures while improving performance and generalization during online reinforcement learning post-training. FARL reduces IR Failures by 73.1% while elevating performance by 11.3% on average during real-world RL post-training.
|
| |
| 15:00-16:30, Paper TuI2I.332 | Add to My Program |
| NavGSim: High-Fidelity Gaussian Splatting Simulator for Large-Scale Navigation |
|
| Liu, Jiahang | Harbin Institute of Technology (Shenzhen) |
| Duan, Yuanxing | Galbot |
| Zhang, Jiazhao | Peking University |
| Li, Minghan | Galbot |
| Wang, Shaoan | Peking University |
| Zhang, Zhizheng | Beijing Galbot Co., Ltd |
| Wang, He | Peking University |
Keywords: Simulation and Animation, Integrated Planning and Learning, Learning from Demonstration
Abstract: Simulating realistic environments for robots is widely recognized as a critical challenge in robot learning, particularly in terms of rendering and physical simulation. This challenge becomes even more pronounced in navigation tasks, where trajectories often extend across multiple rooms or even entire floors. In this work, we present NavGSim, a Gaussian Splatting-based simulator designed to generate high-fidelity, large-scale navigation environments. Built upon a hierarchical 3D Gaussian Splatting framework, NavGSim enables photorealistic rendering in expansive scenes spanning hundreds of square meters. To simulate navigation collisions, we introduce a Gaussian Splatting-based slice technique that directly extracts navigable areas from reconstructed Gaussians. Additionally, for ease of use, we provide comprehensive NavGSim APIs supporting multi-GPU development, including tools for custom scene reconstruction, robot configuration, policy training, and evaluation. To evaluate NavGSim’s effectiveness, we train a Vision-Language-Action (VLA) model using trajectories collected from the NavGSim and assess its performance in both simulated and real-world environments. Our results demonstrate that NavGSim significantly enhances the VLA model’s scene understanding, enabling the policy to handle diverse navigation queries effectively.
|
| |
| 15:00-16:30, Paper TuI2I.333 | Add to My Program |
| Fully Distributed Real-Time MPC for Cooperative Mobile Manipulation Via Box-iLQR and ADMM with an Object-Centric Planar Projection |
|
| Lee, Jeong tae | Korea Advanced Institute of Science and Technology |
| Park, Jin Ho | Korea Advanced Institute of Science and Technology |
| Yang, Seunghoon | Korea Advanced Institute of Science and Technology |
| Choi, Keun Ha | Korea Advanced Institute of Science and Technology |
| Kim, Kyung-Soo | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Control Architectures and Programming, Optimization and Optimal Control, Agent-Based Systems
Abstract: We propose a fully distributed real-time model predictive control framework for transporting a single rigid object with multiple mobile manipulators. Each robot rapidly solves a local optimal control problem via Box-iLQR, while ADMM enforces consensus on the shared object state without centralized computation. The core idea is an object-centric planar orthographic projection that reduces the whole-body state and input dimensions, substantially lowering the computational load of linearization and the Riccati backward pass. Simulations demonstrate accurate trajectory tracking and consistent convergence. Specifically, the proposed dimension-reduced Box-iLQR solver operates at an average of 6.32 ms per iteration, approximately 4 times faster than a full 6-DoF model, while cutting the computational cost of SQP-based methods nearly in half. Despite this significant reduction, our controller achieves comparable tracking accuracy, offering a practical alternative for real-time cooperative manipulation under limited compute and communication resources. The framework scales naturally with the number of robots and provides a concise and effective design for cooperative mobile manipulation grounded in real-time distributed optimization.
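The consensus mechanism the abstract relies on can be sketched in isolation. Below, each robot's Box-iLQR subproblem is replaced by a hypothetical quadratic local cost so the proximal step has a closed form; the structure (local solve, averaging, dual update) is the standard scaled-form consensus ADMM:

```python
import numpy as np

def consensus_admm(local_targets, x_dim, n_agents, rho=1.0, iters=100):
    """Generic consensus ADMM: agent i minimizes f_i(x_i) = 0.5*||x_i - a_i||^2
    while all local copies x_i are driven to a common consensus value z."""
    x = np.zeros((n_agents, x_dim))   # local object-state copies
    z = np.zeros(x_dim)               # shared consensus variable
    u = np.zeros((n_agents, x_dim))   # scaled dual variables
    for _ in range(iters):
        # Local step: closed-form proximal update for the quadratic cost.
        for i, a_i in enumerate(local_targets):
            x[i] = (a_i + rho * (z - u[i])) / (1.0 + rho)
        z = np.mean(x + u, axis=0)    # consensus (averaging) step
        u += x - z                    # dual ascent step
    return z

# Three agents with quadratic costs centered at different targets; the
# consensus of these costs is the average of the targets.
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
z = consensus_admm(targets, x_dim=2, n_agents=3)
```

In the paper's setting each local solve would be a Box-iLQR pass rather than a closed-form quadratic, but the consensus and dual updates take the same shape.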
|
| |
| 15:00-16:30, Paper TuI2I.334 | Add to My Program |
| Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning |
|
| Kim, Yoonwoo | University of Texas at Austin |
| Arora, Raghav | University of Texas at Austin |
| Martín-Martín, Roberto | University of Texas at Austin |
| Stone, Peter | The University of Texas at Austin |
| Abbatematteo, Ben | The University of Texas at Austin |
| Sung, Yoonchang | Nanyang Technological University |
Keywords: Task and Motion Planning
Abstract: Robot planning in partially observable environments, where not all objects are known or visible, is a challenging problem, as it requires reasoning under uncertainty through partially observable Markov decision processes. During the execution of a computed plan, a robot may unexpectedly observe task-irrelevant objects, which are typically ignored by naive planners. In this work, we propose incorporating two types of common-sense knowledge: (1) certain objects are more likely to be found in specific locations; and (2) similar objects are likely to be co-located, while dissimilar objects are less likely to be found together. Manually engineering such knowledge is complex, so we explore leveraging the powerful common-sense reasoning capabilities of large language models (LLMs). Our planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems. In experiments, CoCo-TAMP achieves an average reduction of 62.7% in planning and execution time in simulation, and 72.6% in real-world demonstrations, compared to a baseline that does not incorporate either type of common-sense knowledge.
|
| |
| 15:00-16:30, Paper TuI2I.335 | Add to My Program |
| Conformal Koopman for Embedded Nonlinear Control with Statistical Robustness: Theory and Real-World Validation |
|
| Hirano, Koki | University of Illinois Urbana-Champaign |
| Tsukamoto, Hiroyasu | University of Illinois Urbana-Champaign/NASA Jet Propulsion Laboratory |
|
|
| |
| 15:00-16:30, Paper TuI2I.336 | Add to My Program |
| The Price Is Not Right: Neuro-Symbolic Methods Outperform VLAs on Structured Long-Horizon Manipulation Tasks with Significantly Lower Energy Consumption |
|
| Duggan, Timothy R. | Tufts University |
| Lorang, Pierrick | AIT Austrian Institute of Technology GmbH - Tufts University |
| Lu, Hong | Tufts University |
| Scheutz, Matthias | Tufts University |
Keywords: Deep Learning in Grasping and Manipulation, Task and Motion Planning, Perception for Grasping and Manipulation
Abstract: Vision-Language-Action (VLA) models have recently been proposed as a pathway toward generalist robotic policies capable of interpreting natural language and visual inputs to generate manipulation actions. However, their effectiveness and efficiency on structured, long-horizon manipulation tasks remain unclear. In this work, we present a head-to-head empirical comparison between a fine-tuned open-weight VLA model (π0) and a neuro-symbolic architecture that combines PDDL-based symbolic planning with learned low-level control. We evaluate both approaches on structured variants of the Towers of Hanoi manipulation task in simulation while measuring both task performance and energy consumption during training and execution. On the 3-block task, the neuro-symbolic model achieves 95% success compared to 34% for the best-performing VLA. The neuro-symbolic model also generalizes to an unseen 4-block variant (78% success), whereas both VLAs fail to complete the task. During training, VLA fine-tuning consumes nearly two orders of magnitude more energy than the neuro-symbolic approach. These results highlight important trade-offs between end-to-end foundation-model approaches and structured reasoning architectures for long-horizon robotic manipulation, emphasizing the role of explicit symbolic structure in improving reliability, data efficiency, and energy efficiency. Code and models are available at https://price-is-not-right.github.io
|
| |
| 15:00-16:30, Paper TuI2I.337 | Add to My Program |
| Spectral Decomposition of Inverse Dynamics for Fast Exploration in Model-Based Manipulation |
|
| Sigurdson, Solvin | California Institute of Technology |
| Riviere, Benjamin | New York University |
| Burdick, Joel | California Institute of Technology |
Keywords: Manipulation Planning, Dexterous Manipulation, Motion and Path Planning
Abstract: Planning long duration robotic manipulation sequences is challenging because of the complexity of exploring feasible trajectories through nonlinear contact dynamics and many contact modes. Moreover, this complexity grows with the problem's horizon length. We propose a search tree method that generates trajectories using the spectral decomposition of the inverse dynamics equation. This equation maps actuator displacement to object displacement, and its spectrum is efficient for exploration because its components are orthogonal and they approximate the reachable set of the object while remaining dynamically feasible. These trajectories can be combined with any search-based method, such as Rapidly-Exploring Random Trees (RRT), for long-horizon planning. Our method performs similarly to recent work in model-based planning for short-horizon tasks, and differentiates itself with its ability to solve long-horizon tasks: whereas existing methods fail, ours can generate 45-second plans with 10+ contact modes using 15 seconds of computation, demonstrating real-time capability in highly complex domains.
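The spectral idea can be illustrated with a linear stand-in: for a hypothetical linearized map A from actuator displacement to object displacement, the singular value decomposition yields orthogonal object-space directions along which to expand a search tree. The matrix A and the step size below are illustrative, not the paper's contact model:

```python
import numpy as np

# Hypothetical linearized map dx = A @ dq from actuator displacement dq
# to object displacement dx (stand-in for the inverse dynamics equation).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))

# Spectral decomposition: columns of U are orthogonal directions in object
# space, ordered by singular value (how easily the object moves that way).
U, S, Vt = np.linalg.svd(A, full_matrices=False)

def expansion_actions(step=0.1):
    """Candidate actuator displacements, one per orthogonal spectral
    direction of the object's reachable set (plus their negatives)."""
    actions = []
    for k in range(len(S)):
        dq = Vt[k] * (step / S[k])  # input that moves the object by `step`
        actions.append(dq)
        actions.append(-dq)
    return actions

acts = expansion_actions()
dx = A @ acts[0]  # displacement along the first spectral direction
```

Because the rows of Vt are orthonormal, each candidate action moves the object along exactly one column of U, which is what makes these components efficient for tree exploration.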
|
| |
| 15:00-16:30, Paper TuI2I.338 | Add to My Program |
| Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling |
|
| Wang, Huanzhen | Fudan University |
| Zhou, Ziheng | Fudan University |
| Tao, Zeng | Fudan University |
| Li, Aoxing | Fudan University |
| Zhao, Yingkai | Fudan University |
| Lin, Yuxuan | Fudan University |
| Wang, Yan | East China Normal University |
| Zhang, Wenqiang | Fudan University |
Keywords: Embodied Cognitive Science, Gesture, Posture and Facial Expressions
Abstract: The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain's strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.
|
| |
| 15:00-16:30, Paper TuI2I.339 | Add to My Program |
| Diffusion-Based Low-Light Image Enhancement with Color and Luminance Priors |
|
| Fu, Xuanshuo | Computer Vision Center |
| Kang, Lei | Computer Vision Center |
| Vazquez-Corral, Javier | Computer Vision Center |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Visual Learning
Abstract: Low-light images often suffer from low contrast, noise, and color distortion, degrading visual quality and impairing downstream vision tasks. We propose a novel conditional diffusion framework for low-light image enhancement that incorporates a Structured Control Embedding Module (SCEM). SCEM decomposes a low-light image into four informative components: illumination, illumination-invariant features, shadow priors, and color-invariant cues. These components serve as control signals that condition a U-Net–based diffusion model trained with a simplified noise-prediction loss. Thus, the proposed SCEM-equipped diffusion method enforces structured enhancement guided by physical priors. In experiments, our model is trained only on the LOLv1 dataset and evaluated without fine-tuning on LOLv2-real, LSRW, DICM, MEF, and LIME. The method achieves state-of-the-art performance in quantitative and perceptual metrics, demonstrating strong generalization across benchmarks. https://casted.github.io/scem/.
|
| |
| 15:00-16:30, Paper TuI2I.340 | Add to My Program |
| Design Analysis of Flexible Rod Jamming Based Pneumatically Soft Actuator |
|
| Seleem, Ibrahim | University of Montpellier |
| Takemura, Hiroshi | Tokyo University of Science |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Soft Sensors and Actuators
Abstract: Soft grippers exhibit exceptional adaptability to novel objects and tasks, making them suitable for safe and effective operation in human-centered applications. To improve their stiffness and gripping force, jamming techniques have frequently been used for manipulating objects of diverse shapes and weights. However, existing jamming-based grippers suffer from significant limitations, including complex and expensive fabrication, excessive weight, slow recovery response, and bending instability due to the high level of vacuum required to achieve jamming. This paper presents a novel design of a pneumatically actuated, flexible-rod-jamming-based soft gripper. It consists of zigzag-based driving chambers that allow the actuator to bend upon pressurization, together with a zigzag jamming chamber filled with flexible rods that are fabricated by activating the internal support. The design is fabricated entirely from Elastic 50A resin using the Stereolithography (SLA) process, without the need for additional fabrication steps. The prototype’s stiffness is achieved by regulating the vacuum inside the jamming chamber. A nonlinear static analysis based on the 3rd-order Yeoh model is conducted to investigate the actuator’s performance in terms of safety and deflection under various operating conditions. The prototype’s performance is evaluated against a conventional actuator with respect to bending repeatability and payload capacity. The experimental results show that the proposed design achieves a bending angle of 178° and carries an external load of 200 g. Additionally, it exhibits lower deflection during bending compared to the traditional zigzag actuator.
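The 3rd-order Yeoh model used in the static analysis has the standard strain-energy form W = C1(I1-3) + C2(I1-3)^2 + C3(I1-3)^3, which for incompressible uniaxial tension gives a closed-form Cauchy stress. A minimal sketch (the material constants are hypothetical, not fitted to Elastic 50A):

```python
def yeoh_uniaxial_stress(lam, c1, c2, c3):
    """Cauchy stress for incompressible uniaxial tension under the
    3rd-order Yeoh model W = sum_i c_i * (I1 - 3)^i.
    lam: principal stretch; c1..c3: material constants [MPa]."""
    I1 = lam**2 + 2.0 / lam                       # first strain invariant
    dW_dI1 = c1 + 2*c2*(I1 - 3) + 3*c3*(I1 - 3)**2
    return 2.0 * (lam**2 - 1.0/lam) * dW_dI1      # sigma = 2(l^2 - 1/l) dW/dI1

# Hypothetical constants; the undeformed state (lam = 1) carries no stress.
s0 = yeoh_uniaxial_stress(1.0, 0.2, 0.01, 0.001)
s_stretched = yeoh_uniaxial_stress(1.5, 0.2, 0.01, 0.001)
```

A finite-element analysis like the paper's would evaluate this energy over the full 3D geometry; the one-dimensional form above only shows how the three Yeoh constants enter the stress response.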
|
| |
| 15:00-16:30, Paper TuI2I.341 | Add to My Program |
| Unified Magnetic 5-DoF Localization Framework for Capsule Robots Via PMMN-DBO: From Single to Multi-Robot Scenarios with Real-Time Control–Localization Co-Design |
|
| Zeng, Zijin | Beihang University |
| Li, Chan | Beihang University |
| Chen, Zaiyang | Beihang University |
| Huang, Shunxiao | Beihang University |
| Niu, Wenyan | Beihang University |
| Sun, Hongyan | Beihang University |
| Tan, Menglu | Beijing Institute of Petrochemical Technology |
| Guo, Yingjian | Beihang University |
| Feng, Lin | Beihang University |
Keywords: Medical Robots and Systems
Abstract: Motivated by clinical needs for precise navigation and safety, low-latency and high-precision localization has become a key enabler for capsule robots. A unified magnetic 5-DoF high-precision localization framework for capsule robots is presented. Building on layered multi-source magnetic field modeling, online external-field compensation, and global optimization-based inversion, the framework achieves real-time decoupling between control and localization fields, while providing a unified interface compatible with diverse hardware configurations and operation modes. On this basis, the PMMN-DBO algorithm is proposed, delivering high-accuracy and efficient localization in single- and multi-capsule scenarios, and supports synchronized control–localization. Experimentally, for single-capsule localization, mean errors are 0.59 mm/0.69° with a 20.2 ms computation time, surpassing conventional methods. In multi-capsule settings, localization errors remain low with stable convergence: mean errors are 1.28 mm/1.13° for two capsules and 2.56 mm/2.83° for three capsules. Under synchronized control–localization, trajectory-tracking errors reach 1.33 mm/1.85°. Overall, the proposed framework is unified, high-precision, efficient, and flexible, laying a general and reusable foundation for clinical-grade precise navigation and closed-loop magnetic control.
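The paper's layered multi-source field model is not reproduced here, but magnetic capsule localization is typically built on the point-dipole field equation B(r) = (mu0 / 4*pi) * (3(m.rhat)rhat - m) / |r|^3, which the inversion then solves for pose. A minimal sketch of the forward model:

```python
import numpy as np

MU0 = 4e-7 * np.pi  # vacuum permeability [T*m/A]

def dipole_field(m, r):
    """Magnetic flux density [T] of a point dipole with moment m [A*m^2]
    observed at displacement r [m] from the dipole."""
    m = np.asarray(m, dtype=float)
    r = np.asarray(r, dtype=float)
    d = np.linalg.norm(r)
    rhat = r / d
    return MU0 / (4 * np.pi) * (3 * np.dot(m, rhat) * rhat - m) / d**3

# On-axis field of a 1 A*m^2 moment at 10 cm: B = (mu0/4pi) * 2m/d^3 = 2e-4 T.
B_axis = dipole_field([0.0, 0.0, 1.0], [0.0, 0.0, 0.1])
```

A 5-DoF localizer inverts this map from multiple sensor readings (position plus two orientation angles; the roll about the dipole axis is unobservable), which is what global optimization-based inversion schemes such as the proposed PMMN-DBO search over.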
|
| |
| 15:00-16:30, Paper TuI2I.342 | Add to My Program |
| A Passive Elastic-Folding Mechanism for Stackable Airdrop Sensors |
|
| Kim, Damyon | The University of Tokyo |
| Honjo, Yuichi | The University of Tokyo |
| Iizuka, Tatsuya | Nippon Telegraph and Telephone Corporation |
| Okubo, Naomi | The University of Tokyo |
| Endo, Naoto | Nippon Telegraph and Telephone Corporation |
| Matsubara, Hiroshi | Nippon Telegraph and Telephone Corporation |
| Kawahara, Yoshihiro | The University of Tokyo |
| Morita, Naoto | Nihon University |
| Sasatani, Takuya | The University of Tokyo |
Keywords: Sensor Networks, Environment Monitoring and Management, Actuation and Joint Mechanisms
Abstract: Air-dispersed sensor networks deployed from aerial robotic systems (e.g., UAVs) provide a low-cost approach to wide-area environmental monitoring. However, existing methods often rely on active actuators for mid-air shape or trajectory control, increasing both power consumption and system cost. Here, we introduce a passive elastic-folding hinge mechanism that transforms sensors from a flat, stackable form into a three-dimensional structure upon release. Hinges are fabricated by laminating commercial sheet materials with rigid printed circuit boards (PCBs) and programming fold angles through a single oven-heating step, enabling scalable production without specialized equipment. Our geometric model links laminate geometry, hinge mechanics, and resulting fold angle, providing a predictive design methodology for target configurations. Laboratory tests confirmed fold angles between 10 deg and 100 deg, with a standard deviation of 4 deg and high repeatability. Field trials further demonstrated reliable data collection and LoRa transmission during dispersion, while the Horizontal Wind Model (HWM)-based trajectory simulations indicated strong potential for wide-area sensing exceeding 10 km.
|
| |
| 15:00-16:30, Paper TuI2I.343 | Add to My Program |
| Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition |
|
| Lau, Yu Kiu (Idan) | New York University |
| Chen, Chao | New York University |
| Jin, Ge | New York University |
| Feng, Chen | New York University |
Keywords: Deep Learning for Visual Perception, Recognition, Visual Learning
Abstract: Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (sequence length), deliver fast inference, and use little memory to meet real-time constraints; however, existing approaches often prioritize performance at the expense of flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable sequence lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% relative to our best comparable baseline. Our code is released at https://ai4ce.github.io/Adapt-STFormer/
|
| |
| 15:00-16:30, Paper TuI2I.344 | Add to My Program |
| Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving |
|
| Bae, Hyunchul | Korea Advanced Institute of Science and Technology (KAIST) |
| Lee, Eunjae | Korea Advanced Institute of Science and Technology (KAIST) |
| Han, Jehyeop | Korea Advanced Institute of Science and Technology (KAIST) |
| Kang, Minhee | Korea Advanced Institute of Science and Technology (KAIST) |
| Kim, Jaehyeon | Korea Advanced Institute of Science and Technology (KAIST) |
| Seo, Junggeun | Korea Advanced Institute of Science and Technology (KAIST) |
| Noh, Minkyun | Korea Advanced Institute of Science and Technology (KAIST) |
| Ahn, Heejin | Korea Advanced Institute of Science and Technology (KAIST) |
Keywords: Intelligent Transportation Systems, Automation Technologies for Smart Cities, Multi-Robot Systems
Abstract: Cooperative autonomous driving, which extends vehicle autonomy by enabling real-time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15-scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish-subscribe pattern through a shared Wi-Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure-based perception and intersection management experiments.
|
| |
| 15:00-16:30, Paper TuI2I.345 | Add to My Program |
| Visual Category-Guided One-Shot Open Affordance Grounding |
|
| Wang, Yangfan | Sun Yat-Sen University |
| Yu, Hongyang | Pengcheng Laboratory |
| Li, Xiying | Sun Yat-Sen University |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation, Recognition
Abstract: Affordance grounding is a challenging task that aims to locate functional regions in object images enabling potential human-object interactions. One-shot open affordance grounding leverages the generalization capability of visual foundation models to overcome limitations of training data scale. However, existing methods often fail to locate functional regions in complex scenarios due to the lack of fine-grained perception, function-appearance heterogeneity, and the overfitting of affordance prompts to known categories. To improve generalization to unseen categories, we introduce a category-conditioned affordance prompt learning, which constructs a complete semantic category-affordance prompt from instance-level visual features. To further improve the accuracy of affordance localization for objects with complex structures, we propose a coarse-to-fine semantic-guided Transformer decoder. This design enhances the decoder's ability to understand the semantic mapping between the affordance words and corresponding object part-level regions. On multiple standard benchmarks, our method achieves competitive performance compared to related methods with less than 1% of the training cost. Notably, our approach shows more robust generalization to unseen objects and novel affordances than the recent SOTA baseline methods.
|
| |
| 15:00-16:30, Paper TuI2I.346 | Add to My Program |
| Multi-Robot Segregation Using Finite-Time MPC with Chernoff Bound-Based Asynchronous Motion Smoothing |
|
| Dubey, Richa | Indian Institute of Technology, Jodhpur |
| Gupta, Shreyash | Indian Institute of Technology, Jodhpur |
| Chaudhary, Saurabh | Indian Institute of Technology, Jodhpur |
| Tripathy, Niladri Sekhar | IIT Jodhpur |
| Shah, Suril Vijaykumar | Indian Institute of Technology Jodhpur |
Keywords: Multi-Robot Systems, Optimization and Optimal Control, Motion Control
Abstract: For multi-robot systems operating in dynamic environments, collision-free segregation into a desired set of groups in finite time is an essential task requirement in many applications. This work presents a control framework for such systems, utilizing Finite-time Model Predictive Control. The objective is to guide the robots toward a segregated formation while adhering to leader-follower dynamics and effectively avoiding collisions. To ensure finite-time convergence, the concept of a control invariant set is incorporated. Furthermore, the paper derives an upper bound on the required time steps for the robots to achieve the segregated formation. In order to maintain a smooth motion profile in the face of external state perturbations, this work proposes a data-driven Chernoff bound-based triggering method that enables Asynchronous Motion Smoothing for the robots. To validate the effectiveness of the proposed control framework, both simulations and hardware experiments are conducted, focusing on the segregation of five robots into two distinct groups.
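The paper's exact triggering rule is not given in the abstract; a generic data-driven sketch applies a Hoeffding-type Chernoff bound to the empirical mean of recent perturbation samples, assumed bounded in [0, 1], and triggers smoothing only when the lower confidence bound clears a threshold. The sample data and threshold below are illustrative:

```python
import math

def smoothing_trigger(samples, threshold, delta=0.05):
    """Trigger motion smoothing when the mean perturbation magnitude
    exceeds `threshold` with confidence 1 - delta. Uses the one-sided
    Hoeffding/Chernoff deviation t = sqrt(ln(1/delta) / (2n)) for
    i.i.d. samples bounded in [0, 1]."""
    n = len(samples)
    mean = sum(samples) / n
    t = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return (mean - t) > threshold  # lower confidence bound beats threshold

# Small perturbations should not trigger; large ones should.
quiet = [0.05, 0.04, 0.06, 0.05, 0.05, 0.04, 0.06, 0.05, 0.05, 0.05]
noisy = [0.80, 0.90, 0.85, 0.90, 0.80, 0.88, 0.92, 0.87, 0.90, 0.85]
```

Gating on the confidence bound rather than the raw mean is what makes the trigger robust to a few spurious samples, at the cost of needing more evidence before smoothing engages.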
|
| |
| 15:00-16:30, Paper TuI2I.347 | Add to My Program |
| Efficient Plane Segmentation in Depth Image Based on Adaptive Patch-Wise Region Growing |
|
| Zhang, Lantao | Shanghai Jiao Tong University |
| Niu, Haochen | Shanghai Jiao Tong University |
| Liu, Peilin | Shanghai Jiao Tong University |
| Wen, Fei | Shanghai Jiao Tong University |
| Ying, Rendong | Shanghai Jiao Tong University |
Keywords: RGB-D Perception, Object Detection, Segmentation and Categorization
Abstract: Plane segmentation algorithms are widely used in robotics, serving key roles in scenarios such as indoor localization, scene understanding, and robotic manipulation. These applications typically require real-time, precise, and robust plane segmentation, which presents a significant challenge. Existing methods based on pixel-wise or fixed-size patch-wise operations are redundant, as planar regions in real-world scenes come in diverse sizes. In this paper, we introduce a highly efficient method for plane segmentation, namely Adaptive Patch-wise Region Growing (APRG). APRG begins with data sampling to construct a data pyramid. To avoid redundant plane fitting in large planar regions, we introduce an adaptive patch-wise plane fitting algorithm in which the pyramid is accessed in a top-down manner, yielding the largest possible planar patches. Subsequently, we introduce a region growing algorithm specially designed for our patch representation. Overall, APRG achieves more than 600 FPS at 640x480 resolution on a mid-range CPU without parallel acceleration techniques, outperforming the state-of-the-art method by a factor of 1.46. Besides its run-time speedup, APRG also significantly improves segmentation quality, especially on real-world data.
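APRG's patch-level test is not spelled out in the abstract, but its basic building block, fitting a plane to a patch of 3D points and checking the out-of-plane variance, is standard PCA and can be sketched directly (the planarity tolerance is an illustrative value):

```python
import numpy as np

def fit_patch_plane(points, planar_tol=1e-4):
    """Fit a plane to an (N, 3) patch of 3D points via PCA.
    Returns (normal, centroid, is_planar): the normal is the eigenvector
    of the covariance matrix with the smallest eigenvalue, and the patch
    counts as planar when the out-of-plane variance is below the tolerance."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    cov = np.cov((pts - centroid).T)         # 3x3 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    normal = eigvecs[:, 0]                   # direction of least variance
    return normal, centroid, bool(eigvals[0] < planar_tol)

# Synthetic patch on the plane z = 0.5.
xs, ys = np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8))
patch = np.stack([xs.ravel(), ys.ravel(), np.full(64, 0.5)], axis=1)
normal, centroid, is_planar = fit_patch_plane(patch)
```

An adaptive scheme like APRG's would run this test top-down on pyramid cells, accepting the largest patch that passes and subdividing otherwise, before growing regions from the accepted patches.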
|
| |
| 15:00-16:30, Paper TuI2I.348 | Add to My Program |
| MTIL: Encoding Full History with Mamba for Temporal Imitation Learning |
|
| Zhou, Yulin | Huazhong University of Science and Technology |
| Lin, Yuankai | Huazhong University of Science and Technology |
| Peng, Fanzhe | HUST |
| Chen, Jiahui | Huazhong University of Science and Technology |
| Huang, Kaiji | Huazhong University of Science and Technology |
| Yang, Hua | Huazhong University of Science and Technology |
| Yin, Zhouping | Professor, School of Mechanical Science and Engineering, Huazhong University of Science & Technology, China |
|
|
| |
| 15:00-16:30, Paper TuI2I.349 | Add to My Program |
| Stability and Transparency in Mixed Reality Bilateral Human Teleoperation |
|
| Black, David Gregory | University of British Columbia |
| Salcudean, Septimiu E. | University of British Columbia |
Keywords: Telerobotics and Teleoperation, Virtual Reality and Interfaces, Human Factors and Human-in-the-Loop, Haptics and Haptic Interfaces
Abstract: Recent work introduced human teleoperation (HT), where the remote robot typically used in conventional bilateral teleoperation is replaced by a novice person wearing a mixed reality headset and tracking the motion of a virtual tool controlled by an expert. HT has advantages in cost, complexity, and patient acceptance for telemedicine in remote or low-resource communities. However, the stability, transparency, and performance of bilateral HT are unexplored. In this paper, we therefore develop a mathematical model of the HT system using test data. We then analyze various control architectures with this model and implement them with the HT system, testing volunteer operators and a virtual fixture-based simulated patient to find the achievable performance, investigate stability, and determine the most promising teleoperation scheme in the presence of time delays. We show that instability in HT, while not destructive or dangerous, makes the system unusable. However, stable and transparent teleoperation is possible with small time delays (<200 ms) through 3-channel teleoperation, or with large delays through model-mediated teleoperation with local pose and force feedback for the novice.
|
| |
| 15:00-16:30, Paper TuI2I.350 | Add to My Program |
| LAMP: Implicit Language Map for Robot Navigation |
|
| Lee, Sibaek | Sungkyunkwan University (SKKU) |
| Yu, Hyeonwoo | SungKyunKwan University |
| Kim, Giseop | DGIST (Daegu Gyeongbuk Institute of Science and Technology) |
| Choi, Sunwook | NAVER LABS Corp |
Keywords: Vision-Based Navigation, Mapping, Semantic Scene Understanding
Abstract: Recent advances in vision-language models have made zero-shot navigation feasible, enabling robots to interpret and follow natural language instructions without requiring labeling. However, existing methods that explicitly store language vectors in grid or node-based maps struggle to scale to large environments due to excessive memory requirements and limited resolution for fine-grained planning. We introduce LAMP (Language Map), a novel neural language field-based navigation framework that learns a continuous, language-driven map and directly leverages it for fine-grained path generation. Unlike prior approaches, our method encodes language features as an implicit neural field rather than storing them explicitly at every location. By combining this implicit representation with a sparse graph, LAMP supports efficient coarse path planning and then performs gradient-based optimization in the learned field to refine poses near the goal. Our two-stage pipeline of coarse graph search followed by language-driven, gradient-guided optimization is the first application of an implicit language map for precise path generation. This refinement is particularly effective at selecting goal regions not directly observed by leveraging semantic similarities in the learned feature space. To further enhance robustness, we adopt a Bayesian framework that models embedding uncertainty via the von Mises–Fisher distribution, thereby improving generalization to unobserved regions. To scale to large environments, LAMP employs a graph sampling strategy that prioritizes spatial coverage and embedding confidence, retaining only the most informative nodes and substantially reducing computational overhead. 
Our experimental results, both in NVIDIA Isaac Sim and on a real multi-floor building, demonstrate that LAMP outperforms existing explicit methods in both memory efficiency and fine-grained goal-reaching accuracy, opening new possibilities for scalable, language-driven robot navigation.
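The von Mises-Fisher uncertainty modeling the LAMP abstract mentions can be illustrated with the standard maximum-likelihood concentration estimate (the Banerjee et al. approximation). The embeddings below are synthetic stand-ins, not language features from the actual system:

```python
import numpy as np

def vmf_kappa(unit_vecs):
    """Approximate ML concentration parameter kappa of a von Mises-Fisher
    distribution fitted to unit vectors of shape (n, p): high kappa means
    tightly clustered directions (low embedding uncertainty)."""
    X = np.asarray(unit_vecs, dtype=float)
    p = X.shape[1]                            # dimensionality
    rbar = np.linalg.norm(X.mean(axis=0))     # mean resultant length
    return rbar * (p - rbar**2) / (1.0 - rbar**2)

rng = np.random.default_rng(1)
mu = np.array([1.0, 0.0, 0.0])

# Tightly clustered unit vectors around mu -> high concentration.
tight = mu + 0.01 * rng.standard_normal((200, 3))
tight /= np.linalg.norm(tight, axis=1, keepdims=True)

# Uniformly scattered unit vectors -> low concentration.
loose = rng.standard_normal((200, 3))
loose /= np.linalg.norm(loose, axis=1, keepdims=True)

k_tight = vmf_kappa(tight)
k_loose = vmf_kappa(loose)
```

In a Bayesian treatment such as LAMP's, a low estimated concentration flags regions where the learned language field is uncertain, which is useful both for the graph sampling strategy and for generalizing to unobserved regions.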
|
| |
| 15:00-16:30, Paper TuI2I.351 | Add to My Program |
| Enhancing Indoor Occupancy Prediction Via Sparse Query-Based Multi-Level Consistent Knowledge Distillation |
|
| Li, Xiang | Tsinghua University |
| Zheng, Yupeng | School of Artificial Intelligence, University of Chinese Academy of Sciences |
| Li, Pengfei | Institute for AI Industry Research (AIR), Tsinghua University |
| Chen, Yilun | Tsinghua University |
| Zhang, Ya-Qin | Institute for AI Industry Research(AIR), Tsinghua University |
| Ding, Wenchao | Fudan University |
|
|
| |
| 15:00-16:30, Paper TuI2I.352 | Add to My Program |
| MARG: MAstering Risky Gap Terrains for Legged Robots with Elevation Mapping |
|
| Dong, Yinzhao | The University of Hong Kong |
| Ma, Ji | The University of Hong Kong |
| Zhao, Liu | The University of Hong Kong |
| Li, Wanyue | The University of Hong Kong |
| Lu, Peng | The University of Hong Kong |
| |
| 15:00-16:30, Paper TuI2I.353 | Add to My Program |
| SpikeClouds: Streaming Spike-Based Processing of LiDAR for Fast and Efficient Object Detection |
|
| Neumeier, Michael | Fortiss GmbH |
| Fasfous, Nael | BMW AG |
| Li, Bing | University of Siegen |
| von Arnim, Axel | Fortiss |
Keywords: Range Sensing, Object Detection, Segmentation and Categorization, Neurorobotics
Abstract: LiDAR sensors are used to provide three-dimensional information about the environment in many robotics applications. The information, accumulated in 3D point clouds, is first acquired by the sensor and then processed further, which leads to high end-to-end latencies and large memory footprints. Streaming approaches tackle this problem by processing partial point cloud data during scanning of the environment. In contrast to existing work that is limited to power-hungry, rotating mechanical scanners, in this paper, we present a streaming method for more efficient scanline-based LiDAR sensors. We process the sequence of scanlines in the form of SpikeClouds with a Spiking Neural Network (SNN) backbone and perform 3D object detection from the accumulated information using a Convolutional Neural Network (CNN) head. Our method achieves close to state-of-the-art detection performance on the KITTI and JRDB22 datasets while reducing the end-to-end latency by 10% and the average memory footprint by 95% on standard GPU hardware. Additionally, when ported onto neuromorphic hardware, our backbone requires 25× less energy compared to reference backbones. SpikeClouds achieves fast and efficient environmental perception for robotic applications by streaming LiDAR to enable spike-based processing.
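The basic unit of such an SNN backbone can be illustrated with a minimal leaky integrate-and-fire neuron processing a streamed input one step at a time; decay, threshold, and the input values below are illustrative, not taken from the paper.

```python
# Minimal leaky integrate-and-fire (LIF) neuron, the elementary unit of
# spiking neural network backbones. Parameters and inputs are illustrative.
def lif_forward(inputs, decay=0.9, threshold=1.0):
    """Process a stream of scalar inputs, emitting a spike train."""
    v = 0.0
    spikes = []
    for x in inputs:
        v = decay * v + x          # leaky integration of the streamed input
        if v >= threshold:
            spikes.append(1)
            v = 0.0                # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

stream = [0.3, 0.4, 0.5, 0.0, 0.2, 0.9, 0.1]
print(lif_forward(stream))  # [0, 0, 1, 0, 0, 1, 0]
```

Because each scanline only nudges membrane potentials, computation happens incrementally as data arrives, which is the property the streaming pipeline exploits.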
|
| |
| 15:00-16:30, Paper TuI2I.354 | Add to My Program |
| Non-Rigid Structure-From-Motion Via Differential Geometry with Recoverable Conformal Scale |
|
| Chen, Yongbo | Shanghai Jiao Tong University |
| Zhang, Yanhao | Beijing Academy of Artificial Intelligence |
| Parashar, Shaifali | Epfl |
| Zhao, Liang | The University of Edinburgh |
| Huang, Shoudong | University of Technology, Sydney |
Keywords: SLAM, Non-rigid structure-from-motion, Mapping, Computer Vision for Medical Robotics
Abstract: Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using 2D selected image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions—such as locally planar surfaces or locally linear deformations—and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness.
|
| |
| 15:00-16:30, Paper TuI2I.355 | Add to My Program |
| Explicit Memory through Online 3D Gaussian Splatting Improves Class-Agnostic Video Segmentation |
|
| Opipari, Anthony | University of Michigan |
| Krishnan, Aravindhan | Amazon Lab126 |
| Gayaka, Shreekant | Amazon |
| Sun, Min | National Tsing Hua University |
| Kuo, Cheng-Hao | Amazon |
| Sen, Arnab | Amazon |
| Jenkins, Odest Chadwicke | University of Michigan |
Keywords: Object Detection, Segmentation and Categorization, RGB-D Perception
Abstract: Remembering where object segments were predicted in the past is useful for improving the accuracy and consistency of class-agnostic video segmentation algorithms. Existing video segmentation algorithms typically use either no object-level memory (e.g. FastSAM) or they use implicit memories in the form of recurrent neural network features (e.g. SAM2). In this paper, we augment both types of segmentation models using an explicit 3D memory and show that the resulting models have more accurate and consistent predictions. For this, we develop an online 3D Gaussian Splatting (3DGS) technique to store predicted object-level segments generated throughout the duration of a video. Based on this 3DGS representation, two fusion techniques, named FastSAM-Splat and SAM2-Splat, are developed that use the explicit 3DGS memory to improve their respective foundation models' predictions. Ablation experiments are used to validate the proposed techniques' design and hyperparameter settings. Results from both real-world and simulated benchmarking experiments show that models which use explicit 3D memories result in more accurate and consistent predictions than those which use no memory or only implicit neural network memories.
|
| |
| 15:00-16:30, Paper TuI2I.356 | Add to My Program |
| Multi-Agent Motion Forecasting Via Mixed Supervision |
|
| Wan, Wenkang | Xidian University |
| Ouyang, Nan | Xidian University |
| Zeng, Mingjin | Xidian University |
| Ao, Lei | Xidian University |
| Cai, Qing | Xidian University |
| Gao, Yuan | Xidian University |
| Sheng, Kai | Xidian University |
| |
| 15:00-16:30, Paper TuI2I.357 | Add to My Program |
| Sce2DriveX: A Generalized MLLM Framework for Scene-To-Drive Learning |
|
| Rui, Zhao | Jilin University |
| Qirui, Yuan | Jilin University |
| Jinyu, Li | Jilin University |
| Haofeng, Hu | Jilin University |
| Yun, Li | The University of Tokyo |
| Zhenhai, Gao | Jilin University |
| Fei, Gao | Jilin University |
Keywords: Autonomous Vehicle Navigation, Autonomous Agents, Semantic Scene Understanding
Abstract: End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is a crucial part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate this conceptual semantic understanding into low-level motion control commands and achieve cross-scene driving generalization and consensus. We propose Sce2DriveX, a human-like chain-of-thought (CoT) driving reasoning MLLM framework, designed to achieve progressive learning from multi-view scene understanding to behavior analysis, motion planning, and vehicle control. Sce2DriveX utilizes multimodal joint learning of local scene videos and global Bird's Eye View (BEV) maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its 3D dynamic/static scene perception and reasoning capabilities and achieving cross-scene generalization. Meanwhile, it reconstructs the implicit cognitive chain inherent in human driving, further enhancing the consensus between autonomous driving and human thought. To improve model performance, we construct the first comprehensive Visual Question Answering (VQA) driving instruction dataset, tailored for 3D spatial understanding and long-axis task reasoning, and introduce a task-oriented three-stage training pipeline to support supervised fine-tuning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance across tasks from scene understanding to end-to-end driving, as well as robust generalization in handling diverse driving scenes on the CARLA Bench2Drive benchmark.
|
| |
| 15:00-16:30, Paper TuI2I.358 | Add to My Program |
| An Alignment-Based Approach to Learning Motions from Demonstrations |
|
| Cuellar, Alex | Massachusetts Institute of Technology |
| Fourie, Christopher K | Massachusetts Institute of Technology (MIT) |
| Shah, Julie A. | MIT |
Keywords: Learning from Demonstration, Imitation Learning, Probabilistic Inference
Abstract: Learning from Demonstration (LfD) is a well-studied field shown to provide robots with fundamental motion skills for a variety of domains. Significant research into various branches of LfD (e.g., learned dynamical systems and movement primitives) can generally be classified into those that learn "time-dependent" or "time-independent" systems. Each paradigm provides fundamental benefits and drawbacks: time-independent methods cannot learn overlapping trajectories, while time-dependence can result in undesirable behavior under perturbation. In this paper, we introduce Cluster Alignment for Learned Motions (CALM), an LfD framework dependent upon an alignment with a representative "mean" trajectory of demonstrated motions rather than pure time- or state-dependence. We also discuss the convergence properties of CALM and introduce an alignment technique able to handle the sudden shifts in alignment possible under perturbation. We show how CALM mitigates the drawbacks of time-dependent and time-independent techniques on 2D datasets and implement our system on a 7-DoF robot learning tasks in three domains.
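Alignment against a representative mean trajectory can be illustrated with classic dynamic time warping; CALM's actual alignment technique differs, so treat this as a minimal sketch of the underlying idea that two executions of the same motion at different speeds can still be matched.

```python
import numpy as np

# Classic dynamic time warping (DTW) between a demonstration and a mean
# trajectory. Illustrative only: CALM uses its own alignment technique.
def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, match of previous cells.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

mean_traj = [0.0, 1.0, 2.0, 3.0]
demo      = [0.0, 0.0, 1.0, 2.0, 3.0]   # same motion, slower start
print(dtw(demo, mean_traj))  # 0.0: perfectly alignable despite timing shift
```

A purely time-indexed comparison would penalize the slow start; the alignment removes that dependence while still ordering the motion, which is the middle ground the abstract describes.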
|
| |
| 15:00-16:30, Paper TuI2I.359 | Add to My Program |
| Breaking the Static Assumption: A Dynamic-Aware LIO Framework Via Spatio-Temporal Normal Analysis |
|
| Chen, Zhiqiang | The University of Hong Kong |
| Le Gentil, Cedric | University of Toronto |
| Lin, Fuling | The University of Hong Kong |
| Lu, Minghao | The University of Hong Kong |
| Qiao, Qiyuan | The University of Hong Kong |
| Xu, Bowen | The University of Hong Kong |
| Qi, Yuhua | Sun Yat-Sen University |
| Lu, Peng | The University of Hong Kong |
Keywords: Localization, SLAM, Mapping
Abstract: This paper addresses the challenge of Lidar-Inertial Odometry (LIO) in dynamic environments, where conventional methods often fail due to their static-world assumptions. Traditional LIO algorithms perform poorly when dynamic objects dominate the scenes, particularly in geometrically sparse environments. Current approaches to dynamic LIO face a fundamental challenge: accurate localization requires a reliable identification of static features, yet distinguishing dynamic objects necessitates precise pose estimation. Our solution breaks this circular dependency by integrating dynamic awareness directly into the point cloud registration process. We introduce a novel dynamic-aware iterative closest point algorithm that leverages spatio-temporal normal analysis, complemented by an efficient spatial consistency verification method to enhance static map construction. Experimental evaluations demonstrate significant performance improvements over state-of-the-art LIO systems in challenging dynamic environments with limited geometric structure. The code and dataset are available at https://github.com/thisparticle/btsa.
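A minimal sketch of the intuition behind spatio-temporal normal analysis: a static surface point re-observed across scans keeps a consistent normal, while a large change flags a likely dynamic point to exclude from registration. The threshold and data below are assumptions, not the paper's values.

```python
import numpy as np

# Flag likely-dynamic points by comparing matched unit normals across two
# scans. Threshold and normals are illustrative assumptions.
def dynamic_mask(normals_prev, normals_curr, cos_thresh=0.9):
    # Row-wise cosine similarity between matched unit normals.
    cos = np.sum(normals_prev * normals_curr, axis=1)
    return cos < cos_thresh   # True -> likely dynamic, exclude from ICP

prev = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
curr = np.array([[0.0, 0.0, 1.0], [0.7071, 0.7071, 0.0], [0.0, 1.0, 0.0]])
mask = dynamic_mask(prev, curr)
print(mask)  # [False  True False]
```

Registration would then run only on the unmasked (static) correspondences, which is how dynamic awareness enters the ICP loop without needing a prior pose.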
|
| |
| 15:00-16:30, Paper TuI2I.360 | Add to My Program |
| DQ-NMPC: Dual-Quaternion NMPC for Quadrotor Flight |
|
| Recalde, Luis F. | Worcester Polytechnic Institute |
| Agrawal, Dhruv Madhusudan | Worcester Polytechnic Institute |
| Arrizabalaga, Jon | Massachusetts Institute of Technology (MIT) |
| Li, Guanrui | Worcester Polytechnic Institute |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Optimization and Optimal Control
Abstract: Micro Aerial Vehicles (MAVs) have great potential to assist humans in complex tasks, with applications ranging from logistics to emergency response. Their agility makes them ideal for operations in complex and dynamic environments. However, achieving precise control in agile flights remains a significant challenge, particularly due to the underactuated nature of quadrotors and the strong coupling between their translational and rotational dynamics. In this work, we propose a novel NMPC framework based on dual-quaternions (DQ-NMPC) for quadrotor flight. By representing both quadrotor dynamics and the pose error directly on the dual-quaternion manifold, our approach enables a compact and globally non-singular formulation that captures the quadrotor coupled dynamics. We validate our approach through simulations and real-world experiments, demonstrating better numerical conditioning and significantly improved tracking performance, with reductions in position and orientation errors of up to 56.11% and 56.77%, compared to a conventional baseline NMPC method. Furthermore, our controller successfully handles aggressive trajectories, reaching maximum speeds of up to 13.66 m/s and maximum accelerations of up to 4.2 g, under which the baseline controller fails.
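The dual-quaternion pose error at the heart of such a formulation can be sketched with a few lines of quaternion algebra: the error conj(desired) ⊗ current collapses to the identity dual quaternion exactly when the poses coincide. This is a geometry-only sketch under standard unit dual-quaternion conventions, not the paper's controller.

```python
import numpy as np

# Minimal unit dual-quaternion utilities (w, x, y, z quaternion convention).
# Illustrative sketch of the pose-error idea, not the DQ-NMPC implementation.
def qmul(p, q):
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([pw*qw - px*qx - py*qy - pz*qz,
                     pw*qx + px*qw + py*qz - pz*qy,
                     pw*qy - px*qz + py*qw + pz*qx,
                     pw*qz + px*qy - py*qx + pz*qw])

def qconj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def dq_from_pose(q_rot, t):
    """Unit dual quaternion (real, dual) from a rotation quat and translation."""
    real = np.asarray(q_rot, dtype=float)
    dual = 0.5 * qmul(np.array([0.0, *t]), real)
    return real, dual

def dq_mul(a, b):
    return qmul(a[0], b[0]), qmul(a[0], b[1]) + qmul(a[1], b[0])

def dq_conj(a):
    return qconj(a[0]), qconj(a[1])

cur = dq_from_pose([1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
err = dq_mul(dq_conj(cur), cur)     # error of a pose w.r.t. itself
print(err[0], err[1])               # identity: real=(1,0,0,0), dual=0
```

Because rotation and translation live in one algebraic object, a single error term couples both, which is what makes the formulation compact.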
|
| |
| 15:00-16:30, Paper TuI2I.361 | Add to My Program |
| Optical LiDAR Communication: Repurposing Existing LiDAR Sensors for Infrastructure-To-Vehicle Communication |
|
| Ikeda, Kazuma | Keio University |
| Hayakawa, Yuki | Keio University |
| Suzuki, Ryo | Keio University |
| Nagai, Shota | Keio University |
| Sako, Ozora | Keio University |
| Nagata, Rokuto | Keio University |
| Yoshida, Ryo | Keio University |
| Yoshioka, Kentaro | Keio University |
Keywords: Automation Technologies for Smart Cities, Engineering for Robotic Systems, Sensor-based Control
Abstract: As autonomous mobile robots increasingly operate in real-world environments, safety has emerged as a critical challenge, particularly regarding obstacle and pedestrian detection in building blind spots and reliable traffic signal recognition. While traditional Vehicle-to-Infrastructure (V2I) systems adopt high-capacity communication through 5G networks or via Optical Wireless Communication (OWC), these approaches require dedicated communication hardware that proves impractical for small, low-cost robots. Additionally, the communication bandwidth required for robot-oriented V2I, such as blind spot object detection and traffic signal states, is relatively limited; the high-capacity communication of 5G is often unnecessary. To address these challenges, we propose a novel optical communication system named Optical LiDAR Communication (OLC), which repurposes existing LiDAR sensors as communication devices. By integrating LiDAR Injection with 2D Code technology, OLC achieves cost-effectiveness through V2I communication without requiring additional hardware on robots. Real-world experiments confirmed that the proposed method achieves a communication success rate of over 76% at distances up to 30 meters. Furthermore, as a proof-of-concept, we develop two key V2I systems utilizing OLC: traffic signal information transmission and blind-spot obstacle detection, and real-time communication performance was demonstrated. These results indicate that the proposed method has potential as a V2I platform for next-generation robotics infrastructure.
|
| |
| 15:00-16:30, Paper TuI2I.362 | Add to My Program |
| Empirical Contact Models for Soft Spherical Robots in Drake |
|
| Oevermann, Micah | Texas A&M University |
| Datta, Dhruv | Texas A&M University |
| Hilburn, Dylan | Belmont University |
| Pravecek, Derek | Texas A&M University |
| Jangale, Rishi | Texas A&M University |
| Villanueva, Aaron | Texas A&M University |
| Ambrose, Robert | Texas A&M University |
Keywords: Modeling, Control, and Learning for Soft Robots, Simulation and Animation, Dynamics
Abstract: Accurate dynamic modeling of soft-shelled spherical robots is challenging due to coupled rigid–soft body interactions and pressure-dependent contact behavior. This letter presents a modeling strategy for an empirically tuned pendulum-driven inflatable spherical robot. The approach combines a rigid-body dynamics engine in Drake with non-conservative effects. The robot’s rigid-body model is generated from a custom URDF and augmented with interchangeable joint friction modules. Three alternative outer shell contact models are also considered: Drake’s native hydroelastic contact, a pressure-dependent injected stiffness–damping model derived from isolated shell experiments, and a rigid point-contact baseline. Shell dynamics are characterized in the steering direction using a custom locking fixture, yielding empirical pressure-related frequency and damping relationships to parameterize the models. Ramp descent experiments across multiple inflation pressures validate the framework, showing that an appropriate model reduces drive velocity prediction error compared to a rigid point-contact case. The approach enables modular integration of additional dynamic effects, supports data-driven parameter tuning, and provides a reproducible pathway for accurate simulation of soft spherical robots.
|
| |
| 15:00-16:30, Paper TuI2I.363 | Add to My Program |
| TIPS: Tiered Information-Rich Planning Strategy for Efficient UGV Autonomous Exploration |
|
| Wang, Zhuoxuan | Southeast University |
| Pan, Shuguo | Southeast University |
| Xu, Jinle | Southeast University |
| Tao, Xianlu | Southeast University |
| Gao, Wang | Southeast University |
| Wang, Qiang | Nanjing University of Posts and Telecommunications |
Keywords: Autonomous Vehicle Navigation, Reactive and Sensor-Based Planning, Probabilistic Inference
Abstract: In this letter, we propose a tiered systematic framework to enhance the overall efficiency and environmental coverage of autonomous exploration for an Autonomous Ground Vehicle (AGV) in complex environments with narrow regions. At the local level, we introduce a novel Multi-cause Triggering Sensor Model (MTSM) to improve informative observation acquisition in narrow regions. Furthermore, the frontier set is defined from a probabilistic distribution perspective and utilized to optimize the initial training pool of Bayesian optimization, thereby accelerating convergence toward the optimal navigation target point. At the global level, we incrementally maintain an Information-Rich Sparse Roadmap (IRSR) by leveraging accumulated historical exploration knowledge. When a dead-zone situation is detected, heuristic guidance is activated and realized by graph search considering information content and distance between IRSR vertices, enabling the AGV to maintain a continuous and sustained exploration process. Three simulation scenarios with increasing complexity are designed, in which comprehensive comparisons and evaluations against different types of state-of-the-art approaches are conducted. The results demonstrate that our framework achieves a favorable balance among algorithm runtime, exploration efficiency, and coverage completeness, with superior performance in narrow regions. Subsequent real-world experiments further validate the strong potential of our proposed method for practical applications.
|
| |
| 15:00-16:30, Paper TuI2I.364 | Add to My Program |
| ILeSiA: Interactive Learning of Robot Situational Awareness from Camera Input |
|
| Vanc, Petr | CIIRC, Czech Technical University in Prague |
| Franzese, Giovanni | Technology Innovation Institute |
| Behrens, Jan Kristof | Czech Technical University in Prague, CIIRC |
| Della Santina, Cosimo | TU Delft |
| Stepanova, Karla | Czech Technical University |
| Kober, Jens | University of Stuttgart |
| Babuska, Robert | Delft University of Technology |
Keywords: Learning from Demonstration, Safety in HRI, Perception for Grasping and Manipulation
Abstract: Learning from demonstration is a promising way to teach robots new skills. However, a central challenge in executing acquired skills is the ability to recognize faults and prevent failures. This is essential since the demonstrations usually cover only a limited number of mostly successful cases. During task execution, unexpected situations that were not encountered during demonstrations may occur. Examples include changes in the robot's environment or interaction with human operators. To recognize such situations, this paper focuses on teaching the robot situational awareness by using a camera input and labeling frames as safe or risky. We train a Gaussian Process regression model fed by a low-dimensional latent space representation of the input images. The model outputs a continuous risk score ranging from zero to one, quantifying the level of risk evidence at each timestep. This allows for pausing task execution in unsafe situations and directly adding new training data, labeled by the human user. Our experiments on a robotic manipulator show that our proposed method can reliably detect both known and novel faults using only a small amount of user-provided data. In contrast, a standard Multi-Layer Perceptron performs well only on faults it has encountered during training. Our method enables the next generation of cobots to be rapidly deployed with easy-to-set-up, vision-based risk assessment, proactively safeguarding humans and detecting misaligned parts or missing objects before failures occur.
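The risk-scoring idea can be sketched with a tiny Gaussian-process regressor from low-dimensional "latent" features to a clipped 0-1 risk score. The kernel, hyperparameters, and data are invented for illustration; the paper's model operates on learned latent representations of camera frames.

```python
import numpy as np

# Tiny GP regression from latent features to a 0..1 risk score.
# Kernel choice, noise level, and data are illustrative assumptions.
def rbf(A, B, ls=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def gp_predict(X, y, Xq, noise=1e-3):
    K = rbf(X, X) + noise * np.eye(len(X))
    return np.clip(rbf(Xq, X) @ np.linalg.solve(K, y), 0.0, 1.0)

# Two latents labeled safe (0) and one labeled risky (1) by the user.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])

risk = gp_predict(X, y, np.array([[0.1, 0.0], [2.9, 3.1]]))
print(risk)  # low risk near the safe latents, high near the risky one
```

New user-labeled frames are simply appended to (X, y), which mirrors the interactive loop: pause on high risk, label, retrain.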
|
| |
| 15:00-16:30, Paper TuI2I.365 | Add to My Program |
| VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation |
|
| Chen, Jiaming | The University of Manchester |
| Jiang, Yiyu | University of Manchester |
| Huang, Aoshen | Shandong University |
| Li, Yang | Shanghai Jiao Tong University |
| Pan, Wei | The University of Manchester |
Keywords: Dual Arm Manipulation, Visual Learning, Manipulation Planning
Abstract: Dual-arm cooperative manipulation holds great promise for tackling complex real-world tasks that demand seamless coordination and adaptive dynamics. Despite substantial progress in learning-based motion planning, most approaches struggle to generalize across diverse manipulation tasks and adapt to dynamic, unstructured environments, particularly in scenarios involving interactions between two objects such as assembly, tool use, and bimanual grasping. To address these challenges, we introduce a novel VLM-Assisted Siamese Flow Diffusion (VLM-SFD) framework for efficient imitation learning in dual-arm cooperative manipulation. The proposed VLM-SFD framework exhibits outstanding adaptability, significantly enhancing the ability to rapidly adapt and generalize to diverse real-world tasks from only a minimal number of human demonstrations. Specifically, we propose a Siamese Flow Diffusion Network (SFDNet) that employs a dual-encoder-decoder Siamese architecture to embed two target objects into a shared latent space, while a diffusion-based conditioning process, conditioned on task instructions, generates two-stream object-centric motion flows that guide dual-arm coordination. We further design a dynamic task assignment strategy that seamlessly maps the predicted 2D motion flows into 3D space and incorporates a pre-trained vision-language model (VLM) to adaptively assign the optimal motion to each robotic arm over time. Experiments validate the effectiveness of the proposed method, demonstrating its ability to generalize to diverse manipulation tasks while maintaining high efficiency and adaptability. The code and demo videos are publicly available on our project website: https://sites.google.com/view/vlm-sfd/
|
| |
| 15:00-16:30, Paper TuI2I.366 | Add to My Program |
| NOVA: Navigation Via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments (I) |
|
| Saviolo, Alessandro | New York University |
| Loianno, Giuseppe | UC Berkeley |
Keywords: Aerial Systems: Perception and Autonomy, Visual Tracking, Vision-Based Navigation
Abstract: Autonomous aerial target tracking in unstructured and GPS-denied environments remains a fundamental challenge in robotics. Many existing methods rely on motion capture systems, pre-mapped scenes, or feature-based localization to ensure safety and control, limiting their deployment in real-world conditions. We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation using only a stereo camera and an IMU. Rather than constructing a global map or relying on absolute localization, NOVA formulates perception, estimation, and control entirely in the target’s reference frame. A tightly integrated stack combines a lightweight object detector with stereo depth completion, followed by histogram-based filtering to infer robust target distances under occlusion and noise. These measurements feed a visual-inertial state estimator that recovers the full 6-DoF pose of the robot relative to the target. A nonlinear model predictive controller (NMPC) plans dynamically feasible trajectories in the target frame. To ensure safety, high-order control barrier functions (CBFs) are constructed online from a compact set of high-risk collision points extracted from depth, enabling real-time obstacle avoidance without maps or dense representations. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss and severe lighting changes.
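The CBF-based safety layer can be illustrated, for single-integrator dynamics and one obstacle, by the closed-form minimum-norm correction that restores the barrier condition h_dot + alpha*h >= 0. This is purely a toy, a first-order CBF rather than NOVA's high-order formulation, with made-up states and gains.

```python
import numpy as np

# Toy control barrier function (CBF) safety filter for a single obstacle:
# keep h(x) = ||x - x_obs||^2 - r^2 >= 0 by minimally correcting the
# commanded velocity when it would violate h_dot + alpha*h >= 0.
# Single-integrator dynamics and all numbers are illustrative assumptions.
def cbf_filter(x, u_des, x_obs, r=1.0, alpha=2.0):
    d = x - x_obs
    h = float(d @ d - r**2)
    grad_h = 2.0 * d                   # dh/dx
    hdot = float(grad_h @ u_des)
    if hdot + alpha * h >= 0.0:
        return u_des                   # nominal command already safe
    # Minimum-norm correction along grad_h restoring the CBF condition.
    lam = -(hdot + alpha * h) / float(grad_h @ grad_h)
    return u_des + lam * grad_h

x = np.array([2.0, 0.0])               # robot 2 m from the obstacle
u_safe = cbf_filter(x, np.array([-5.0, 0.0]), x_obs=np.zeros(2))
print(u_safe)  # [-1.5  0. ]: approach speed clamped, never reversed
```

In NOVA this constraint is built online for each high-risk depth point and folded into the NMPC; the toy shows why the filter only intervenes when the nominal command is unsafe.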
|
| |
| 15:00-16:30, Paper TuI2I.367 | Add to My Program |
| Efficient Learning of Object Placement with Intra-Category Transfer |
|
| Röfer, Adrian | University of Freiburg |
| Buchanan, Russell | University of Waterloo |
| Argus, Maximilian | University of Freiburg |
| Vijayakumar, Sethu | University of Edinburgh |
| Valada, Abhinav | University of Freiburg |
Keywords: Learning from Demonstration, Probabilistic Inference
Abstract: Efficient learning from demonstration for long-horizon tasks remains an open challenge in robotics. While significant effort has been directed toward learning trajectories, a recent resurgence of object-centric approaches has demonstrated improved sample efficiency, enabling transferable robotic skills. Such approaches model tasks as a sequence of object poses over time. In this work, we propose a scheme for transferring observed object arrangements to novel object instances by learning these arrangements on canonical class frames. We then employ this scheme to enable a simple yet effective approach for training models from as few as five demonstrations to predict arrangements of a wide range of objects including tableware, cutlery, furniture, and desk spaces. We propose a method for optimizing the learned models to enable efficient learning of tasks such as setting a table or tidying up an office with intra-category transfer, even in the presence of distractors. We present extensive experimental results in simulation and on a real robotic system for table setting which, based on human evaluations, scored 73.3% compared to a human baseline. We make the code and trained models publicly available upon acceptance.
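The canonical-frame transfer idea can be sketched in SE(2): express a demonstrated placement relative to one instance's canonical frame, then replay it on a novel instance's frame. All poses below are invented for illustration; the paper learns arrangements over full object categories.

```python
import numpy as np

# Illustrative intra-category transfer with SE(2) homogeneous transforms.
# All poses are made-up; the paper learns these on canonical class frames.
def se2(x, y, theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

T_cup = se2(1.0, 2.0, 0.0)                 # demo object's canonical frame
T_place = se2(1.0, 2.5, 0.0)               # demonstrated placement (world)
T_rel = np.linalg.inv(T_cup) @ T_place     # placement in the object's frame

T_new_cup = se2(4.0, -1.0, np.pi / 2)      # novel instance, rotated 90 deg
T_new_place = T_new_cup @ T_rel            # transferred placement
print(T_new_place[:2, 2])                  # [ 3.5 -1. ]
```

Because the placement is stored relative to the canonical frame, the same relative transform T_rel applies to any instance of the category, which is what makes five demonstrations enough.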
|
| |
| 15:00-16:30, Paper TuI2I.368 | Add to My Program |
| Suppressing Self-Excitation in Adaptive Sliding Mode Control: Observer-Based Design and Experimental Validation of Flapping-Wing Micro Aerial Vehicle |
|
| Park, Heetae | Chungnam National University |
| Kim, Seungkeun | Chungnam National University |
| Suk, Jinyoung | Chungnam National University |
Keywords: Biologically-Inspired Robots, Aerial Systems: Mechanics and Control, Robust/Adaptive Control
Abstract: This work covers the design of a sliding mode control to stabilize the attitude of a flapping-wing micro aerial vehicle. The approach employs an auxiliary observer loop to avoid system excitation from unmodeled actuator dynamics, a common issue in sliding mode control applications. A proportional-integral observer is constituted in the auxiliary loop to minimize interactions with the actuator dynamics and to handle parametric uncertainties in the low bandwidth. Then, the observer-based sliding mode control is designed to track the attitude command with the reconstructed state variables from the observer loop. Furthermore, a barrier function-based adaptive gain strategy is utilized to modulate the control input according to the system’s current state, ensuring efficient use of control effort. Flight experiments were conducted with a freely movable dummy mass attached to the bottom of the vehicle, simulating external disturbances. The proposed sliding mode control outperforms PD, classical, and super-twisting sliding mode controls in tracking performance and control efficiency, while mitigating self-excitation due to discontinuous input.
|
| |
| 15:00-16:30, Paper TuI2I.369 | Add to My Program |
| Multivariate Active Learning and Adaptive Sampling with Multi-Kernel Gaussian Processes |
|
| Nguyen, Thien Hoang | University of Sydney |
| Wallace, Nathan Daniel | University of Sydney |
| Harrison, Nicholas | The University of Sydney: The Australian Centre for Field Robotics |
| Sukkarieh, Salah | The University of Sydney: The Australian Centre for Field Robotics |
|
|
| |
| 15:00-16:30, Paper TuI2I.370 | Add to My Program |
| Entropy-Based Incremental Coverage Path Planning for Multi-UAV Persistent Monitoring |
|
| Luo, Cai | China University of Petroleum (East China) |
| Wang, Lijun | China University of Petroleum (East China) |
| Jin, Jiucai | First Institute of Oceanography, Ministry of Natural Resources |
| Du, Zhenpeng | University of Exeter |
| Miao, Wang | University of Exeter |
Keywords: Motion and Path Planning, Environment Monitoring and Management, Multi-Robot Systems
Abstract: Oil spills continuously affect marine ecosystems and require rapid monitoring for effective emergency response. This letter tackles the problem of persistent monitoring for continuously changing and scattered oil spill regions through Entropy-Based Incremental Coverage Path Planning (EICPP). By using contour comparison between monitoring cycles, an incremental coverage mechanism is first introduced to focus on newly emerged oil spill regions. Then, a balanced region division algorithm is incorporated to handle scattered oil spill areas while ensuring equal workload distribution among UAVs. The entropy-based path planning improves oil spill monitoring effectiveness, quantified by Drift Information Freshness (DIF), by prioritizing high-entropy regions under limited UAV resources. We evaluate the robustness and effectiveness of our method across multiple scenarios. Our method demonstrates clear advantages in DIF, achieving 19–25% improvements over strong baselines across different spill scales and about 19.6–24% on real-world oil spill datasets. It also substantially reduces total flight distance while consistently satisfying the 90% coverage requirement.
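The entropy prioritization step can be sketched directly: compute the Shannon entropy of each cell's oil-presence probability and visit the most uncertain cells first. The grid values are illustrative; the paper combines this with region division and incremental coverage.

```python
import numpy as np

# Shannon entropy of per-cell oil-presence beliefs; the most uncertain
# (highest-entropy) cells get monitoring priority. Values are illustrative.
def cell_entropy(p):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

probs = np.array([0.05, 0.5, 0.9, 0.35])   # belief that each cell has oil
H = cell_entropy(probs)
order = np.argsort(-H)                      # visit most uncertain first
print(order)  # [1 3 2 0]
```

Cells near certainty (0.05, 0.9) contribute little fresh information, so the planner spends limited UAV flight time where the belief is closest to 0.5.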
|
| |
| 15:00-16:30, Paper TuI2I.371 | Add to My Program |
| Modular Acoustic Graph SLAM for Underwater Monitoring with Autonomous Underwater Vehicles (I) |
|
| Real, Marta | Universitat De Girona |
| Vial, Pau | Universitat De Girona |
| Pi, Roger | Universitat De Girona |
| Palomeras, Narcis | Universitat De Girona |
| Carreras, Marc | Universitat De Girona |
Keywords: Marine Robotics, SLAM, Environment Monitoring and Management
Abstract: This work was developed under the need for an acoustic localization system to monitor marine protected areas (MPAs) with the help of autonomous underwater vehicles (AUVs). Although the use of acoustic signals for underwater localization has been previously studied, most of the solutions rely on filter-based optimization, which is prone to linearization problems in long-term applications. Instead, we implemented a Modular Acoustic Graph Simultaneous Localization and Mapping (SLAM) algorithm that, using a factor graph framework, tracks acoustic beacons with either ranges or bearings. In addition, we developed several novel methods, such as a delayed-position update for the ultra-short baseline (USBL) position factor integration process, an initialization algorithm for acoustic landmarks, and a new 3D bearing factor that combines two angles. After developing the algorithm, field experiments were carried out in different areas on the coast of Catalonia. Besides the localization, some monitoring tasks were also tested, such as visual mapping of localized landmarks or optical transmission of data with seafloor stations, which helped validate the accuracy of the acoustic localization system. The results of such experiments are presented and discussed.
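The geometry behind a 3D bearing factor combining two angles can be sketched as follows: predict the azimuth and elevation of a landmark as seen from the vehicle and compare against the measured bearing. Factor-graph plumbing is omitted and the numbers are invented; this only illustrates the two-angle measurement model.

```python
import numpy as np

# Azimuth/elevation bearing of a landmark from a vehicle, and the residual
# against a measured bearing. Geometry only; the factor-graph machinery and
# all numbers are illustrative assumptions.
def bearing(vec):
    az = np.arctan2(vec[1], vec[0])
    el = np.arctan2(vec[2], np.hypot(vec[0], vec[1]))
    return np.array([az, el])

landmark = np.array([3.0, 3.0, 3.0])
vehicle = np.zeros(3)
predicted = bearing(landmark - vehicle)    # [pi/4, ~0.6155]
measured = np.array([np.pi / 4, 0.6])      # hypothetical acoustic bearing
residual = measured - predicted            # what the factor would minimize
print(predicted, residual)
```

A range-only beacon would constrain a sphere; the two-angle bearing instead constrains a ray, and combining both factor types is what the modular graph supports.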
|
| |
| 15:00-16:30, Paper TuI2I.372 |
| A-SPAM: A Novel Asynchronous Semantic Padding and Matching Integrated Framework for Dynamic Loop Closure Detection |
|
| He, QiBin | Macau Polytechnic University |
| Wang, Yapeng | Macao Polytechnic University |
| Chai, Yanming | Macao Polytechnic University |
| Huang, Qiyue | Macau Polytechnic University |
| Zhang, Tiankui | Beijing University of Posts and Telecommunications |
| Sio, KeiIm | Macao Polytechnic University |
| Zhang, Jie | Macao Polytechnic University |
Keywords: SLAM, RGB-D Perception, Localization
Abstract: Loop closure detection in dynamic SLAM faces critical challenges when dynamic objects dominate camera views, degrading frame-to-frame methods reliant on static landmarks. We propose A-SPAM, an asynchronous framework that constructs spatiotemporal semantic graphs via semantic padding (entity tracking + rigid structure analysis) and validates loops via semantic matching (topology-feature hybrid correlation). Evaluated on TUM and BONN datasets, A-SPAM achieves at least 76.8% recall rate at 100% precision in dynamic environments, while maintaining a mean translational error of less than 0.07 m across dynamic sequences under degraded odometry conditions. The proposed framework corrects erroneous trajectories and enhances robustness against odometry failures in dynamic environments.
|
| |
| 15:00-16:30, Paper TuI2I.373 |
| Scale-Invariant and View-Relational Representation Learning for Full Surround Monocular Depth |
|
| Hwang, Kyumin | DGIST |
| Choi, Wonhyeok | DGIST |
| Han, Kiljoon | Daegu Gyeongbuk Institute of Science and Technology |
| Choi, Wonjoon | DGIST |
| Choi, Minwoo | DGIST |
| Na, Yongcheon | Hyundai Motors |
| Park, Minwoo | Hyundai Motor Company |
| Im, Sunghoon | DGIST |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Recognition
Abstract: Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining a knowledge distillation scheme, traditionally used in classification, with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
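Depth binning, as referenced in the abstract, commonly recovers a continuous depth as a probability-weighted sum of bin centers. The sketch below shows only that standard soft-binning step; the tensor shapes and toy values are assumptions, not the paper's network.

```python
import numpy as np

def depth_from_bins(bin_probs, bin_centers):
    """Soft depth estimate: probability-weighted sum of metric bin centers.

    bin_probs:   (H, W, K) softmax over K depth bins per pixel
    bin_centers: (K,) metric depth of each bin center (meters)
    """
    return np.einsum('hwk,k->hw', bin_probs, bin_centers)

# Toy example: a single pixel with three bins at 1 m, 5 m, and 10 m.
probs = np.array([[[0.2, 0.5, 0.3]]])
centers = np.array([1.0, 5.0, 10.0])
depth = depth_from_bins(probs, centers)  # 0.2*1 + 0.5*5 + 0.3*10 = 5.7 m
```

In the paper's setting, the distilled scale-invariant quantities would correspond to the bin probabilities, while the metric bin centers are supervised by ground-truth depth.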
|
| |
| 15:00-16:30, Paper TuI2I.374 |
| Controlling Deformable Objects with Non-Negligible Dynamics: A Shape-Regulation Approach to End-Point Positioning |
|
| Tiburzio, Sebastien | Delft University of Technology |
| Coleman, Tomás | TU Delft |
| Feliu, Daniel | Delft University of Technology (TU Delft) |
| Della Santina, Cosimo | TU Delft |
Keywords: Motion Control of Manipulators, Underactuated Robots, Deformable Object Manipulation, Modeling, Control, and Learning for Soft Robots
Abstract: Model-based manipulation of deformable objects has traditionally neglected object dynamics, mostly focusing on very lightweight objects at steady state. At the same time, soft robotic research has made considerable strides toward general modeling and control, despite soft robots and deformable objects being very similar from a mechanical standpoint. In this work, we leverage these recent results to develop a control-oriented, fully dynamic framework for slender deformable objects grasped at one end by a robotic manipulator. We introduce a dynamic model of this system using functional strain parameterizations and describe the manipulation challenge as a regulation control problem. This enables us to define a fully model-based control architecture, for which we analytically prove closed-loop stability and provide sufficient conditions for steady-state convergence to the desired state. The nature of this work is intended to be markedly experimental. We provide an extensive experimental validation of the proposed ideas, tasking a robot arm with controlling the distal end of six different cables to a given planar position and orientation in space.
|
| |
| 15:00-16:30, Paper TuI2I.375 |
| Bimanual Regrasp Planning and Control for Active Reduction of Object Pose Uncertainty |
|
| Nagahama, Ryuta | Osaka University |
| Wan, Weiwei | Osaka University |
| Hu, Zhengtao | Shanghai University |
| Harada, Kensuke | Osaka University |
Keywords: Grasping, Bimanual Manipulation, Planning under Uncertainty
Abstract: Precisely grasping an object is a challenging task due to pose uncertainties. Conventional methods have used cameras and fixtures to reduce object uncertainty. They are effective but require intensive preparation, such as designing jigs based on the object geometry and calibrating cameras with high-precision tools fabricated using lasers. In this study, we propose a method to reduce the uncertainty of the position and orientation of a grasped object without using a fixture or a camera. Our method is based on the concept that the flat finger pads of a parallel gripper can reduce uncertainty along its opening/closing direction through flat surface contact. Three orthogonal grasps by parallel grippers with flat finger pads collectively constrain an object's position and orientation to a unique state. Guided by this concept, we develop a regrasp planning and admittance control approach that sequentially finds and leverages three orthogonal grasps of two robotic arms to actively reduce uncertainties in the object pose. We evaluated the proposed method under different initial object uncertainties and verified that it had good repeatability. The deviation levels of the experimental trials were on the same order of magnitude as those of an optical tracking system, demonstrating strong relative inference performance.
|
| |
| 15:00-16:30, Paper TuI2I.376 |
| Energy-Based Kinematic Analysis on Magnetic Soft Continuum Robot with Asymmetric Magnetization |
|
| Lee, Junyeong | DGIST |
| Park, Joowon | University of Ulsan |
| Park, Sukho | DGIST |
Keywords: Micro/Nano Robots, Medical Robots and Systems, Modeling, Control, and Learning for Soft Robots
Abstract: Magnetically actuated soft continuum robots (MSCRs), which offer remote and wireless control via external magnetic fields along with high flexibility, have recently emerged as a promising technology for minimally invasive surgery (MIS). However, the magnetic actuation forces of MSCRs are generally limited, resulting in inherent workspace constraints. To overcome these limitations, various design strategies have been explored, including the development of an asymmetric magnetized soft continuum robot (AMSCR). Although AMSCRs have demonstrated a significantly larger workspace than conventional MSCRs, a quantitative relationship between the magnetization patterns of embedded magnetic particles and the resulting workspace has not yet been fully clarified. In this study, an energy-based kinematic analysis of AMSCR was conducted to address this issue. Specifically, the equilibrium posture of the AMSCR was determined by minimizing the total potential energy, considering different combinations of external magnetic field directions and internal magnetization patterns. Based on the resulting potential energy graph, the workspace of the AMSCR was quantitatively analyzed, and an optimal linear asymmetric magnetization pattern was identified. Furthermore, the proposed energy-based kinematic model was validated through finite element analysis (FEA) conducted using COMSOL Multiphysics, as well as through experiments performed on a fabricated AMSCR prototype. As a result, an optimal magnetization design method for linearly asymmetric AMSCRs was proposed and experimentally confirmed. The proposed approach is expected to be further applicable to the kinematic performance evaluation and design optimization of AMSCRs with various other magnetization patterns.
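The energy-based idea of finding an equilibrium posture by minimizing total potential energy can be sketched with a toy one-degree-of-freedom model. The energy terms, stiffness, and brute-force search below are illustrative assumptions; the paper's analysis uses full strain-based kinematics and FEA, not this reduced model.

```python
import numpy as np

def equilibrium_angle(B_dir_deg, m_strength=1.0, B_strength=1.0, k=0.5):
    """Toy 1-DoF model: bending angle minimizing elastic + magnetic energy.

    U(theta) = 0.5*k*theta^2 - m*B*cos(theta - theta_B)
    Solved by brute-force search over theta in [-pi, pi].
    """
    theta_B = np.deg2rad(B_dir_deg)
    thetas = np.linspace(-np.pi, np.pi, 20001)
    U = 0.5 * k * thetas**2 - m_strength * B_strength * np.cos(thetas - theta_B)
    return thetas[np.argmin(U)]

# Field aligned with the rest posture: equilibrium stays at zero bending.
theta_eq = equilibrium_angle(0.0)
# Field at 90 deg: equilibrium lies between 0 and 90 deg, where elastic
# restoring torque balances magnetic torque.
theta_90 = equilibrium_angle(90.0)
```

Sweeping the field direction and recording the equilibrium angle traces out a workspace, mirroring how the paper's potential energy graphs are used to compare magnetization patterns.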
|
| |
| 15:00-16:30, Paper TuI2I.377 |
| Learning to Drive Anywhere with Model-Based Reannotation |
|
| Hirose, Noriaki | UC Berkeley / TOYOTA Motor North America |
| Ignatova, Lydia | University of Southern California |
| Stachowicz, Kyle | University of California, Berkeley |
| Glossop, Catherine | University of California, Berkeley |
| Levine, Sergey | UC Berkeley |
| Shah, Dhruv | Google DeepMind |
Keywords: Big Data in Robotics and Automation, Vision-Based Navigation
Abstract: Developing broadly generalizable visual navigation policies for robots is a significant challenge, primarily constrained by the availability of large-scale, diverse training data. While curated datasets collected by researchers offer high quality, their limited size restricts policy generalization. To overcome this, we explore leveraging abundant, passively collected data sources, including large volumes of crowd-sourced teleoperation data and unlabeled YouTube videos, despite their potential for lower quality or missing action labels. We propose Model-Based ReAnnotation (MBRA), a framework that utilizes a learned short-horizon, model-based expert model to relabel or generate high-quality actions for these passive datasets. This relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints. We demonstrate that LogoNav, trained using MBRA-processed data, achieves state-of-the-art performance, enabling robust navigation over distances exceeding 300 meters in previously unseen indoor and outdoor environments. Our extensive real-world evaluations, conducted across a fleet of robots (including quadrupeds) in six cities on three continents, validate the policy's ability to generalize and navigate effectively even amidst pedestrians in crowded settings.
|
| |
| 15:00-16:30, Paper TuI2I.378 |
| Grasp Independent Indirect Tool Force Estimation Using Vision-Based Tactile Sensors |
|
| Li, Luchen | University College London |
| George Thuruthel, Thomas | University College London |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Modeling, Control, and Learning for Soft Robots
Abstract: Humans possess the capability to seamlessly integrate tools into their body schema, enabling precise and adaptive interactions with the environment. This touch-mediated ability allows us to dexterously use tools in everyday tasks, an ability currently lacking in robotic systems. In this work, we propose a novel method for indirect force estimation in robotic tool use, a prerequisite for advanced tool use, leveraging vision-based tactile sensing (VTS) and deep learning techniques. By capturing high-resolution spatial deformations from tactile images, our model implicitly infers force transmission dynamics without requiring explicit knowledge of tool properties or material characteristics. We validate our approach across multiple tool types using a single trained machine learning model, demonstrating its generalization capability. This work represents the first demonstration of indirect force estimation for tool-mediated robotic interactions, offering a pathway toward more dexterous and adaptive robotic tool use in real-world applications.
|
| |
| 15:00-16:30, Paper TuI2I.379 |
| MonoMPC: Monocular Vision Based Navigation with Learned Collision Model and Risk-Aware Model Predictive Control |
|
| Sharma, Basant | University of Tartu |
| Jadhav, Prajyot | University of Tartu |
| Paul, Pranjal | International Institute of Information Technology |
| Krishna, Madhava | IIIT Hyderabad |
| Singh, Arun Kumar | University of Tartu |
Keywords: Planning under Uncertainty, Collision Avoidance, Vision-Based Navigation
Abstract: Navigating unknown environments with a single RGB camera is challenging, as the lack of depth information prevents reliable collision-checking. While some methods use estimated depth to build collision maps, we found that depth estimates from vision foundation models are too noisy for zero-shot navigation in cluttered environments. We propose an alternative approach: instead of using noisy estimated depth for direct collision-checking, we use it as a rich context input to a learned collision model. This model predicts the distribution of minimum obstacle clearance that the robot can expect for a given control sequence. At inference, these predictions inform a risk-aware MPC planner that minimizes estimated collision risk. We propose a joint learning pipeline that co-trains the collision model and risk metric using both safe and unsafe trajectories. Crucially, our joint training ensures well-calibrated uncertainty in our collision model, which improves navigation in highly cluttered environments. Consequently, real-world experiments show reductions in collision rate and improvements in goal reaching and speed over several strong baselines.
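The risk-aware selection step described above can be sketched as follows. The Gaussian clearance model, the risk weight, and the candidate format are assumptions for illustration; the paper's planner is a full MPC over learned clearance distributions, not this scoring loop.

```python
import math

def collision_risk(mean_clear, std_clear, margin=0.0):
    """P(clearance < margin) under a Gaussian clearance prediction (Phi of z)."""
    z = (margin - mean_clear) / max(std_clear, 1e-6)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pick_control(candidates, risk_weight=10.0):
    """candidates: list of (goal_cost, mean_clearance_m, std_clearance_m).
    Returns the index minimizing goal cost plus weighted collision risk."""
    scores = [g + risk_weight * collision_risk(m, s) for g, m, s in candidates]
    return min(range(len(candidates)), key=scores.__getitem__)

# A slightly longer path with high predicted clearance beats a risky shortcut.
cands = [(1.0, 0.05, 0.10),   # short, but clearance barely above zero
         (1.5, 0.80, 0.10)]   # longer, clearly safe
best = pick_control(cands)    # selects index 1
```

Calibrated clearance uncertainty matters here: an overconfident (too small) std would make the risky shortcut look safe.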
|
| |
| 15:00-16:30, Paper TuI2I.380 |
| One Filter to Deploy Them All: Robust Safety for Quadrupedal Navigation in Unknown Environments |
|
| Lin, Albert | Stanford University |
| Peng, Shuang | University of Southern California |
| Bansal, Somil | Stanford University |
Keywords: Robot Safety, Legged Robots, Collision Avoidance, Hamilton-Jacobi Reachability Analysis
Abstract: As learning-based methods for legged robots rapidly grow in popularity, it is important that we can provide safety assurances efficiently across different controllers and environments. Existing works either rely on a priori knowledge of the environment and safety constraints to ensure system safety or provide assurances for a specific locomotion policy. To address these limitations, we propose an observation-conditioned reachability-based (OCR) safety-filter framework. Our key idea is to use an OCR value network (OCR-VN) that predicts the optimal control-theoretic safety value function for new failure regions and dynamic uncertainty during deployment time. Specifically, the OCR-VN facilitates rapid safety adaptation through two key components: a LiDAR-based input that allows the dynamic construction of safe regions in light of new obstacles and a disturbance estimation module that accounts for dynamic uncertainty in the wild. The predicted safety value function is used to construct an adaptive safety filter that overrides the nominal quadruped controller when necessary to maintain safety. Through simulation studies and hardware experiments on a Unitree Go1 quadruped in agile planar navigation tasks, we demonstrate that the proposed framework can automatically safeguard a wide range of hierarchical quadruped controllers, adapts to novel environments, and is robust to unmodeled dynamics without a priori access to the controllers or environments - hence, "One Filter to Deploy Them All."
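The safety-filter mechanism described above (override the nominal controller only when the safety value function drops) can be shown as a least-restrictive switch. The toy 1-D value function, margin, and fallback policy below are assumptions; in the paper the value function is predicted by the OCR-VN from LiDAR observations and disturbance estimates.

```python
def safety_filter(state, u_nominal, value_fn, safe_policy, eps=0.05):
    """Least-restrictive filter: keep the nominal command while the
    safety value stays above a margin eps, otherwise override."""
    if value_fn(state) > eps:
        return u_nominal
    return safe_policy(state)

# Toy 1-D example: safety value = distance to an obstacle at x = 1.0.
value_fn = lambda x: 1.0 - x
safe_policy = lambda x: -1.0                        # back away from the obstacle
u_free = safety_filter(0.20, +1.0, value_fn, safe_policy)   # nominal passes
u_near = safety_filter(0.99, +1.0, value_fn, safe_policy)   # override engages
```

Because the filter only inspects the value function, it wraps any nominal controller unchanged, which is the sense in which one filter can "deploy them all."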
|
| |
| 15:00-16:30, Paper TuI2I.381 |
| ActiveSPN: Active Soft Polyhedral Networks with Pose Estimation for In-Finger Object Manipulation |
|
| Li, Sen | Southern University of Science and Technology |
| Dong, Chengxiao | Southern University of Science and Technology |
| Song, Chaoyang | Mohamed Bin Zayed University of Artificial Intelligence |
| Wan, Fang | Southern University of Science and Technology |
Keywords: In-Hand Manipulation, Soft Sensors and Actuators
Abstract: Robotic grippers aim to replicate the remarkable functionalities of the human hand by providing advanced perception, adaptability, stability, and dexterity for complex tasks. Achieving these capabilities demands a sophisticated design hierarchy and robust perception mechanisms that ensure accurate manipulation. This paper introduces Active Soft Polyhedral Networks (ActiveSPN), a gripper design that leverages an active, non-biomimetic surface for precise in-hand manipulation. A vision system integrated directly into the fingers further facilitates accurate pose estimation of the in-finger object. The proposed system includes: (i) a soft polyhedral network featuring a transparent active belt to deliver complete three-dimensional adaptation and dexterous in-finger motion, and (ii) a generative learning-based pipeline for in-finger pose estimation. Experimental results demonstrate the ability of ActiveSPN to execute multi-degree-of-freedom in-finger manipulations, including two-axis rotation and one-axis translation. Moreover, the integrated vision-based pose estimation provides robust, real-time predictions, supporting consistent closed-loop control. Across diverse objects, the system achieves mean translational errors of 2.59 mm and rotational errors of 7 degrees, highlighting a promising paradigm for compact, efficient, and dexterous robotic manipulation. Codes are available at https://github.com/ancorasir/ActiveSPN.
|
| |
| 15:00-16:30, Paper TuI2I.382 |
| Transformer Driven Visual Servoing for Fabric Texture Matching Using Dual-Arm Manipulator |
|
| Tokuda, Fuyuki | Tohoku University |
| Seino, Akira | Centre for Transformative Garment Production |
| Kobayashi, Akinari | Centre for Transformative Garment Production |
| Tang, Kai | The University of Hong Kong |
| Kosuge, Kazuhiro | The University of Hong Kong |
Keywords: Industrial Robots, Visual Servoing, Dual Arm Manipulation
Abstract: In this paper, we propose a method to align and place a fabric piece on top of another using a dual-arm manipulator and a grayscale camera, so that their surface textures are accurately matched. We propose a novel control scheme that combines Transformer-driven visual servoing with dual-arm impedance control. This approach enables the system to simultaneously control the pose of the fabric piece and place it onto the underlying one while applying tension to keep the fabric piece flat. Our transformer-based network incorporates pre-trained backbones and a newly introduced Difference Extraction Attention Module (DEAM), which significantly enhances pose difference prediction accuracy. Trained entirely on synthetic images generated using rendering software, the network enables zero-shot deployment in real-world scenarios without requiring prior training on specific fabric textures. Real-world experiments demonstrate that the proposed system accurately aligns fabric pieces with different textures.
|
| |
| 15:00-16:30, Paper TuI2I.383 |
| Data-Driven Control Optimization on Frequency Response for Fast and Precise Motion of Flexible Joint Robots |
|
| Lee, Deokjin | Daegu Gyeongbuk Institute of Science and Technology |
| Song, Junho | Daegu Gyeongbuk Institute of Science and Technology |
| Oh, Sehoon | DGIST |
Keywords: Flexible Robotics, Motion Control, Optimization and Optimal Control
Abstract: This paper presents a data-driven control optimization framework for flexible joint robots (FJR) based on frequency response function (FRF) data, enabling automated controller synthesis without explicit model identification. Unlike conventional model-based approaches that rely on accurate parameter estimation, the proposed method directly utilizes measured FRF data and formulates the controller design as a convex optimization problem. The controller maximizes control bandwidth while ensuring stability across a wide range of configurations. Experimental validation on an FJR demonstrates superior tracking accuracy, vibration suppression, and robustness compared to model-based methods. Furthermore, a high-speed drumming task demonstrates the ability of the controller to handle repeated impacts and inertia variations, highlighting the potential of FRF-based control for the fast and precise operation of flexible robotic systems.
|
| |
| 15:00-16:30, Paper TuI2I.384 |
| Safety-Critical Reactive Motion Using Constrained Variable Admittance Control with Dual-Type Proximity Sensors |
|
| Moon, Seung Jae | Sungkyunkwan, Mechanical Engineering, Robottory |
| Yim, Hongsik | Sungkyunkwan University |
| Kang, Hyunchang | Sungkyunkwan University |
| Sim, Jaeyun | Sungkyunkwan University |
| Jung, Dawoon | Ajou University |
| Choi, Hyouk Ryeol | Sungkyunkwan University |
Keywords: Reactive and Sensor-Based Planning, Safety in HRI, Human-Centered Robotics
Abstract: We present a method that enhances the safety and responsiveness of robotic manipulators through constrained Variable Admittance Control (VAC) combined with proximity perception. Recent studies have demonstrated that manipulators equipped with proximity sensors can avoid close obstacles in real-time. However, unavoidable collisions still remain a significant challenge in human-robot interaction (HRI). As a safety fallback, conventional reactive motion algorithms aim to avoid obstacles, but they often avoid them inefficiently and do not account for collisions. Our approach integrates proximity-based pre-contact detection and VAC with QP-based motion constraints to proactively adjust the robot's impedance parameters while maintaining stable and controlled motion. By dynamically modulating stiffness and damping based on sensor feedback, the system improves both obstacle avoidance efficiency and smooth contact handling. Additionally, a passivity-preserving energy tank mechanism prevents instability caused by parameter variations, ensuring robust and adaptive behavior. Furthermore, experiments involving HRI demonstrate that the proposed method ensures both safe avoidance and smooth contact handling. These findings suggest that the proposed approach is well-suited for safety-critical applications in collaborative and industrial robotics.
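The proximity-driven modulation of stiffness and damping can be sketched with a simple interpolation rule. The specific gains, safety distance, and linear schedule below are assumptions for illustration; they are not the paper's QP-constrained VAC law or its passivity-preserving energy tank.

```python
def modulate_admittance(d, d_safe=0.3, k_max=500.0, c_max=50.0,
                        k_min=50.0, c_min=5.0):
    """Scale stiffness k (N/m) and damping c (Ns/m) down as an obstacle
    approaches, so the arm becomes compliant before contact.

    d: measured proximity distance (m); d_safe: distance beyond which
    the arm stays at full stiffness.
    """
    a = min(max(d / d_safe, 0.0), 1.0)   # 0 = touching, 1 = at/beyond d_safe
    k = k_min + a * (k_max - k_min)
    c = c_min + a * (c_max - c_min)
    return k, c

# Far from obstacles the arm is stiff; at contact distance it is soft.
k_far, c_far = modulate_admittance(1.0)    # full stiffness and damping
k_near, c_near = modulate_admittance(0.0)  # minimum stiffness and damping
```

In a full VAC implementation, abrupt parameter changes like this can inject energy, which is exactly what the energy tank mechanism in the abstract is there to guard against.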
|
| |
| 15:00-16:30, Paper TuI2I.385 |
| Visual Servoing-Based Active Vision for 3D Object Reconstruction |
|
| Misimi, Ekrem | SINTEF Ocean |
| Herland, Sverre | SINTEF Ocean |
| Chaumette, François | Inria Center at University of Rennes |
Keywords: Visual Servoing, RGB-D Perception, Sensor-based Control
Abstract: In this letter, we present a novel dual-task, closed-loop, visual servoing-based active vision framework in an eye-in-hand configuration. The proposed active vision framework continuously drives the camera motion by coupling continuous Next-Best-View (NBV) planning and visual servo control within a unified formulation, is NBV-objective-agnostic, and enables real-time, closed-loop exploration of objects. We demonstrate how this approach can be applied to the 3D reconstruction of static volumetric objects. The approach is validated in the real world with a diverse set of relevant objects, and we observe that the visual servo scheme produces smooth exploration trajectories that keep the camera focused on the object. We also show that our gradient-based continuous NBV strategy is highly competitive with baseline strategies that leverage global viewpoint sampling and results in efficient exploration with strong object coverage.
|
| |
| 15:00-16:30, Paper TuI2I.386 |
| Superelastic Tendon-Like Bowden Cables: Advancing Assistive Exoskeletons |
|
| Pisaneschi, Gregorio | University of Bologna |
| Catalán, José M. | Miguel Hernández University |
| Blanco, Andrea | Miguel Hernandez University |
| Sancisi, Nicola | University of Bologna |
| Garcia-Aracil, Nicolas | Universidad Miguel Hernandez De Elche |
| Zucchelli, Andrea | University of Bologna |
Keywords: Soft Robot Materials and Design, Biomimetics, Prosthetics and Exoskeletons
Abstract: This study introduces a novel Bowden cable (BC) system for hand-assistive exoskeletons employing superelastic (SE) shape memory alloy wires to address key efficiency and safety limitations. The unique properties of SE wires enable a single-wire transmission, offering enhanced performance, plus inherent self-sensing and self-limiting capabilities that provide tendon-like overload protection. Experimental results obtained with a setup simulating use conditions demonstrate the superior efficiency of SE wires, with 1/4 the friction of conventional steel cables. In addition, a validated force-sensing capability, achieved by monitoring electrical resistance, is shown to detect overload within 1% force error. This, along with the inherent passive force self-limiting behaviour during simulated collisions, demonstrates the ability of the SE BC to effectively mimic the protective function of biological tendons. Therefore, this biomimetic innovation in soft robotic transmission significantly improves safety and efficiency, presenting a promising advancement for human-robot interaction in assistive and rehabilitative robotics.
|
| |
| 15:00-16:30, Paper TuI2I.387 |
| SIGN: Safety-Aware Image-Goal Navigation for Autonomous Drones Via Reinforcement Learning |
|
| Yan, Zichen | National University of Singapore |
| Huang, Rui | National University of Singapore |
| He, Lei | National University of Singapore |
| Guo, Shao | National University of Singapore |
| Zhao, Lin | National University of Singapore |
Keywords: Vision-Based Navigation, Reinforcement Learning, Aerial Systems: Perception and Autonomy
Abstract: Image-goal navigation (ImageNav) tasks a robot with autonomously exploring an unknown environment and reaching a location that visually matches a given target image. While prior works primarily study ImageNav for ground robots, enabling this capability for autonomous drones is substantially more challenging due to their need for high-frequency feedback control and global localization for stable flight. In this paper, we propose a novel sim-to-real framework that leverages reinforcement learning (RL) to achieve ImageNav for drones. To enhance visual representation ability, our approach trains the vision backbone with auxiliary tasks, including image perturbations and future transition prediction, which results in more effective policy training. The proposed algorithm enables end-to-end ImageNav with direct velocity control, eliminating the need for external localization. Furthermore, we integrate a depth-based safety module for real-time obstacle avoidance, allowing the drone to safely navigate in cluttered environments. Unlike most existing drone navigation methods that focus solely on reference tracking or obstacle avoidance, our framework supports comprehensive navigation behaviors, including autonomous exploration, obstacle avoidance, and image-goal seeking, without requiring explicit global mapping.
|
| |
| 15:00-16:30, Paper TuI2I.388 |
| Mimir: Hierarchical Goal-Driven Diffusion with Uncertainty Propagation for End-To-End Autonomous Driving |
|
| Xing, Zebin | UCAS |
| Zheng, Yupeng | School of Artificial Intelligence, University of Chinese Academy of Sciences |
| Zhang, Qichao | Institute of Automation, Chinese Academy of Sciences |
| Ding, Zhixing | China University of Geosciences Beijing |
| Yang, Pengxuan | University of Chinese Academy of Sciences (UCAS) |
| Gu, Songen | Fudan University |
| Xia, Zhongpu | Baidu |
| Zhao, Dongbin | Chinese Academy of Sciences |
Keywords: Learning from Demonstration, Imitation Learning, Autonomous Vehicle Navigation
Abstract: End-to-end autonomous driving has emerged as a pivotal direction in the field of autonomous systems. Recent works have demonstrated impressive performance by incorporating high-level guidance signals to steer low-level trajectory planners. However, their potential is often constrained by inaccurate high-level guidance and the computational overhead of complex guidance modules. To address these limitations, we propose Mimir, a novel hierarchical dual-system framework capable of generating robust trajectories relying on goal points with uncertainty estimation: (1) Unlike previous approaches that model goal points deterministically, we estimate goal point uncertainty with a Laplace distribution to enhance robustness; (2) To overcome the slow inference speed of the guidance system, we introduce a multi-rate guidance mechanism that predicts extended goal points in advance. Validated on the challenging Navhard and Navtest benchmarks, Mimir surpasses previous state-of-the-art methods with a 20% improvement in the driving score EPDMS, while achieving a 1.6× improvement in high-level module inference speed without compromising accuracy. The code and models will be released soon to promote reproducibility and further development.
|
| |
| 15:00-16:30, Paper TuI2I.389 |
| Real-Time Geometric-Registration-Based Precision Localization for Autonomous Docking in Unstructured Factory Environment |
|
| Chinchilla Gutierrez, Sebastian Fernando | Toyota Motor East Japan, Inc |
| Watanabe, Manaru | Toyota Motor East Japan, Inc |
| Ooyama, Masahiro | Toyota Motor East Japan, Inc |
| Yamada, Takayuki | Toyota Motor East Japan, Inc |
| Yamada, Tomoaki | Toyota Motor East Japan, Inc |
| Toshiki, Naoto | Toyota Motor East Japan, Inc |
| Yamane, Satsuki | Toyota Motor East Japan, Inc |
| Salazar Luces, Jose Victorio | Tohoku University |
| Ravankar, Ankit A. | Tohoku University |
| Hirata, Yasuhisa | Tohoku University |
Keywords: Autonomous Vehicle Navigation, Intelligent Transportation Systems, Intelligent and Flexible Manufacturing
Abstract: In factory distribution processes, autonomous mobile robots must dock precisely at base stations. However, this task is challenging due to the dynamic and unstructured nature of factory environments, as well as the sparse point clouds caused by sensor occlusions and distance limitations. To address these challenges, we propose a geometric registration approach designed to handle sparse point clouds in changing, unstructured settings. Our method utilizes the Hough transform to detect lines, describes the point cloud based on the relationships between these lines, filters out lines that do not correspond to the geometric features of the target base station, and estimates the pose of both the station and the robot using global registration techniques. We evaluated our system in four typical factory scenarios across 72 trials. Results show the robot achieved docking accuracy within ±5.06 mm and ±1.11°, with a 100% success rate in docking and correctly identifying the target cart from surrounding objects. This represents a 70% reduction in errors and an 86% increase in success rate compared to existing methods.
|
| |
| 15:00-16:30, Paper TuI2I.390 |
| Self-Supervised Underwater Monocular Depth Estimation Informed by Multi-Physics Processes |
|
| Xiao, Fengqi | Tsinghua University |
| Qu, Juntian | Tsinghua University |
Keywords: RGB-D Perception, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: Depth information is crucial for underwater robotic detection and navigation tasks. However, the underwater imaging environment is complex and variable. The images captured by robots are typically sequences or videos with uniform scene content, and ground-truth depth is difficult to obtain. This challenge hinders the generalization of existing self-supervised monocular depth estimation (SMDE) schemes for practical underwater detection applications. To address this issue, we propose an SMDE method for underwater images informed by the physical process of optical degradation. Specifically, we developed a further degradation process for underwater images, which constrains the image restoration process to solve for the attenuation coefficient and depth map, and combined it with an ego-motion based framework to form a self-supervised learning closed loop. Guided by inherent optical properties, this closed loop can learn depth cues from the underwater image formation model and the geometric relationships involved in view transformation. Experiments demonstrate that the proposed method outperforms existing techniques and generalizes well across different underwater scenes, reducing RMSE by about 9.1% and improving threshold accuracy by about 3.5% compared with the SOTA method, and can adapt to various underwater robot detection scenarios.
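The underwater image formation model that such physics-informed methods typically build on can be written as a transmission-weighted blend of scene radiance and veiling light. The sketch below shows that standard model only; the per-channel values and the single attenuation coefficient are simplifying assumptions, not the paper's full multi-physics degradation process.

```python
import numpy as np

def underwater_degrade(J, depth, beta, A):
    """Simplified underwater image formation model:
        I = J * t + A * (1 - t),  with transmission t = exp(-beta * depth).

    J:     clear scene radiance per channel
    A:     background (veiling) light per channel
    beta:  attenuation coefficient (1/m), depth: scene distance (m)
    """
    t = np.exp(-beta * depth)
    return J * t + A * (1.0 - t)

# At zero distance the image is unattenuated; far away it tends to the
# veiling light A, which is why distant underwater scenes look washed out.
J = np.array([0.8, 0.6, 0.4])
A = np.array([0.1, 0.3, 0.5])
near = underwater_degrade(J, depth=0.0, beta=0.2, A=A)
far = underwater_degrade(J, depth=100.0, beta=0.2, A=A)
```

Because depth enters this model through the transmission term, constraining image restoration with it couples the estimated depth map to the observed color degradation, closing the self-supervised loop described in the abstract.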
|
| |
| 15:00-16:30, Paper TuI2I.391 | Add to My Program |
| Twisted String Actuation Module for Compact Robotic Finger with Extended Stroke, Reduced Hysteresis, and Bidirectional Operation |
|
| Lee, Chunghyeon | Sogang University |
| Suthar, Bhivraj | IIT Jodhpur |
| Jeong, Seokhwan | Mechanical Eng., Sogang University |
Keywords: Mechanism Design, Tendon/Wire Mechanism, Actuation and Joint Mechanisms
Abstract: Twisted String Actuators (TSAs) are promising alternatives to conventional gear-based transmissions due to their high reduction ratios and compact form factors. However, practical limitations such as nonlinear hysteresis, limited stroke, and inherently unidirectional motion hinder their deployment in robotic systems. In this work, we propose a novel bidirectional TSA mechanism that addresses all three limitations simultaneously through an antagonistic configuration, asymmetric axis shift (AAS), and pre-tension tuning. This mechanism enables reliable bidirectional actuation by compensating for asymmetric contraction-extension behavior, suppresses hysteresis via adaptive tensioning, and extends the effective stroke. We implement the proposed design in a continuum finger module and derive a compact kinematic model for control. Extensive experiments validate the effectiveness of the approach, demonstrating the attenuation of the hysteresis, accurate bidirectional bending control across a wide range (±180◦), and the feasibility of integration into multi-finger grippers for dexterous manipulation. The results suggest that the proposed actuator design serves as a practical and scalable solution for compact robotic systems requiring precise and reversible motion.
|
| |
| 15:00-16:30, Paper TuI2I.392 | Add to My Program |
| Multi-Modal Sensing in Colonoscopy: A Data-Driven Approach |
|
| Del Bono, Viola | Boston University |
| Capaldi, Emma | Boston University |
| Kelshiker, Anushka | Boston University |
| Aktas, Ayhan | Boston University |
| Aihara, Hiroyuki | Brigham and Women's Hospital |
| Russo, Sheila | Boston University |
Keywords: Soft Sensors and Actuators, Force and Tactile Sensing, Modeling, Control, and Learning for Soft Robots
Abstract: Soft optical sensors hold potential for enhancing minimally invasive procedures like colonoscopy, yet their complex, multi-modal responses pose significant challenges. This work introduces a machine learning (ML) framework for real-time estimation of 3D shape and contact force in a soft robotic sleeve for colonoscopy. To overcome limitations of manual calibration and collect large datasets for ML, we developed an automated platform for collecting data across a range of orientations, curvatures, and contact forces. A cascaded ML architecture was implemented for sequential estimation of contact force and 3D shape, achieving errors of 4.7% for curvature, 2.37% for orientation, and 5.5% for force tracking. We also explored the potential of ML for contact localization by training a model to estimate contact intensity and location across 16 indenters distributed along the sleeve. The force intensity was estimated with an error ranging from 0.06 N to 0.31 N throughout the indenters. Despite the proximity of the contact points, the system achieved high localization performance, with 8 indenters reaching over 80% accuracy, demonstrating promising spatial resolution.
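The cascaded idea (estimate force first, then feed it alongside the raw signals into the shape estimator) can be sketched with two chained linear regressors on synthetic data (linear least squares stands in for the paper's learned models; all data here is fabricated for illustration):

```python
import numpy as np

def fit_linear(X, Y):
    """Least-squares linear map with a bias column."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def predict(W, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ W

rng = np.random.default_rng(0)
signals = rng.normal(size=(500, 6))                       # raw optical readings
force = signals @ rng.normal(size=(6, 1)) * 0.3           # synthetic contact force
shape = np.hstack([signals, force]) @ rng.normal(size=(7, 2)) * 0.2

# Stage 1: signals -> force.  Stage 2: [signals, predicted force] -> shape.
W1 = fit_linear(signals, force)
W2 = fit_linear(np.hstack([signals, predict(W1, signals)]), shape)
shape_hat = predict(W2, np.hstack([signals, predict(W1, signals)]))
```

The cascade lets the second stage condition on the inferred force, mirroring the sequential force-then-shape estimation described above.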
|
| |
| 15:00-16:30, Paper TuI2I.393 | Add to My Program |
| Distributed Collision-Free Control of MASs by Combining Reinforcement Learning with Filtered Position Barrier Certificates and Applications (I) |
|
| Qi, Qihan | Sichuan University |
| Lin, Hai | Sichuan University |
| Yang, Xinsong | Sichuan University |
| Sun, Yaping | Sichuan University |
| Ju, Xingxing | Sichuan University |
| Yu, Wenwu | Southeast University |
| |
| 15:00-16:30, Paper TuI2I.394 | Add to My Program |
| A Two-Stage Payload Dynamic Parameter Identification Method for Interactive Industrial Robots with Large Components (I) |
|
| Liu, Mingxuan | Nanjing University of Aeronautics and Astronautics |
| Li, Pengcheng | Nanjing University of Aeronautics and Astronautics |
| Duan, Jinjun | Nanjing University of Aeronautics and Astronautics |
| Liu, Lunqian | Shanghai Aircraft Manufacturing Co |
| Shen, Ye | Shanghai Aerospace Electronic Technology Research Institute |
| Tian, Wei | Nanjing University of Aeronautics and Astronautics |
| Ji, Yuqi | Shanghai Aircraft Manufacturing Co |
Keywords: Calibration and Identification, Industrial Robots
Abstract: Taking human-robot collaborative assembly as an example, the methods based on contact forces can improve the assembly efficiency of industrial robots with large components in industrial manufacturing. However, due to the large size, high payload, stringent assembly accuracy requirements, and dynamic changes in grip position, accurately estimating the contact forces between the payload and the operator becomes challenging when handling these large components. In this paper, a two-stage method is proposed for payload dynamic parameter identification. The parameter identification equation in the sensor coordinate system is initially established. Furthermore, the identification model of recursive restricted total least squares (RRTLS) based on total least squares (TLS) is constructed to achieve low-consumption online identification. According to the assembly requirements and payload characteristics, the posture coordinate system is designed for safety, including the feasible workspace for the robot. Subsequently, the static identification postures and dynamic excitation trajectory are planned to obtain static values and dynamic inertial parameters. In the end, a high-payload human-robot collaborative assembly system is built to validate the proposed method. Experimental results show that compared with the existing methods, the proposed approach can effectively identify and compensate for the payload, leading to more accurate external force sensing.
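For context, online parameter identification of this kind typically builds on recursive least squares; the sketch below shows that recursive baseline (the paper's RRTLS additionally handles regressor noise via total least squares and restricts the parameter set — that machinery is not reproduced here, and the "payload parameters" are hypothetical):

```python
import numpy as np

def rls_update(theta, P, phi, y, lam=1.0):
    """One recursive least-squares step: gain K, parameter correction
    from the prediction error, covariance update with forgetting lam."""
    K = P @ phi / (lam + phi @ P @ phi)
    theta = theta + K * (y - phi @ theta)
    P = (P - np.outer(K, phi) @ P) / lam
    return theta, P

rng = np.random.default_rng(1)
true_theta = np.array([2.0, -1.0, 0.5])     # hypothetical inertial parameters
theta, P = np.zeros(3), np.eye(3) * 100.0   # vague prior
for _ in range(300):
    phi = rng.normal(size=3)                # regressor from motion data
    y = phi @ true_theta + 0.01 * rng.normal()
    theta, P = rls_update(theta, P, phi, y)
```

Each update costs O(n²), which is what makes "low-consumption online identification" feasible during assembly.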
|
| |
| 15:00-16:30, Paper TuI2I.395 | Add to My Program |
| Development of a 7-DOF Position-Orientation Decoupled Microsurgical Robot with Motorized Instruments for Microvascular Anastomosis |
|
| Long, Dunfa | Tianjin University |
| Shaoan, Chen | Tianjin University |
| Ao, Shuai | Tianjin University |
| Zhang, Zhi-Qiang | University of Leeds |
| Hu, Chengzhi | Southern University of Science and Technology |
| Shi, Chaoyang | Tianjin University |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy
Abstract: This work introduces a novel compact 7-degree-of-freedom (7-DOF) microsurgical robot with position-orientation decoupling capacity for microvascular anastomosis. The proposed system employs a modular architecture combining a proximal displacement platform for 3D small-stroke translation and a distal compact remote center of motion (RCM) mechanism for wide-range orientation adjustment. This design meets the workspace requirements for microvascular anastomosis, requiring extensive orientation adjustments with minimal positional movement and reducing the system footprint. The parasitic motion reverse self-compensation method has been developed for motorized surgical instruments, effectively reducing operational resistance to improve precision. Theoretical analysis has been performed on both the RCM mechanism and motorized surgical instruments, and kinematics-based parameter optimization and data-driven calibration have been conducted to enhance superior performance. A prototype has been constructed, and its experimental validation demonstrated that the system achieved repeatability of 11.24 ± 2.31 μm (XY) and 12.46 ± 4.48 μm (YZ), and absolute positioning accuracy of 29.80 ± 12.27 μm (XY) and 37.02 ± 19.47 μm (YZ), meeting super-microsurgical requirements. Experiments that include needle-threading and stamen peeling tasks demonstrate the robot's superior dexterity and manipulation capabilities.
|
| |
| 15:00-16:30, Paper TuI2I.396 | Add to My Program |
| A Phase-Change-Material-Based Variable Stiffness Sheath Inspired by a Multi-Layer Wave Spring Structure for Flexible Upper Gastrointestinal Endoscopic Robots |
|
| Song, Dezhi | Tianjin University |
| Luo, Xiangyu | Tianjin University |
| Yu, Xiangyang | Tianjin Hospital of ITCWM/Tianjin Nankai Hospital |
| Zhang, Bo | Waseda University |
| Yang, Zhengbao | Hong Kong University of Science and Technology |
| Hu, Chengzhi | Southern University of Science and Technology |
| Shi, Chaoyang | Tianjin University |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems, Tendon/Wire Mechanism
Abstract: Continuum robots employed in flexible gastrointestinal endoscopy require the capability of transitioning between the flexible and the rigid states. Phase-change-material-based variable stiffness (VS) methods exhibit a significant stiffness change ratio but are typically time-consuming. Besides, these materials are commonly fabricated into simplistic cylindrical or tubular structures and subsequently integrated with continuum joints, overlooking the impact of the intrinsic structural characteristics of the VS module on stiffness modulation and bending performance. To maintain the combination of motion flexibility and operation stability, this work presents a stiffness-tunable sheath inspired by a multi-layer wave spring structure, which is fabricated utilizing thermoplastic material. A water-based active heating/cooling method is employed, wherein the circulation of hot/cold water through silicone tubes helically wound around the exterior of the VS sheath enables rapid thermal regulation. Structural parameter selection of the VS sheath based on the orthogonal design method has been performed to enhance its stiffness in a rigid state and reduce the maximum stress during 90° flexion in a flexible state. Experimental results indicate that the proposed VS sheath can achieve a stiffness change ratio of up to 16.5 times within 30 s. After being integrated with a continuum joint, the sheath demonstrates an average positioning error of 1.48 mm within a ±90° bending range in a flexible state, without structural compromise or interference with the continuum joint’s bending. In the rigid state, the proposed design can resist a 400 g external payload with a deflection of less than 6 mm. The efficacy of this design has been validated through ex-vivo experiments conducted on a porcine stomach.
|
| |
| 15:00-16:30, Paper TuI2I.397 | Add to My Program |
| Understanding Lidar Variability: A Dataset and Comparative Study Featuring Dome-Shaped, Solid-State, and Spinning Lidars |
|
| Doumegna, Mawuto Koudjo Felix | Fudan University & University of Turku |
| Yu, Xianjia | University of Turku |
| Zhang, Jiaqiang | University of Turku |
| Ha, Sier | University of Turku |
| Zou, Zhuo | Fudan University |
| Westerlund, Tomi | University of Turku |
Keywords: Data Sets for SLAM, Localization, SLAM
Abstract: Lidar technology has been widely employed across various applications, such as robot localization in GNSS-denied environments and 3D reconstruction. Recent advancements have introduced different lidar types, including cost-effective solid-state lidars such as the Livox Avia and Mid-360. The Mid-360, with its dome-like design, is increasingly used in portable mapping and unmanned aerial vehicle (UAV) applications due to its low cost, compact size, and reliable performance. However, the lack of datasets that include dome-shaped lidars, such as the Mid-360, alongside other solid-state and spinning lidars, significantly hinders the comparative evaluation of novel approaches across platforms. Additionally, performance differences between low-cost solid-state and high-end spinning lidars (e.g., Ouster OS series) remain insufficiently examined, particularly for odometry without an Inertial Measurement Unit (IMU). To address this gap, we introduce a novel dataset comprising data from multiple lidar types, including the low-cost Livox Avia and the dome-shaped Mid-360, as well as high-end spinning lidars such as the Ouster series. Notably, to the best of our knowledge, no existing dataset comprehensively includes dome-shaped lidars such as Mid-360 alongside both other solid-state and spinning lidars. In addition to the dataset, we provide a benchmark evaluation of state-of-the-art SLAM algorithms applied to this diverse sensor data. Furthermore, we present a quantitative analysis of point cloud registration techniques, specifically point-to-point, point-to-plane, and hybrid methods, using indoor and outdoor data collected from the included lidar systems. The outcomes of this study establish a foundational reference for future research in SLAM and 3D reconstruction across heterogeneous lidar platforms.
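The two registration residuals compared in this benchmark differ only in how they measure correspondence error; a minimal sketch (correspondences assumed given, synthetic planar data for illustration):

```python
import numpy as np

def p2p_error(src, dst):
    """Point-to-point residual: mean Euclidean distance between
    corresponding points (the quantity classic ICP minimizes)."""
    return np.linalg.norm(src - dst, axis=1).mean()

def p2plane_error(src, dst, dst_normals):
    """Point-to-plane residual: mean absolute distance along the
    destination surface normal, which ignores in-plane sliding."""
    return np.abs(np.einsum('ij,ij->i', src - dst, dst_normals)).mean()

rng = np.random.default_rng(0)
dst = rng.uniform(-1, 1, (100, 3))
dst[:, 2] = 0.0                                  # reference points on z = 0
normals = np.tile([0.0, 0.0, 1.0], (100, 1))
src = dst + [0.05, 0.0, 0.02]                    # query scan shifted off-plane

e_pt = p2p_error(src, dst)                       # penalizes the full offset
e_pl = p2plane_error(src, dst, normals)          # penalizes only the 0.02 m lift
```

This is why point-to-plane variants typically converge faster on structured scenes: tangential offsets along flat surfaces are not penalized.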
|
| |
| 15:00-16:30, Paper TuI2I.398 | Add to My Program |
| OsmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation Via Semantic Maps and Large Language Models Reasoning |
|
| Xie, Fujing | Shanghaitech University |
| Schwertfeger, Sören | ShanghaiTech University |
| Blum, Hermann | Uni Bonn / Lamarr Institute |
Keywords: Semantic Scene Understanding, AI-Enabled Robotics, Vision-Based Navigation
Abstract: Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features, achieving a high level of detail and guiding robots to find objects specified by open-vocabulary language queries. While the issue of scalability for such approaches has received some attention, another fundamental problem is that high-detail object mapping quickly becomes outdated, as objects are frequently moved around. In this work, we develop a mapping and navigation system for object-goal navigation that, from the ground up, considers the possibilities that a queried object can have moved, or may not be mapped at all. Instead of striving for high-fidelity mapping detail, we consider that the main purpose of a map is to provide environment grounding and context, which we combine with the semantic priors of LLMs to reason about object locations and deploy an active, online approach to navigate to the objects. Through simulated and real-world experiments we find that our approach tends to have higher retrieval success at shorter path lengths for static objects and by far outperforms prior approaches in cases of dynamic or unmapped object queries.
|
| |
| 15:00-16:30, Paper TuI2I.399 | Add to My Program |
| A Continuum Robot with Programmable Tendon Routing for Desired Curve Generation |
|
| Zhou, Kehong | Southeast University |
| Zhu, Lifeng | Southeast University |
| Wu, Jinfeng | Southeast University |
| Wen, Yurui | Southeast University |
| Song, Aiguo | Southeast University |
| Zhang, Jessica | Carnegie Mellon University |
Keywords: Tendon/Wire Mechanism, Soft Robot Applications, Flexible Robotics
Abstract: Continuum robots are widely employed in spatially constrained environments with narrow passages. However, achieving general curves with non-constant curvature remains challenging, as existing systems typically rely on multiple flexible segments arranged in series, coupled with complex drive systems requiring numerous actuators. This paper proposes a novel continuum robot design that features a programmable wire layout capable of simulating desired curves. The system integrates modular joints and a single-actuator drive unit, enabling the generation of spatial curves with non-constant curvature. By strategically designing the arrangement of modular joints to control rotational direction and angular deflection at each joint, the system achieves a substantially expanded design workspace compared to conventional continuum robots. Simulation and prototype experiments validate the proposed design method. The relative mean distance between the simulated and desired curves remains below 3.12%, while the prototype demonstrates a relative mean distance of 6.67% from the desired curve. This approach offers a promising pathway to advance continuum robots by improving configurational adaptability while simultaneously achieving complex curve generation and reduced drive system complexity.
|
| |
| 15:00-16:30, Paper TuI2I.400 | Add to My Program |
| Cosserat Rods with Cross-Sectional Deformation for Soft Robot Modeling |
|
| Tobin, Samuel | The University of Tennessee |
| Gaston, Joshua | The University of Tennessee, Knoxville |
| Aloi, Vincent | Worcester Polytechnic Institute |
| Barth, Eric J. | Vanderbilt University |
| Rucker, Caleb | University of Tennessee |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Materials and Design
Abstract: Cosserat rod models are widely used to simulate, design, and control soft robots. The Cosserat framework accounts for bending, torsion, transverse shear, and elongation of a long, slender structure and correctly handles large rotations and deflections in 3D, while being far less computationally expensive than full 3D elasticity models using finite elements. However, the Cosserat model is not always appropriate for soft robotic structures since it assumes the cross sections never change size or shape. In this letter, we extend the standard Cosserat rod model to include cross-sectional deformation while retaining much of its simplicity. We add to the Cosserat model additional degrees of freedom that parameterize stretch and shear in the cross-sectional plane and their rates of change along the rod length. We then formulate several possible constitutive laws on the state variables (one linear and one non-linear) and compare them to the standard Cosserat energy expressions to gain insight. We further show how fluidic actuation and tendon actuation can be incorporated into the model, and we compare the extended Cosserat models to 3D nonlinear finite-element simulations with good agreement. Finally, we demonstrate use of this model in a robotics context to control the path-following gait of a peristaltic worm-inspired soft robot.
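The standard rod kinematics that this letter extends can be integrated in a few lines: position and orientation evolve along arc length as p' = R v and R' = R [u]×, with the cross-sections treated as rigid — exactly the assumption the paper relaxes with extra in-plane stretch/shear states. A minimal constant-strain sketch (first-order integration, illustrative parameters):

```python
import numpy as np

def integrate_rod(u, v, L=1.0, n=200):
    """Integrate standard Cosserat rod kinematics p' = R v, R' = R [u]_x
    along arc length for constant strains u (curvature/torsion) and
    v (shear/elongation)."""
    ds = L / n
    p, R = np.zeros(3), np.eye(3)
    hat = lambda w: np.array([[0.0, -w[2], w[1]],
                              [w[2], 0.0, -w[0]],
                              [-w[1], w[0], 0.0]])
    for _ in range(n):
        p = p + R @ v * ds
        R = R @ (np.eye(3) + hat(u) * ds)   # first-order rotation update
        U, _, Vt = np.linalg.svd(R)
        R = U @ Vt                          # re-orthonormalize against drift
    return p, R

# Constant curvature pi about y with unit stretch: a semicircular arc
# of radius 1/pi, so the tip lands at x = 2/pi, y = z = 0.
p_end, R_end = integrate_rod(u=np.array([0.0, np.pi, 0.0]),
                             v=np.array([0.0, 0.0, 1.0]))
```

The paper's extension adds cross-sectional stretch/shear states (and their arc-length rates) to this state vector, together with constitutive laws coupling them to the loads.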
|
| |
| 15:00-16:30, Paper TuI2I.401 | Add to My Program |
| Smooth and Robust Trajectory Tracking of Single-Actuator Monocopters Via Incremental Nonlinear Dynamic Inversion |
|
| Tang, Emmanuel | Singapore University of Technology & Design |
| Cai, Xinyu | Singapore University of Technology and Design |
| Lee, Shawndy Michael | Singapore University of Technology and Design |
| Foong, Shaohui | Singapore University of Technology and Design |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy
Abstract: This letter presents a comprehensive comparative study of Incremental Nonlinear Dynamic Inversion (INDI) and standard Nonlinear Dynamic Inversion (NDI) for smooth trajectory tracking on Samara Seed-Inspired Single-Actuator Monocopters (SAM). While prior work on SAMs has largely focused on hover stabilization, smooth robust control for aggressive translational motion remains a largely uncharted frontier. Leveraging the precession-prone dynamics inherent to the SAM, we analyze the tracking performance of INDI across varying flight speeds, trajectories, and wing morphologies (long, short, ultralight). Our experiment results demonstrate that INDI on the long-wing consistently achieves lower angular acceleration tracking errors, reducing mean and RMS by up to 13.8% and 13.0%, respectively, while also improving motor efficiency with up to 8.4% less PWM usage compared to NDI. Additionally, INDI produces tighter and more stable body yaw rates (± 0.1 Hz) and delivers up to 65% improvement in position tracking over traditional purely attitude control (ATT). Finally, even under severe actuation constraints with an ultralight-wing operating at reduced thrust, INDI maintains robust performance, validating its resistance to precession and robust control of highly under-actuated SAMs.
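The key structural difference between INDI and NDI is that INDI commands only an *increment* on the last control, computed from the measured (not modeled) acceleration, which cancels unmodeled disturbances. A minimal 1-DOF sketch under assumed effectiveness B (illustrative numbers, not the SAM flight code):

```python
import numpy as np

def indi_update(u_prev, acc_cmd, acc_meas, B):
    """One INDI step: u = u_prev + B^{-1} (acc_cmd - acc_meas).
    Only the control-effectiveness matrix B must be modeled; the rest
    of the dynamics enters through the measured acceleration."""
    return u_prev + np.linalg.solve(B, acc_cmd - acc_meas)

B = np.array([[2.0]])                      # assumed control effectiveness
u, acc = np.array([0.0]), np.array([0.0])
for _ in range(20):
    u = indi_update(u, np.array([1.0]), acc, B)
    acc = B @ u - 0.5                      # plant with an unmodeled constant bias
```

Despite the -0.5 disturbance never appearing in the controller's model, the measured-acceleration feedback drives the tracking error to zero, which is the robustness property exploited here against precession.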
|
| |
| 15:00-16:30, Paper TuI2I.402 | Add to My Program |
| SLIM-VDB: A Real-Time 3D Probabilistic Semantic Mapping Framework |
|
| Sheppard, Anja | University of Michigan |
| Ewen, Parker | University of Michigan |
| Wilson, Joseph | University of Michigan |
| Venkatramanan Sethuraman, Advaith | University of Michigan |
| Adewole, Benard | University of Michigan |
| Li, Anran | University of Michigan |
| Chen, Yuzhen | University of Michigan |
| Vasudevan, Ram | University of Michigan |
| Skinner, Katherine | University of Michigan |
Keywords: Engineering for Robotic Systems, Mapping, Probabilistic Inference
Abstract: This paper introduces SLIM-VDB, a new lightweight semantic mapping system with probabilistic semantic fusion for closed-set or open-set dictionaries. Advances in data structures from the computer graphics community, such as OpenVDB, have demonstrated significantly improved computational and memory efficiency in volumetric scene representation. Although OpenVDB has been used for geometric mapping in robotics applications, semantic mapping for scene understanding with OpenVDB remains unexplored. In addition, existing semantic mapping systems lack support for integrating both fixed-category and open-language label predictions within a single framework. In this paper, we propose a novel 3D semantic mapping system that leverages the OpenVDB data structure and integrates a unified Bayesian update framework for both closed- and open-set semantic fusion. Our proposed framework, SLIM-VDB, achieves significant reduction in both memory and integration times compared to current state-of-the-art semantic mapping approaches, while maintaining comparable mapping accuracy. An open-source C++ codebase with a Python interface will accompany the paper release.
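The per-voxel Bayesian update underlying closed-set semantic fusion can be written compactly in log space (a generic recursive categorical fusion sketch, not the SLIM-VDB code; class names and observation probabilities are fabricated):

```python
import numpy as np

def fuse_semantics(log_prior, log_likelihoods):
    """Recursive Bayesian label fusion for one voxel: accumulate
    per-observation log-likelihoods over a closed-set dictionary,
    then renormalize with log-sum-exp for numerical stability."""
    log_post = log_prior + np.sum(log_likelihoods, axis=0)
    log_post -= np.logaddexp.reduce(log_post)
    return np.exp(log_post)

classes = ["floor", "wall", "chair"]
prior = np.log(np.full(3, 1 / 3))
obs = np.log(np.array([[0.7, 0.2, 0.1],     # three noisy per-frame predictions
                       [0.6, 0.3, 0.1],
                       [0.5, 0.4, 0.1]]))
post = fuse_semantics(prior, obs)           # posterior sharpens toward "floor"
```

For the open-set case, the same update can run over similarity scores against language embeddings rather than fixed-category softmax outputs, which is the unification the paper targets.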
|
| |
| 15:00-16:30, Paper TuI2I.403 | Add to My Program |
| Tracing Energy Flow: Learning Tactile-Based Grasping Force Control to Reduce Slippage in Dynamic Object Interaction |
|
| Kuo, Cheng-Yu | Nara Institute of Science and Technology |
| Shin, Hirofumi | Toyota Motor Corporation |
| Matsubara, Takamitsu | Nara Institute of Science and Technology |
Keywords: Model Learning for Control, Perception for Grasping and Manipulation, Grasping
Abstract: Regulating grasping force to reduce slippage during dynamic object interaction remains a fundamental challenge in robotic manipulation, especially when objects are manipulated by multiple rolling contacts, have unknown properties (such as mass or surface conditions), and when external sensing is unreliable. In contrast, humans can quickly regulate grasping force by touch, even without visual cues. Inspired by this ability, we aim to enable robotic hands to rapidly explore objects and learn tactile-driven grasping force control under motion and limited sensing. We propose a physics-informed energy abstraction that models the object as a virtual energy container. The inconsistency between the fingers’ applied power and the object’s retained energy provides a physically grounded signal for inferring slip-aware stability. Building on this abstraction, we employ model-based learning and planning to efficiently model energy dynamics from tactile sensing and perform real-time grasping force optimization. Experiments in both simulation and hardware demonstrate that our method can learn grasping force control from scratch within minutes, effectively reduce slippage, and extend grasp duration across diverse motion-object pairs, all without relying on external sensing or prior object knowledge.
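The "virtual energy container" signal can be illustrated numerically: integrate the power injected through the contacts and compare it with the energy the object actually retains; a persistent gap indicates energy lost to slipping (a toy 1D lifting example with hypothetical quantities, not the paper's learned model):

```python
import numpy as np

def energy_gap(power, m, v_obj, h_obj, dt, g=9.81):
    """Cumulative injected contact energy minus the change in the
    object's retained (kinetic + potential) energy."""
    injected = np.cumsum(power) * dt
    retained = 0.5 * m * v_obj**2 + m * g * h_obj
    return injected - (retained - retained[0])

dt, m, g = 0.01, 0.2, 9.81
t = np.arange(200) * dt
v = 0.1                                        # steady lift speed [m/s]
h, v_obj = v * t, np.full_like(t, v)

grip_ok = energy_gap(m * g * v_obj, m, v_obj, h, dt)          # fingers move with the object
slipping = energy_gap(m * g * 1.5 * v_obj, m, v_obj, h, dt)   # fingers slide 1.5x faster
```

In the no-slip case the gap stays near zero (up to discretization), while sliding contacts inject power the object never retains, producing a growing, physically grounded slip signal.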
|
| |
| 15:00-16:30, Paper TuI2I.404 | Add to My Program |
| VDS-Nav: Volumetric Depth-Based Safe Navigation for Aerial Robots–Bridging the Sim-To-Real Gap |
|
| Dang, Van Huyen | Paderborn University |
| Redder, Adrian | Paderborn University |
| Pham, Huy | Aarhus University |
| Sarabakha, Andriy | Aarhus University |
| Kayacan, Erdal | Paderborn University |
Keywords: Vision-Based Navigation, Aerial Systems: Applications, Reinforcement Learning
Abstract: End-to-end navigation via deep reinforcement learning has become a key approach for vision-based tasks. However, the sim-to-real gap remains a challenge, especially for aerial robots, where policies trained in simulation often fail in real-world environments. In this work, we propose a novel navigation paradigm -- volumetric depth-based safe navigation (VDS-Nav), which trains a policy to infer linear velocities and yaw rate directly from a sequence of depth images, bypassing the need for a pre-trained latent space encoder. We enhance safety with a depth-based reward design, enabling the seamless incorporation of system constraints via logarithmic barrier function methods. Most importantly, using explicit sensor information in our reward design leads to seamless sim-to-real transfer by strengthening the correlation between state-action pairs and received rewards. To evaluate the effectiveness of VDS-Nav, we compare it to a baseline that first trains a variational autoencoder to encode depth images into a latent space for policy training. The simulation results show that VDS-Nav outperforms the baseline in terms of success rate. Furthermore, real-world experiments validate the policy, with real-time performance closely matching simulation results, suggesting an effective sim-to-real transfer.
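The log-barrier idea for depth-based safety can be sketched directly: the shaping term decays toward negative infinity as the closest visible obstacle approaches a safety distance (a generic sketch of log-barrier constraint handling; `d_safe` and `weight` are hypothetical, not the paper's exact reward):

```python
import numpy as np

def safety_reward(depth_img, d_safe=0.5, weight=0.1):
    """Depth-based safety shaping: a logarithmic barrier on the
    minimum observed depth. Returns -inf once the safety distance
    is violated, so the constraint is hard at the boundary."""
    d_min = float(depth_img.min())
    if d_min <= d_safe:
        return -np.inf
    return weight * np.log(d_min - d_safe)

far = np.full((4, 4), 3.0)                 # all obstacles far away
near = far.copy()
near[0, 0] = 0.6                           # one pixel close to the limit
```

Because the penalty is computed from the raw depth sensor itself, the reward correlates with what the policy will actually observe on hardware, which is the sim-to-real argument made above.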
|
| |
| 15:00-16:30, Paper TuI2I.405 | Add to My Program |
| LSV-Loc: LiDAR to Street View Image Crossmodal Localization |
|
| Lee, Sangmin | Korea Advanced Institute of Science and Technology |
| Choi, Donghyun | Korea Advanced Institute of Science and Technology |
| Ryu, Jee-Hwan | Korea Advanced Institute of Science and Technology |
Keywords: Localization, Autonomous Vehicle Navigation, Range Sensing
Abstract: Accurate global localization remains a fundamental challenge in autonomous vehicle navigation, especially in previously unexplored areas lacking prior map information. Traditional methods typically rely on high-definition (HD) maps generated through prior traversals or utilize auxiliary sensors such as a global positioning system (GPS). However, the above approaches are often limited by high costs, scalability issues, and decreased reliability in environments where GPS is unavailable. Moreover, prior methods require that both query and reference data originate from the same sensor modality, restricting generalization across different sensor types. To address these limitations, we propose a novel cross-modal localization framework that enables Light Detection and Ranging (LiDAR)-equipped vehicles to estimate their global pose by leveraging publicly available Street View images. The proposed method leverages a shared embedding space, learned via a weight-sharing Vision Transformer (ViT) encoder, to align heterogeneous sensor modalities, specifically LiDAR intensity images and geo-tagged Street View. The shared embedding space enables cross-modal matching for global localization via place recognition, eliminating the need for prior map construction or sensor calibration. Further, to compensate for heading discrepancies between the two modalities, the framework introduces an equirectangular perspective-n-point (PnP) solver driven by patch-level feature correspondences. Our proposed method enables 3-degree-of-freedom (DoF) global localization from a single LiDAR scan and a publicly available Street View image. Experiments demonstrate that the proposed method achieves high recall and accurate heading estimation, offering a scalable solution for global localization without r
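Once both modalities live in a shared embedding space, place recognition reduces to nearest-neighbor retrieval by cosine similarity (a generic retrieval sketch; the toy vectors stand in for ViT features of LiDAR intensity and Street View images):

```python
import numpy as np

def retrieve(query_emb, ref_embs, k=1):
    """Cross-modal place recognition: L2-normalize both sides and
    rank reference embeddings by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = r @ q
    return np.argsort(sims)[::-1][:k], sims

# Hypothetical 2D embeddings of three geo-tagged Street View references.
refs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])
top, sims = retrieve(np.array([0.9, 0.1]), refs, k=1)   # LiDAR-side query
```

The retrieved reference provides the coarse position; the heading is then refined separately, as described above, via the equirectangular PnP step.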
|
| |
| 15:00-16:30, Paper TuI2I.406 | Add to My Program |
| Physics-Informed Passive Motion Paradigm for Parallel Robots: A High-Precision Motor-Primitives Framework |
|
| Wang, Fuli | The University of Sheffield |
| Nizar Siraj, Fazair | The University of Sheffield |
| Hutabarat, Windo | The University of Sheffield |
| Tiwari, Ashutosh | University of Sheffield |
Keywords: Parallel Robots, Compliance and Impedance Control, Deep Learning Methods
Abstract: Complex embodied systems, whether biological or robotic, must continuously generate goal-directed behaviors while preserving coherence between motor intention and physical feasibility. In parallel robots, this link between intention and mechanics becomes particularly challenging due to their nonlinear, over-constrained kinematics and the absence of intuitive motor primitives. This letter introduces a passive motion paradigm for parallel robots using self-supervised physics-informed neural networks, which reformulates motion generation as the dynamic unfolding of motor primitives driven by attractor fields in actuator space. Unlike traditional forward or optimization-based formulations, the framework integrates analytical kinematics with neural fields to ensure both physical consistency and adaptive motion generation. The method estimates the Jacobian matrix as a physically constrained neural field, merging analytical structure with data-driven learning to achieve robust and interpretable behavior without relying on iterative numerical solvers. Theoretical analysis, simulations, and physical experiments demonstrate the framework’s accuracy, stability, and adaptability across different parallel mechanisms.
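The passive motion paradigm at the heart of this framework maps a task-space attractor force through the Jacobian transpose, so motion "unfolds" without explicit inverse kinematics. A minimal sketch on a planar 2-link arm (a stand-in mechanism with an analytic Jacobian; the paper instead learns the Jacobian of a parallel robot as a physics-informed neural field):

```python
import numpy as np

def fk(q, l1=1.0, l2=1.0):
    """Forward kinematics of a planar 2-link arm."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q, l1=1.0, l2=1.0):
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def pmp_step(q, x_goal, K=1.0, dt=0.05):
    """Passive motion paradigm: the attractor force K (x* - x) is
    mapped to joint space through J^T, i.e. gradient descent on the
    task-space error without inverting the kinematics."""
    return q + dt * jacobian(q).T @ (K * (x_goal - fk(q)))

q, goal = np.array([0.3, 0.8]), np.array([1.2, 0.8])
for _ in range(500):
    q = pmp_step(q, goal)
```

The neural-field formulation replaces `jacobian(q)` with a learned, physically constrained map, which is what lets the method avoid iterative numerical solvers on over-constrained parallel kinematics.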
|
| |
| 15:00-16:30, Paper TuI2I.407 | Add to My Program |
| A Differential Dynamic Programming Framework for Inverse Reinforcement Learning |
|
| Cao, Kun | Nanyang Technological University |
| Xu, Xinhang | Nanyang Technological University |
| Jin, Wanxin | Arizona State University |
| Johansson, Karl H. | Royal Institute of Technology |
| Xie, Lihua | Nanyang Technological University |
Keywords: Learning from Demonstration, Motion and Path Planning, Optimization and Optimal Control, Inverse Problems
Abstract: A differential dynamic programming (DDP)-based framework for inverse reinforcement learning (IRL) is introduced to recover the parameters in the cost function, system dynamics, and constraints from demonstrations. Different from existing work, where DDP was usually used for the inner forward problem, our proposed framework uses it to efficiently compute the gradient required in the outer inverse problem with equality and inequality constraints. The equivalence between the proposed and existing methods based on Pontryagin’s Maximum Principle (PMP) is established. More importantly, using this DDP-based IRL with an open-loop loss function, a closed-loop IRL framework is presented. In this framework, a loss function is proposed to capture the closed-loop nature of demonstrations. It is shown to be better than the commonly used open-loop loss function. We show that the closed-loop IRL framework reduces to a constrained inverse optimal control problem under certain assumptions. Under these assumptions and a rank condition, it is proven that the learning parameters can be recovered from the demonstration data. The proposed framework is extensively evaluated through four numerical robot examples and one real-world quadrotor system. The experiments validate the theoretical results and illustrate the practical relevance of the approach.
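The bilevel structure of IRL — an inner optimal-control problem whose solution feeds an outer loss over demonstrations — can be seen in a toy one-step example with a closed-form inner solution (illustrative only; the paper uses DDP to obtain these derivatives for full constrained trajectories):

```python
import numpy as np

def inner_solution(theta, x0):
    """Inner problem: min_u theta*u^2 + (x0 + u)^2, solved in closed
    form as u*(theta) = -x0 / (1 + theta). Stands in for the
    DDP-solved forward optimal control problem."""
    return -x0 / (1.0 + theta)

def outer_gradient(theta, x0, u_demo):
    """Gradient of the open-loop loss 0.5*(u*(theta) - u_demo)^2
    w.r.t. the cost parameter theta, by differentiating through
    the inner solution."""
    u = inner_solution(theta, x0)
    du_dtheta = x0 / (1.0 + theta) ** 2
    return (u - u_demo) * du_dtheta

theta_true, x0 = 2.0, np.array([1.0, -0.5, 2.0])   # hypothetical demos
u_demo = inner_solution(theta_true, x0)
theta = 0.5                                         # initial guess
for _ in range(2000):
    theta -= 0.5 * np.sum(outer_gradient(theta, x0, u_demo))
```

In the paper, this chain-rule step is what DDP computes efficiently for trajectories with equality and inequality constraints; the closed-loop variant replaces the open-loop loss with one that respects feedback in the demonstrations.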
|
| |
| 15:00-16:30, Paper TuI2I.408 | Add to My Program |
| Surgeon Supervised Autonomous Surgical System for Oral and Maxillofacial Surgery (I) |
|
| Ma, Qingchuan | Beihang University |
| Kobayashi, Etsuko | The University of Tokyo |
| Hara, Kazuaki | The University of Tokyo |
| Wang, Junchen | Beihang University |
| Masamune, Ken | Tokyo Women's Medical University |
| Suenaga, Hideyuki | The University of Tokyo Hospital |
| Fan, Yubo | Beihang University |
Keywords: Medical Robots and Systems, Robotics and Automation in Life Sciences, Human-Robot Collaboration
Abstract: Oral and maxillofacial surgery (OMS) imposes an increasing workload on even the most experienced surgeons due to long operation time, high skill requirements, limited observation field, constrained workspace, and fast-growing patient population. Robot-assisted OMS is particularly challenging, requiring technological advancements to replicate complex surgical workflows executed by human surgeons and novel working concepts to properly address human-machine relationships. We introduced a Surgeon Supervised Autonomous Surgical System (SSASS) aiming to solve emerging bottlenecks in OMS. SSASS comprises a custom-developed deep-learning-assisted virtual planning module, a teeth-based monocular camera navigation module, and a six-degree-of-freedom compact robot module that function as the surgeon's auxiliary brain, eye, and hand, respectively. These three modules are further seamlessly integrated to autonomously complete most labor-intensive procedures, while prioritizing surgeons to supervise and be responsible for the overall procedure. Le Fort I experiments on five human head models demonstrated that the surgical results of SSASS closely matched the preoperative plan, with high drilling accuracy and acceptable cutting accuracy under a significantly simplified surgical workflow. SSASS integrates deep learning, medical 3D printing, markerless navigation, virtual reality, and collaborative robotics, providing a comprehensive surgical solution encompassing the entire OMS loop.
|
| |
| 15:00-16:30, Paper TuI2I.409 | Add to My Program |
| Aquaculture Robotics: Adaptive Path Planning through Real-Time Estimation of the Shape of Flexible Net Pens (I) |
|
| Amundsen, Herman Biørn | SINTEF Ocean |
| Katsidoniotaki, Eirini | Massachusetts Institute of Technology |
| Føre, Martin | NTNU |
| Kelasidi, Eleni | NTNU |
Keywords: Field Robots, Marine Robotics, Motion and Path Planning
Abstract: Aquaculture is a marine industry experiencing significant growth and is an important seafood provider. Underwater vehicles such as remotely operated vehicles (ROVs) are commonly used for inspection and maintenance of the net pens where the fish are grown. These net pens are flexible structures whose position and shape change with ocean currents and waves. Any autonomous robotic operation in aquaculture is, therefore, challenging as the net pen position and shape cannot be predetermined and it is imperative that the robot does not collide with and damage the net. This article addresses this issue by proposing a novel method to estimate the full shape of aquaculture net pens in real time using an underwater vehicle equipped with a forward-looking Doppler velocity log (DVL). The method introduces a new concept for how sparse measurement data on the net pen can be fused with numerical models of the full net pen that contrasts with other models in the literature by not requiring instrumentation on the net pen nor knowledge of ocean current conditions. The estimator output is then used in closed-loop vehicle control by planning and following paths relative to the estimated pen shape. The method is tested in simulations, which show a root-mean-square error (RMSE) of 0.5 m for the estimate of the entire net pen structure and centimeter-level estimation error of the distance between the vehicle and net, and in full-scale trials in an industrial fish farm, where an ROV autonomously navigated a
|
| |
| 15:00-16:30, Paper TuI2I.410 | Add to My Program |
| Robotic Exoskeleton with Mechanically Implemented Kinematic Synergy for Quadrupedal Gait of Rats |
|
| Miyamoto, Takayuki | University of Tsukuba |
| Mikhailov, Andrey | University of Tsukuba |
| Hassan, Modar | University of Tsukuba |
| Puentes, Sandra | University of Tsukuba |
| Hiraga, Taichi | University of Tsukuba |
| Soya, Hideaki | University of Tsukuba |
| Suzuki, Kenji | University of Tsukuba |
Keywords: Rehabilitation Robotics, Prosthetics and Exoskeletons, Actuation and Joint Mechanisms
Abstract: This study introduces a novel kinematic synergy-based exoskeleton designed for gait rehabilitation studies in rats. The exoskeleton assists all three hindlimb joints of the rat (hip, knee and ankle) while ensuring proper interjoint coordination and the natural quadrupedal posture. This assistance is realized through a 2-DOF bar mechanism that emulates the biomechanics of rats. Engineered to be compact, lightweight, backdrivable, and sufficiently powerful, the proposed system minimizes physical stress on the animal while allowing a wide range of assistive forces to be applied. These features are achieved through a combination of a cable power transmission system and direct-drive motors positioned outside the exoskeletal structure. The desktop experiments demonstrated that the exoskeleton could precisely replicate the rat’s kinematic gait patterns and remain backdrivable whether powered or unpowered. The feasibility of gait assistance was further confirmed in an anesthetized rat, where synergistic gait patterns were observed between the joints. Hence, the system holds the potential to enable controlled comparative neurorehabilitation studies in rats. These studies can help unveil neural recovery mechanisms and design optimal exoskeleton control strategies for rehabilitation in humans.
|
| |
| 15:00-16:30, Paper TuI2I.411 | Add to My Program |
| HiMo: High-Speed Objects Motion Compensation in Point Clouds |
|
| Zhang, Qingwen | KTH Royal Institute of Technology |
| Khoche, Ajinkya | KTH Royal Institute of Technology Stockholm, Traton, SCANIA CV AB |
| Yang, Yi | KTH Royal Institute of Technology & Scania AB |
| Ling, Li | KTH Royal Institute of Technology |
| Sharif Mansouri, Sina | Scania |
| Andersson, Olov | KTH Royal Institute of Technology |
| Jensfelt, Patric | KTH - Royal Institute of Technology |
|
|
| |
| 15:00-16:30, Paper TuI2I.412 | Add to My Program |
| Risk-Aware and Scalable Hierarchical Motion Planning for Large-Scale Robotic Swarms Via CVaR-Constrained MPC (I) |
|
| Yang, Xuru | Peking University |
| Zhao, Yuqiao | Peking University |
| Hu, Yunze | Peking University |
| Yang, Zongru | Peking University |
| Zhu, Pingping | Marshall University |
| Sun, Ying | The Pennsylvania State University |
| Liu, Chang | Peking University |
Keywords: Swarm Robotics, Path Planning for Multiple Mobile Robots or Agents, Motion Control
Abstract: Motion planning for large-scale robotic swarms presents significant challenges in terms of scalability and safety assurance in cluttered environments. To address these issues, this manuscript proposes a Closed-loop hierarchical Risk-aware swarm mOtion planner using Conditional ValuE at Risk (C-ROVER) that enables safe and efficient navigation for swarm robotic systems. The hierarchical structure of C-ROVER comprises a macroscopic planning stage that models the swarm state with Gaussian Mixture Models (GMMs) and generates trajectories for the swarm GMM, followed by a microscopic control stage that computes individual robot control using distributed model predictive control to track the GMM trajectories while achieving robot-level collision avoidance. Robot positions are periodically used to update the swarm GMM, closing the hierarchical planning and control loop. To achieve collision risk-awareness between the swarm and environmental obstacles at the macroscopic stage, C-ROVER leverages the stochastic Signed Distance Function to characterize the distance between the swarm GMM and obstacles, which is proven to follow a GMM. Then C-ROVER proposes an analytical expression of the Conditional Value-at-Risk (CVaR) of a GMM to enable swarm collision risk mitigation. Furthermore, C-ROVER designs a novel risk-aware space discretization approach to enhance the ability to navigate constrained spaces.
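The CVaR construction in this abstract can be illustrated with a small numerical sketch. Note this is not the paper's method: C-ROVER derives an analytical CVaR expression for a GMM, whereas the snippet below estimates it by Monte Carlo, and the helper names (`sample_gmm`, `cvar`) are hypothetical.

```python
import numpy as np

def sample_gmm(weights, means, stds, n, rng):
    # Draw n samples from a 1-D Gaussian mixture.
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.array(means)[comps], np.array(stds)[comps])

def cvar(samples, alpha=0.95):
    # CVaR_alpha: expected value in the worst (1 - alpha) tail.
    var = np.quantile(samples, alpha)   # value-at-risk (tail threshold)
    return samples[samples >= var].mean()

rng = np.random.default_rng(0)
# Toy "collision loss" distributed as a two-component mixture.
s = sample_gmm([0.7, 0.3], [0.0, 2.0], [0.5, 0.5], 100_000, rng)
print(cvar(s, 0.95))   # average of the worst 5% of losses
```

By construction CVaR upper-bounds the corresponding quantile (VaR), which is why it is a conservative risk measure for collision constraints.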
|
| |
| 15:00-16:30, Paper TuI2I.413 | Add to My Program |
| UltraVPR: Unsupervised Lightweight Rotation-Invariant Aerial Visual Place Recognition |
|
| Chen, Chao | Beijing University of Chemical Technology |
| Li, Chunyu | Beijing Institute of Technology |
| He, Mengfan | Tsinghua University |
| Wang, Jun | Beijing University of Chemical Technology |
| Xing, Fei | Tsinghua University |
| Meng, Ziyang | Tsinghua University |
Keywords: Localization, Recognition, Aerial Systems: Perception and Autonomy
Abstract: Aerial Visual Place Recognition (VPR) is critical for Unmanned Aerial Vehicles (UAVs) localization, especially in environments with unstable or unavailable GPS signals. While neural network-based VPR methods have become mainstream, they face significant challenges on UAV platforms. Traditional CNN-based VPR models are highly sensitive to image rotation, degrading their performance in aerial-domain environments. Meanwhile, Transformer-based models have high computational complexity, making them less suitable for resource-constrained UAVs. In this letter, we propose a lightweight, rotation-invariant aerial VPR method. Our approach combines a rotation-equivariant backbone network with a rotation-invariant aggregation layer to ensure descriptor consistency across different orientations. Additionally, we propose an unsupervised training strategy that constructs higher-dimensional descriptors to optimize the model, while maintaining the lower descriptor dimensionality during application. Experimental results show that our method outperforms state-of-the-art methods across multiple aerial VPR datasets. The code will be released at https://github.com/cbbhuxx/UltraVPR.
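Rotation-invariant aggregation of the kind claimed above can be sketched in a few lines (an illustrative stand-in, not the paper's network): an in-plane image rotation circularly shifts an angular histogram of local features, and the magnitude of a discrete Fourier transform is invariant to such shifts. The function name is hypothetical.

```python
import numpy as np

def rotation_invariant_descriptor(angular_hist):
    # A circular shift of the angular histogram (what an in-plane image
    # rotation induces) leaves the DFT magnitude unchanged.
    spec = np.abs(np.fft.rfft(angular_hist))
    return spec / (np.linalg.norm(spec) + 1e-12)

h = np.array([0., 1., 3., 2., 0., 0., 1., 0.])
d1 = rotation_invariant_descriptor(h)
d2 = rotation_invariant_descriptor(np.roll(h, 3))  # "rotated" view
print(np.allclose(d1, d2))  # True
```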
|
| |
| 15:00-16:30, Paper TuI2I.414 | Add to My Program |
| Real-Time Millimeter-Accurate Underwater Pose Estimation Via Tightly-Coupled Fusion of Vision and Optical Tracking |
|
| Gao, Yuer | The Hong Kong University of Science and Technology (Guangzhou) |
| Xu, Tongqing | The Hong Kong University of Science and Technology (Guangzhou) |
| Cai, Yi | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Marine Robotics, Sensor Fusion, Localization
Abstract: Precise and high-frequency state estimation is required for advanced underwater robotic applications such as physical interaction and agile control, yet no single sensor can simultaneously provide both high accuracy and high update rates. Vision-based methods offer high-frequency updates but suffer from drift, while optical tracking systems are highly accurate but may not provide sufficiently high update rates for real-time control loops. This paper presents a tightly-coupled sensor fusion framework that combines a high-frequency (62 FPS) monocular vision-based pose estimator with a high-accuracy (millimeter-level) optical tracking system. Our approach uses a visual estimator for high-frequency state propagation—with a latent variable motion model to compensate for underwater disturbances—while the optical tracker provides periodic corrections. In a controlled underwater testbed, this achieves a position RMSE of 5.65 mm at 62 FPS, improving accuracy 1.6x compared to the best baseline method (EfficientPose + EKF: 9.20 mm) and 6.4x compared to vision-only estimation (36 mm). Our dataset and code are available upon request.
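The propagate/correct pattern described above (high-rate vision, low-rate accurate tracker) can be sketched as a toy 1-D Kalman filter. All rates, noise levels, and names below are illustrative assumptions, not the paper's tightly-coupled estimator.

```python
import numpy as np

class FusedEstimator:
    """Toy 1-D stand-in: high-rate visual odometry propagates the state;
    a low-rate, accurate optical tracker removes drift via Kalman updates."""
    def __init__(self, x0=0.0, p0=1.0, q=1e-3, r=1e-4):
        self.x, self.p, self.q, self.r = x0, p0, q, r

    def propagate(self, vision_delta):
        self.x += vision_delta          # integrate visual motion estimate
        self.p += self.q                # uncertainty grows with every step

    def correct(self, tracker_pos):
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (tracker_pos - self.x)
        self.p *= (1.0 - k)

est = FusedEstimator()
rng = np.random.default_rng(1)
truth = 0.0
for step in range(620):                       # ~10 s at 62 FPS
    truth += 0.001
    est.propagate(0.001 + rng.normal(0, 0.0005))  # drifting vision
    if step % 31 == 0:                        # tracker correction at ~2 Hz
        est.correct(truth + rng.normal(0, 0.001))
print(abs(est.x - truth))                     # residual drift in meters
```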
|
| |
| 15:00-16:30, Paper TuI2I.415 | Add to My Program |
| Onboard Ranging-Based Relative Localization and Stability for Lightweight Aerial Swarms |
|
| Li, Shushuai | Delft University of Technology |
| Shan, Feng | Southeast University |
| Liu, Jiangpeng | Southeast University |
| Coppola, Mario | Delft University of Technology |
| De Wagter, Christophe | Delft University of Technology |
| de Croon, Guido | Delft University of Technology |
Keywords: Swarm Robotics, Automation at Micro-Nano Scales
Abstract: Lightweight aerial swarms have potential applications in scenarios where larger drones fail to operate efficiently. The primary foundation for lightweight aerial swarms is efficient relative localization, which enables cooperation and collision avoidance. Computing the real-time position is challenging due to extreme resource constraints. This paper presents an autonomous relative localization technique for lightweight aerial swarms without infrastructure by fusing ultra-wideband wireless distance measurements and the shared state information (e.g., velocity, yaw rate, height) from neighbors. This is the first fully autonomous, tiny, fast, and accurate relative localization scheme implemented on a team of 13 lightweight (33 grams) and resource-constrained (168 MHz MCU with 192 KB memory) aerial vehicles. The proposed resource-constrained swarm ranging protocol is scalable, and a surprising theoretical result is discovered: the unobservability poses no issues because the state drift leads to control actions that make the state observable again. By experiment, less than 0.2 m position error is achieved at a frequency of 16 Hz for as many as 13 drones. The code is open-sourced, and the proposed technique is relevant not only for tiny drones but can be readily applied to many other resource-restricted robots. Video and code can be found at https://shushuai3.github.io/autonomous-swarm/.
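As an illustrative sketch of range-only relative localization (not the authors' implementation), the snippet below applies EKF-style corrections to a 2-D relative position from UWB distance measurements. It also exhibits the observability issue the abstract mentions: with static geometry, range updates fix the distance but cannot recover the bearing; motion is what restores observability.

```python
import numpy as np

def range_update(rel_pos, cov, measured_range, r_noise=0.05**2):
    # One EKF-style correction of a 2-D relative position from a single
    # UWB distance measurement (a toy sketch, not the paper's full filter).
    pred = np.linalg.norm(rel_pos)
    H = rel_pos / pred                  # Jacobian of range w.r.t. position
    s = H @ cov @ H + r_noise           # innovation variance
    K = cov @ H / s                     # Kalman gain
    rel_pos = rel_pos + K * (measured_range - pred)
    cov = cov - np.outer(K, H @ cov)
    return rel_pos, cov

pos = np.array([1.0, 0.5])              # initial relative-position guess
cov = np.eye(2) * 0.5
true_pos = np.array([1.2, 0.8])
for _ in range(20):
    pos, cov = range_update(pos, cov, np.linalg.norm(true_pos))
# The estimated range converges to the true range, but the bearing never
# moves: range-only updates under static geometry are unobservable.
print(pos)
```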
|
| |
| 15:00-16:30, Paper TuI2I.416 | Add to My Program |
| CRESCENT: Collision-Free Highly Constrained Trajectory Optimization for Driving on the Moon (I) |
|
| Cauligi, Abhishek | Johns Hopkins University |
| Albee, Keenan | University of Southern California |
| Brockers, Roland | Jet Propulsion Laboratory, California Institute of Technology |
| de la Croix, Jean-Pierre | Jet Propulsion Laboratory, California Institute of Technology |
Keywords: Space Robotics and Automation, Motion and Path Planning, Planning under Uncertainty
Abstract: Rovers have been a mainstay of planetary exploration missions, significantly expanding our knowledge in planetary science. However, past rover missions have involved significant human supervision to oversee rover operations, a state-of-practice that scales poorly for the next generation of missions. In this work, we present the development of Constrained Roving Exploration via Safe Collision-free and Environment-aware Trajectory optimization (CRESCENT), a motion planning algorithm developed for the upcoming multiagent Cooperative Autonomous Distributed Robotic Exploration (CADRE) Lunar rover mission. CRESCENT was designed to safely drive a miniature rover platform in a highly cluttered unmapped Lunar environment, executing complex motion directives from CADRE’s team-level autonomy while meeting far stricter dynamical and temporal constraints than existing onboard planetary rover planning algorithms are capable of satisfying. Our hierarchical approach formulates an efficient numerical trajectory optimization-based motion planning algorithm that makes use of nonlinear optimization to solve the planning problem in real time. We demonstrate the efficiency of our proposed approach through extensive simulations and hardware testing in a representative Lunar environment. Following CADRE’s upcoming deployment on the Lunar surface, CRESCENT will be the first nonlinear optimization-based trajectory optimization approach used on another celestial body.
|
| |
| 15:00-16:30, Paper TuI2I.417 | Add to My Program |
| Cuspidal Redundant Robots: Classification of Infinitely Many IKS of Special Classes of 7R Robots |
|
| Salunkhe, Durgesh Haribhau | EPFL |
| Gupta, Sthithpragya | Ecole Polytechnique Federale De Lausanne |
| Billard, Aude | EPFL |
Keywords: Kinematics, Redundant Robots, Motion and Path Planning
Abstract: Redundant robots, with more degrees of freedom than required for a given task, offer enhanced dexterity but can exhibit complex kinematic behaviour in motion planning. Cuspidal robots, which can change inverse kinematic solutions without crossing singularities, have been reported to pose unique challenges for motion feasibility and repeatability. While cuspidality has been extensively studied for 3R and certain 6R robots, no formal classification exists for redundant architectures. This paper presents a systematic framework for classifying 7R wrist-partitioned redundant robots based on their cuspidal properties. The method reduces the 7R structure to a parameterized 3R equivalent via the redundant joint angle, enabling the application of established theory for cuspidal robots. Using this approach, commercially available robots are analysed and categorized as cuspidal or noncuspidal. Results show that the design offsets in commercial cobots may lead to cuspidality, which can potentially cause a nonsingular change of operation mode in collaborative applications. This classification framework provides a foundation for cuspidality-aware path planning and offers practical guidelines for designing non-cuspidal redundant robots to ensure safer and more predictable operation.
|
| |
| 15:00-16:30, Paper TuI2I.418 | Add to My Program |
| Second Order Sliding Mode Control of Flying Wing Aircraft Based on Feedforward Neural Networks (I) |
|
| Song, Yuecheng | Northwestern Polytechnical University |
| Liu, Zhenbao | Northwestern Polytechnical University |
| Han, Junwei | Northwestern Polytechnical University |
| Yuan, Jinbiao | Northwestern Polytechnical University |
| Zhao, Wen | Northwestern Polytechnical University |
| Dang, Qingqing | Northwestern Polytechnical University |
Keywords: Aerial Systems: Mechanics and Control, Model Learning for Control, Reinforcement Learning
Abstract: The flying-wing aircraft control problem is a major concern. In this paper, a new control strategy is introduced. First, a feedforward neural network (FNN) model of the aircraft is introduced. Then, a second-order sliding mode control is applied, with its parameters generated by Deep Deterministic Policy Gradient (DDPG) reinforcement learning. To study the disturbance rejection performance, wind disturbance is applied to the aircraft, using a deep neural network as a disturbance observer for different types of winds. Finally, all three simulation stages (Simulink, Software-in-the-Loop, and Hardware-in-the-Loop) are applied to show the effectiveness of the proposed strategy. The simulation results show that the proposed method demonstrates good robustness in various conditions.
|
| |
| 15:00-16:30, Paper TuI2I.419 | Add to My Program |
| Cooperative Grasping for Collective Object Transport in Constrained Environments |
|
| Alvear Goyes, David Felipe | King Abdullah University of Science and Technology |
| Turkiyyah, George | King Abdullah University of Science and Technology |
| Park, Shinkyu | KAUST |
Keywords: Cooperating Robots, Mobile Manipulation, Integrated Planning and Learning
Abstract: We propose a novel framework for decision-making in cooperative grasping for two-robot object transport in constrained environments. The core of the framework is a Conditional Embedding (CE) model consisting of two neural networks that map grasp configuration information into an embedding space. The resulting embedding vectors are then used to identify feasible grasp configurations that allow two robots to collaboratively transport an object. To ensure generalizability across diverse environments and object geometries, the neural networks are trained on a dataset comprising a range of environment maps and object shapes. We employ a supervised learning approach with negative sampling to ensure that the learned embeddings effectively distinguish between feasible and infeasible grasp configurations. Evaluation results across a wide range of environments and objects in simulations demonstrate the model's ability to reliably identify feasible grasp configurations. We further validate the framework through experiments on a physical robotic platform, confirming its practical applicability.
|
| |
| 15:00-16:30, Paper TuI2I.420 | Add to My Program |
| FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation |
|
| Shang, Siqi | University of Texas at Austin |
| Seo, Mingyo | The University of Texas at Austin |
| Zhu, Yuke | The University of Texas at Austin |
| Chin, Lillian | UT Austin |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Soft Robot Applications
Abstract: Handling fragile objects remains a major challenge for robotic manipulation. Tactile sensing and soft robotics can improve delicate object handling, but typically involve high integration complexity or slow response times. We address these issues through FORTE, an easy-to-fabricate tactile sensing system. FORTE uses 3D-printed fin-ray grippers with internal air channels to provide low-latency force and slip feedback. This feedback allows us to apply just enough force to grasp objects without damaging them. We can accurately estimate grasping forces from 0–8 N with an average error of 0.2 N, and detect slip events within 100 ms of occurring. FORTE can grasp a wide range of slippery, fragile, and deformable objects, including raspberries and potato chips with 92% success and achieves 93% accuracy in detecting slip events. These results highlight FORTE’s potential as a robust solution for delicate robotic manipulation. Project page: https://merge-lab.github.io/FORTE/
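Slip detection from a pressure channel, as described above, can be sketched as thresholding the signal's time derivative. The threshold, sampling rate, injected signal, and function name below are illustrative assumptions, not FORTE's actual pipeline.

```python
import numpy as np

def detect_slip(pressure, dt=0.001, thresh=50.0):
    # Flag the first sample where the pressure-signal derivative exceeds
    # a threshold: a crude stand-in for slip-onset detection.
    dp = np.abs(np.diff(pressure)) / dt
    idx = np.argmax(dp > thresh)
    return idx if dp[idx] > thresh else None

t = np.arange(0, 1, 0.001)
p = 2.0 + 0.01 * np.sin(2 * np.pi * 1 * t)          # steady grasp
p[700:] += 0.5 * np.sin(2 * np.pi * 80 * t[700:])   # slip vibration at 0.7 s
i = detect_slip(p)
print(i * 0.001)   # detection time in seconds (slip injected at t = 0.7 s)
```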
|
| |
| 15:00-16:30, Paper TuI2I.421 | Add to My Program |
| Multi-Structure Mapping for Filtering Electric Arc Noise in Power Line Environments |
|
| Boulajoul, Najlae | University of Sherbrooke |
| Lussier Desbiens, Alexis | Université De Sherbrooke |
| Ferland, François | Université De Sherbrooke |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Mapping
Abstract: Electric arc noise around energized power lines corrupts drone LiDAR measurements, accumulating in occupancy grids and producing spurious obstacles that degrade navigation reliability. Existing filters designed for environmental clutter such as snow, dust, and rain fail to consistently reject these short-lived arc transients and remain difficult to deploy on resource-limited platforms. We propose a dual-structure filtering framework that dynamically separates transient arc noise from persistent environmental features. Instead of filtering scan-by-scan, the proposed filter leverages spatio-temporal neighborhood consistency across consecutive LiDAR frames to suppress short-duration particles. A transient k-d tree accelerates neighborhood queries and removes arc noise around valid structures, while a persistent octree integrates only enduring features into the global map. Experiments show up to 10 times faster filtering and mapping precision of 92.27% with F1-scores up to 95%. Real-world inspection flights over energized power lines confirm that the approach maintains accurate, up-to-date maps and robust performance in the presence of electric arc noise.
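The transient-rejection idea above (keep only LiDAR returns that persist across consecutive frames) can be sketched with a k-d tree neighborhood query. This is a single-frame-pair toy with made-up geometry, not the paper's dual-structure implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def filter_transients(prev_frame, curr_frame, radius=0.1, min_neighbors=1):
    # Keep only points of the current scan that are spatially supported by
    # the previous scan; isolated one-frame returns (arc noise) are dropped.
    tree = cKDTree(prev_frame)
    counts = tree.query_ball_point(curr_frame, r=radius, return_length=True)
    return curr_frame[counts >= min_neighbors]

rng = np.random.default_rng(2)
# A persistent "power line" seen in two consecutive scans with small noise.
wire = np.column_stack([np.linspace(0, 5, 200), np.zeros(200), np.zeros(200)])
prev = wire + rng.normal(0, 0.01, wire.shape)
arc = rng.uniform(-2, 2, (30, 3)) + [2.5, 1.0, 0.0]  # one-frame arc burst
curr = np.vstack([wire + rng.normal(0, 0.01, wire.shape), arc])
kept = filter_transients(prev, curr)
print(len(curr), "->", len(kept))   # arc points are mostly rejected
```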
|
| |
| 15:00-16:30, Paper TuI2I.422 | Add to My Program |
| D2MFusion: An End-To-End Differentiable Trajectory Optimizer for Safe Reactive Navigation |
|
| Zhou, Xiangyu | Shanghai Jiao Tong University |
| Zhang, Shenghong | Shanghai Jiao Tong University |
| Li, Xiao | Shanghai Jiao Tong University |
Keywords: Integrated Planning and Learning, Autonomous Vehicle Navigation, Motion and Path Planning
Abstract: Data-driven methods provide effective solutions for robot trajectory generation in dynamic environments. Many physical constraints exist in the real world, and learning these constraints well enough to generate kinematically and dynamically feasible trajectories places heavy demands on data quantity. Due to the black-box nature of such models, it is also challenging to ensure the safety of the trajectories they plan. In this paper, we propose an end-to-end model (D2MFusion) that fuses data-driven components and a model-based optimizer. D2MFusion uses a differentiable optimization layer (dLQR) that forms a backpropagation loop with a perception network. With the input BEV image, the perception network outputs the environmental feature vector to adjust the optimizer parameters to adapt to the dynamic environment. We train this fusion planner to imitate expert trajectories on a real self-driving dataset and demonstrate the planner’s explainability, data efficiency, and safe reactivity through closed-loop simulations. We also conduct experiments on a real quadrupedal robot (Unitree Go2) in three different scenarios to demonstrate the ability of our method to navigate in dynamic environments.
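The model-based core that a differentiable LQR (dLQR) layer wraps is the finite-horizon Riccati recursion; below is a plain sketch of that recursion on a double integrator, without any autodiff framework. The dynamics and cost weights are illustrative, not the paper's learned parameters.

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    # Finite-horizon discrete LQR via the backward Riccati recursion,
    # the model-based core a dLQR-style layer differentiates through.
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]   # gains[t] applies at step t

# Double integrator: drive position/velocity to the origin.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.eye(1) * 0.1
Ks = lqr_gains(A, B, Q, R, horizon=100)

x = np.array([[1.0], [0.0]])
for K in Ks:
    x = A @ x - B @ (K @ x)
print(float(np.linalg.norm(x)))   # near zero after 10 s
```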
|
| |
| 15:00-16:30, Paper TuI2I.423 | Add to My Program |
| Actuator Dynamics-Aware Model Predictive Control of a Wheeled Inverted Pendulum with a Fan |
|
| Kim, Dohyeon | Jeonbuk National University |
| Jung, Yeongtae | Jeonbuk National University |
Keywords: Actuation and Joint Mechanisms, Optimization and Optimal Control, Motion Control
Abstract: Wheeled Inverted Pendulum (WIP) systems offer agile mobility but are challenging to control due to their unstable and underactuated dynamics. To address these limitations, we develop a Wheeled Inverted Pendulum with a Fan (WIPF), which incorporates a fan-generated bidirectional thrust force as an additional control input. This makes the system fully actuated and enhances stability; however, the limited bandwidth of the fan thrust introduces control challenges. In this letter, we propose a Frequency-Shaped Model Predictive Control (FSMPC) design framework that accounts for actuator dynamics in the optimization process, and is expandable to other systems with different actuator dynamics. The proposed FSMPC can provide improved stability by penalizing high-frequency input using the frequency response of the fan. The nonlinear solver enables control input updates at rates exceeding 1 kHz, meeting real-time control requirements. The performance of FSMPC with the proposed design framework is compared through simulations and experiments against a Linear Quadratic Regulator (LQR), a standard Model Predictive Controller (MPC), and a Frequency-Shaped LQR (FSLQR) that does not consider fan dynamics or the input constraint. The results demonstrate that FSMPC achieves improved stability and robustness compared to other controllers.
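Frequency-shaped penalization of the control input, as used above, can be illustrated by weighting input spectral energy by squared frequency. The weighting function and signals are made up for illustration; the paper shapes the MPC cost with the fan's actual frequency response.

```python
import numpy as np

def freq_shaped_cost(u, dt, weight=lambda f: f**2):
    # Penalize input energy in proportion to squared frequency: a toy
    # version of shaping a cost around an actuator's limited bandwidth.
    spec = np.abs(np.fft.rfft(u)) ** 2
    f = np.fft.rfftfreq(len(u), dt)          # bin frequencies in Hz
    return float(np.sum(weight(f) * spec)) / len(u)

t = np.arange(0, 1, 0.01)
slow = np.sin(2 * np.pi * 1 * t)             # within actuator bandwidth
fast = np.sin(2 * np.pi * 20 * t)            # demands fast fan dynamics
print(freq_shaped_cost(slow, 0.01), freq_shaped_cost(fast, 0.01))
```

Equal-amplitude inputs thus incur very different costs: the high-frequency input is penalized far more, which is the mechanism that steers the optimizer away from commands the fan cannot track.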
|
| |
| 15:00-16:30, Paper TuI2I.424 | Add to My Program |
| Field Calibration of Hyperspectral Cameras for Terrain Inference |
|
| Hanson, Nathaniel | Massachusetts Institute of Technology |
| Pyatski, Benjamin | Northeastern University |
| Hibbard, Sam | Northeastern University |
| Lvov, Gary | Northeastern University |
| De La Garza, Oscar | Northeastern University |
| DiMarzio, Charles A | Northeastern University |
| Dorsey, Kristen | Northeastern University |
| Padir, Taskin | Northeastern University |
Keywords: Field Robots, Robotics and Automation in Agriculture and Forestry, Calibration and Identification
Abstract: Intra-class terrain differences such as water content directly influence a vehicle’s ability to traverse terrain, yet RGB vision systems may fail to distinguish these properties. We argue that expanding the evaluation of a terrain’s spectral content beyond red-green-blue channels to the near infrared spectrum provides useful information for such intra-class identification and route planning. However, accurate analysis of this spectral information is highly dependent on ambient illumination. We demonstrate a system architecture to collect and register multi-wavelength, hyperspectral images from a mobile robot and describe an approach to reflectance calibrate cameras under varying illumination conditions. To showcase the practical applications of our system, HYPER DRIVE, we demonstrate the ability to calculate vegetative health indices and soil moisture content from data collected from an off-road mobile robot with greater consistency than at-imager radiance.
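Reflectance calibration of the kind described above is commonly done with per-band dark-frame and white-reference measurements (empirical-line / flat-field correction). The numbers below are made up for illustration; this is not the paper's full varying-illumination procedure.

```python
import numpy as np

def reflectance(raw, dark, white):
    # Per-band empirical-line calibration: normalize raw radiance by
    # dark-current and white-reference measurements, clipped to [0, 1].
    return np.clip((raw - dark) / np.maximum(white - dark, 1e-9), 0.0, 1.0)

dark = np.array([100.0, 110.0, 95.0])       # per-band dark frame
white = np.array([4000.0, 4200.0, 3900.0])  # per-band white reference
raw = np.array([2050.0, 2155.0, 1997.5])    # observed radiance
print(reflectance(raw, dark, white))        # 0.5 reflectance in every band
```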
|
| |
| 15:00-16:30, Paper TuI2I.425 | Add to My Program |
| Simultaneous Extrinsic Contact and In-Hand Pose Estimation Via Distributed Tactile Sensing |
|
| Van der Merwe, Mark | University of Michigan |
| Ota, Kei | AI Robot Association |
| Berenson, Dmitry | University of Michigan |
| Fazeli, Nima | University of Michigan |
| Jha, Devesh | Mitsubishi Electric Research Laboratories |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, In-Hand Manipulation
Abstract: Prehensile autonomous manipulation, such as peg insertion, tool use, or assembly, requires precise in-hand understanding of the object pose and the extrinsic contacts made during interactions. Providing accurate estimation of pose and contacts is challenging. Tactile sensors can provide local geometry at the sensor and force information about the grasp, but the locality of sensing means resolving poses and contacts from tactile alone is often an ill-posed problem, as multiple configurations can be consistent with the observations. Adding visual feedback can help resolve ambiguities, but can suffer from noise and occlusions. In this work, we propose a method that pairs local observations from sensing with the physical constraints of contact. We propose a set of factors that ensure local consistency with tactile observations as well as enforcing physical plausibility, namely, that the estimated pose and contacts must respect the kinematic and force constraints of quasi-static rigid body interactions. We formalize our problem as a factor graph, allowing for efficient estimation. In our experiments, we demonstrate that our method outperforms existing geometric and contact-informed estimation pipelines, especially when only tactile information is available.
|
| |
| 15:00-16:30, Paper TuI2I.426 | Add to My Program |
| Gameplay with a Socially Supportive Virtual Robot Enhances Children's Global Self-Esteem, Peer Relationships, Interest and Engagement |
|
| Pasupuleti, Devasena | Osaka University |
| Mahzoon, Hamed | Osaka University |
| Sakai, Kazuki | Osaka University |
| Ishiguro, Hiroshi | Osaka University |
| Bangi, Yaswanth | Amrita Vishwa Vidyapeetham |
| Chittawadigi, Rajeevlochana G. | Amrita Vishwa Vidyapeetham |
| Yoshikawa, Yuichiro | Osaka University |
Keywords: Human-Robot Collaboration, Social HRI, Robot Companions
Abstract: Self-esteem plays a crucial role in children’s psychological well-being and social development. However, traditional interventions cannot provide consistent and engaging support. Recently, game-based learning has shown promise in fostering self-reliance and social confidence. Notably, socially supportive robots, offering consistent, adaptive, and peer-like reinforcement, have emerged as potential tools for enhancing children’s self-esteem. Nevertheless, their effectiveness in improving self-esteem remains unexplored. In this study, we investigated the role of a socially supportive virtual robot in boosting children’s self-esteem, social engagement, and motivation through game-based interactions. Specifically, we examined whether positive reinforcement from the robot influenced children’s global and social self-esteem, the quality and quantity of their friendships, and sustained engagement with the game. Twenty-three children in India participated in a video game with and without the virtual robot across three 30-min sessions over a month. Results indicated that children who interacted with the virtual robot showed significant improvement in global self-esteem, enhanced quantity and quality of friendships, and sustained interest and enjoyment in the task. However, no considerable change was observed in social self-esteem between the experimental and control conditions. These findings provide valuable insights into the potential of robot-mediated interventions for boosting children’s self-esteem and social engagement.
|
| |
| 15:00-16:30, Paper TuI2I.427 | Add to My Program |
| LLM-Handover: Exploiting LLMs for Task-Oriented Robot-Human Handovers |
|
| Tulbure, Andreea Roxana | ETH |
| Zurbruegg, René | ETH Zurich |
| Grigat, Timm | ETH Zurich |
| Hutter, Marco | ETH Zurich |
Keywords: Physical Human-Robot Interaction, Human-Aware Motion Planning, Human-Robot Collaboration
Abstract: Effective human-robot collaboration depends on task-oriented handovers, where robots present objects in ways that support the partner’s intended use. However, many existing approaches neglect the human’s intended action after the handover, relying on assumptions that limit generalizability. To address this gap, we propose LLM-Handover, a novel framework that integrates large language model (LLM)-based reasoning with part segmentation to enable context-aware grasp selection and execution. Given an RGB-D image and a task description, our system infers relevant object parts and selects grasps that optimize post-handover usability. To support evaluation, we introduce a new dataset of 60 household objects spanning 12 categories, each annotated with detailed part labels. We first demonstrate that our approach improves the performance of the underlying state-of-the-art part segmentation method in the context of robot-human handovers. Next, we show that LLM-Handover achieves higher grasp success rates and adapts better to post-handover task constraints. During hardware experiments, we achieve a success rate of 83% in a zero-shot setting over conventional and unconventional post-handover tasks. Finally, our comparative user study underlines that our method enables more intuitive, context-aware handovers, with participants preferring it in 86% of cases.
|
| |
| 15:00-16:30, Paper TuI2I.428 | Add to My Program |
| VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization |
|
| Shafferman, Hannah | Massachusetts Institute of Technology |
| Thomas, Annika | Massachusetts Institute of Technology |
| Kinnari, Jouko | Not Disclosed |
| Ricard, Michael | Draper Laboratory |
| Nino, Jose | Cornell University |
| How, Jonathan | Massachusetts Institute of Technology |
Keywords: Localization, Mapping, Object Detection, Segmentation and Categorization
Abstract: Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, occlusions, and perceptual aliasing in homogeneous environments — known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or finetuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.
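The submap correspondence search ultimately reduces to aligning two object-centroid maps once matches are hypothesized. A minimal 2D sketch of that frame-alignment step (illustrative only, assuming known correspondences; not the authors' pipeline, and `align_frames_2d` is a hypothetical name):

```python
import math

def align_frames_2d(ref_pts, qry_pts):
    """Estimate rotation theta and translation (tx, ty) mapping query-frame
    object centroids onto reference-frame ones (2D Kabsch, known matches)."""
    n = len(ref_pts)
    rcx = sum(p[0] for p in ref_pts) / n
    rcy = sum(p[1] for p in ref_pts) / n
    qcx = sum(p[0] for p in qry_pts) / n
    qcy = sum(p[1] for p in qry_pts) / n
    # Cross-covariance terms of the centered point sets.
    sxx = sxy = 0.0
    for (rx, ry), (qx, qy) in zip(ref_pts, qry_pts):
        ax, ay = qx - qcx, qy - qcy
        bx, by = rx - rcx, ry - rcy
        sxx += ax * bx + ay * by      # dot part
        sxy += ax * by - ay * bx      # cross part
    theta = math.atan2(sxy, sxx)
    c, s = math.cos(theta), math.sin(theta)
    tx = rcx - (c * qcx - s * qcy)
    ty = rcy - (s * qcx + c * qcy)
    return theta, tx, ty
```

The same least-squares alignment extends to 3D with an SVD; geometric-consistency checks between candidate matches would precede this step.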
|
| |
| 15:00-16:30, Paper TuI2I.429 | Add to My Program |
| E2-BKI: Evidential Ellipsoidal Bayesian Kernel Inference for Uncertainty-Aware Gaussian Semantic Mapping |
|
| Kim, Junyoung | Agency for Defense Development |
| Jeon, Minsik | Carnegie Mellon University |
| Min, Jihong | Agency for Defense Development |
| Kwak, Kiho | Agency for Defense Development |
| Seo, Junwon | Carnegie Mellon University |
Keywords: Mapping, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Semantic mapping aims to construct a 3D semantic representation of the environment, providing essential knowledge for robots operating in complex outdoor settings. While Bayesian Kernel Inference (BKI) addresses discontinuities of map inference from sparse sensor data, existing semantic mapping methods suffer from various sources of uncertainties in challenging outdoor environments. To address these issues, we propose an uncertainty-aware semantic mapping framework that handles multiple sources of uncertainties, which significantly degrade mapping performance. Our method estimates uncertainties in semantic predictions using Evidential Deep Learning and incorporates them into BKI for robust semantic inference. It further aggregates noisy observations into coherent Gaussian representations to mitigate the impact of unreliable points, while employing geometry-aligned kernels that adapt to complex scene structures. These Gaussian primitives effectively fuse local geometric and semantic information, enabling robust, uncertainty-aware mapping in complex outdoor scenarios. Comprehensive evaluation across diverse off-road and urban outdoor environments demonstrates consistent improvements in mapping quality, uncertainty calibration, representational flexibility, and robustness, while maintaining real-time efficiency. Our project website: https://e2-bki.github.io/
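The core BKI operation the abstract describes can be sketched as a kernel-weighted Dirichlet update, with an evidential confidence scaling each observation's weight. This is an illustrative stand-in, not the paper's implementation; the sparse kernel and the scalar `confidence` weighting are assumptions:

```python
import math

def bki_update(cell_alpha, cell_xyz, points, length_scale=0.5):
    """Update a map cell's Dirichlet concentrations from nearby semantic
    observations via Bayesian Kernel Inference. Each observation carries a
    one-hot class vector and a confidence in [0, 1] (an evidential-style
    weight) that scales its kernel contribution."""
    alpha = list(cell_alpha)
    for xyz, onehot, confidence in points:
        d = math.dist(cell_xyz, xyz)
        if d >= length_scale:          # sparse kernel: zero beyond the scale
            continue
        r = d / length_scale
        k = ((2 + math.cos(2 * math.pi * r)) * (1 - r) / 3
             + math.sin(2 * math.pi * r) / (2 * math.pi))
        w = k * confidence
        for c in range(len(alpha)):
            alpha[c] += w * onehot[c]
    probs = [a / sum(alpha) for a in alpha]
    return alpha, probs
```

Low-confidence (high-uncertainty) observations then barely move the cell's posterior, which is the qualitative effect the framework targets.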
|
| |
| 15:00-16:30, Paper TuI2I.430 | Add to My Program |
| Decentralized and Fully Onboard: Range-Aided Cooperative Localization and Navigation on Micro Aerial Vehicles |
|
| Goudar, Abhishek | University of Toronto |
| Schoellig, Angela P. | TU Munich |
Keywords: Multi-Robot Systems, Distributed Robot Systems, Localization
Abstract: Controlling a team of robots in a coordinated manner is challenging, as a centralized approach (where all computation is done on a central machine) has poor scalability and a globally-referenced external localization system may not always be available. In this work, we consider the problem of range-aided decentralized localization and formation control. In such a setting, each robot estimates its relative pose by combining data only from onboard odometry sensors and distance measurements to other robots in the team. Additionally, each robot calculates the control inputs necessary to collaboratively navigate an environment to accomplish a specific task, for example, moving in a desired formation while monitoring an area. We present a block coordinate descent approach to localization that does not require strict coordination between the robots. We present a novel formulation for formation control as inference on factor graphs that takes into account the state estimation uncertainty and can be solved efficiently. Our approach to range-aided localization and formation-based navigation is completely decentralized, does not require specialized trajectories to maintain formation, and achieves decimeter-level positioning and formation control accuracy. We demonstrate our approach through multiple real experiments involving formation flights in diverse indoor and outdoor environments.
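The block coordinate descent idea, each robot refining only its own state against shared range measurements, can be sketched in 2D as follows (an illustrative toy, not the authors' estimator; `bcd_localize` and its plain gradient step are assumptions):

```python
import math

def bcd_localize(positions, ranges, iters=300, step=0.1):
    """Block coordinate descent: each robot refines its own 2D position to
    fit inter-robot range measurements, holding the others fixed. `ranges`
    maps robot-index pairs (i, j) to measured distances."""
    pos = [list(p) for p in positions]
    for _ in range(iters):
        for i in range(len(pos)):          # robot i's local update
            gx = gy = 0.0
            for (a, b), d in ranges.items():
                if i not in (a, b):
                    continue
                j = b if a == i else a
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                # Gradient of (dist - d)^2 w.r.t. robot i's position.
                g = 2.0 * (dist - d) / dist
                gx += g * dx
                gy += g * dy
            pos[i][0] -= step * gx
            pos[i][1] -= step * gy
    return pos
```

Because each block update uses only that robot's own measurements, the sweeps need no strict synchronization, which is the property the abstract highlights; odometry factors would be added to each robot's local objective in the full system.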
|
| |
| 15:00-16:30, Paper TuI2I.431 | Add to My Program |
| Muscle Fatigue-Aware Controller for a Semi-Rigid Knee Exoskeleton (I) |
|
| Zhang, Yifang | Istituto Italiano Di Tecnologia |
| Jiang, Jingcheng | Istituto Italiano Di Tecnologia |
| Ajoudani, Arash | Istituto Italiano Di Tecnologia |
| Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Keywords: Physically Assistive Devices, Wearable Robotics, Prosthetics and Exoskeletons
Abstract: Wearable assistive devices that monitor muscle fatigue reduce the risk of work-related musculoskeletal disorders, enhance rehabilitation outcomes, and extend operational time by optimizing the power consumption of the device. This work proposes a muscle fatigue-aware controller (MFAC) for a semi-rigid knee exoskeleton. During an offline calibration phase, we use Gaussian Process Regression (GPR) to model the relationship between muscle activation (measured via EMG) and the corresponding joint moment and angle, enabling fatigue state estimation for the controller. The trained model then approximates muscle activation online using only joint states and moment derived from user’s kinematic data and ground reaction forces provided by the wearable device. The estimated muscle activation is used to assess the muscle fatigue state through a model-based fatigue evaluation module. Notably, EMG measurement is only required during the offline training in our approach, enabling EMG-free online estimation, which significantly enhances the feasibility for long-term mobile applications. Building on muscle fatigue and human-exoskeleton interaction models, we then developed an adaptive controller within a predictive control framework. The resulting optimization problem generates control signals that adjust assistance to reduce the fatigue progression. Two experiments validate the EMG-free fatigue estimation method and the integrated MFAC, demonstrating accurate muscle activation estimation an
|
| |
| 15:00-16:30, Paper TuI2I.432 | Add to My Program |
| Azimuth-LIO: Robust LiDAR-Inertial Odometry Via Azimuth-Aware Voxelization and Probabilistic Fusion |
|
| Zhongguan, Liu | China University of Mining and Technology |
| Li, Wei | China University of Mining and Technology |
| Che, Honglei | Beijing Key Laboratory of Metro Fire and Passenger Transportation Safety, China Academy of Safety Science and Technology |
| Pan, Lu | China University of Mining and Technology |
| Yuan, Shuaidong | China University of Mining and Technology |
|
|
| |
| 15:00-16:30, Paper TuI2I.433 | Add to My Program |
| EMKG: Embodied Memory Knowledge Graphs for Object-Goal Navigation in Dynamic Open Worlds |
|
| Li, Mingyi | Beijing Institute of Technology |
| Liu, Hui | Beijing Institute of Technology |
| Li, Ying | Beijing Institute of Technology |
| Zhang, Shubo | Beijing University of Posts and Telecommunications |
| Gao, Chunle | Unity Embodied Intelligence Robotics Technology Co., Ltd. |
| Xiaokang, Ma | Beijing Institute of Technology |
| Hu, Hanqing | Beijing Institute of Technology |
| Mao, Weixin | Waseda University |
| |
| 15:00-16:30, Paper TuI2I.434 | Add to My Program |
| Safety Evaluation of Motion Plans Using Trajectory Predictors As Forward Reachable Set Estimators |
|
| Chakraborty, Kaustav | University of Southern California |
| Feng, Zeyuan | Stanford University |
| Veer, Sushant | NVIDIA |
| Sharma, Apoorva | NVIDIA |
| Ding, Wenhao | NVIDIA |
| Topan, Sever | Nvidia Corp |
| Ivanovic, Boris | NVIDIA |
| Pavone, Marco | Stanford University |
| Bansal, Somil | Stanford University |
Keywords: Robot Safety, Autonomous Vehicle Navigation, Planning under Uncertainty
Abstract: The advent of end-to-end autonomy stacks—often lacking interpretable intermediate modules—has placed an increased burden on ensuring that the final output, i.e., the motion plan, is safe in order to validate the safety of the entire stack. This requires a safety monitor that is both complete (able to detect all unsafe plans) and sound (does not flag safe plans). In this work, we propose a principled safety monitor that leverages modern multi-modal trajectory predictors to approximate forward reachable sets (FRS) of surrounding agents. By formulating a convex program, we efficiently extract these data-driven FRSs directly from the predicted state distributions, conditioned on scene context such as lane topology and agent history. To ensure completeness, we leverage conformal prediction to calibrate the FRS and guarantee coverage of ground-truth trajectories with high probability. To preserve soundness in out-of-distribution (OOD) scenarios or under predictor failure, we introduce a Bayesian filter that dynamically adjusts the FRS conservativeness based on the predictor’s observed performance. We then assess the safety of the ego vehicle’s motion plan by checking for intersections with these calibrated FRSs, ensuring the plan remains collision-free under plausible future behaviors of others. Extensive experiments on the nuScenes dataset show our approach significantly improves soundness while maintaining completeness, offering a practical and reliable safety monitor for learned autonomy stacks.
|
| |
| 15:00-16:30, Paper TuI2I.435 | Add to My Program |
| SIT-LMPC: Safe Information-Theoretic Learning Model Predictive Control for Iterative Tasks |
|
| Zang, Zirui | University of Pennsylvania |
| Amine, Ahmad | University of Pennsylvania |
| Kokolakis, Nick-Marios | University of Pennsylvania |
| Nghiem, Truong Xuan | University of Central Florida |
| Rosolia, Ugo | Caltech |
| Mangharam, Rahul | University of Pennsylvania |
Keywords: Optimization and Optimal Control, Machine Learning for Robot Control, Planning under Uncertainty
Abstract: Robots executing iterative tasks in complex, uncertain environments require control strategies that balance robustness, safety, and high performance. This paper introduces a safe information-theoretic learning model predictive control (SIT-LMPC) algorithm for iterative tasks. Specifically, we design an iterative control framework based on an information-theoretic model predictive control algorithm to address a constrained infinite-horizon optimal control problem for discrete-time nonlinear stochastic systems. An adaptive penalty method is developed to ensure safety while balancing optimality. Trajectories from previous iterations are utilized to learn a value function using normalizing flows, which enables richer uncertainty modeling compared to Gaussian priors. SIT-LMPC is designed for highly parallel execution on graphics processing units, allowing efficient real-time optimization. Benchmark simulations and hardware experiments demonstrate that SIT-LMPC iteratively improves system performance while robustly satisfying system constraints.
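Information-theoretic MPC of this family typically updates controls by exponentially weighting sampled rollouts by their cost, with constraint violations entering through a penalty term. A generic one-step sketch in that spirit (not the SIT-LMPC algorithm itself; the normalizing-flow value function and GPU parallelism are omitted):

```python
import math
import random

def mppi_step(u_nom, sample_cost, n_samples=256, sigma=0.5, lam=1.0):
    """One information-theoretic MPC update on a scalar control sequence:
    sample perturbed sequences, score them (the cost may include an adaptive
    constraint penalty), and return the exponentially weighted average."""
    H = len(u_nom)
    samples, costs = [], []
    for _ in range(n_samples):
        u = [un + random.gauss(0.0, sigma) for un in u_nom]
        samples.append(u)
        costs.append(sample_cost(u))
    beta = min(costs)                      # shift for numerical stability
    weights = [math.exp(-(c - beta) / lam) for c in costs]
    z = sum(weights)
    return [sum(w * s[t] for w, s in zip(weights, samples)) / z
            for t in range(H)]
```

In a learning-MPC setting, the terminal cost inside `sample_cost` would be the value function learned from previous iterations' trajectories.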
|
| |
| 15:00-16:30, Paper TuI2I.436 | Add to My Program |
| Augmented Reality for RObots (ARRO): Pointing Visuomotor Policies towards Visual Robustness |
|
| Mirjalili, Reihaneh | University of Technology Nuremberg |
| Jülg, Tobias Thomas | University of Technology Nuremberg |
| Walter, Florian | Technical University Munich |
| Burgard, Wolfram | University of Technology Nuremberg |
Keywords: Transfer Learning, Imitation Learning, Perception for Grasping and Manipulation
Abstract: Visuomotor policies trained on human expert demonstrations have recently shown strong performance across a wide range of robotic manipulation tasks. However, these policies remain highly sensitive to domain shifts stemming from background or robot embodiment changes, which limits their generalization capabilities. In this paper, we present ARRO, a novel visual representation that leverages zero-shot open-vocabulary segmentation and object detection models to efficiently mask out task-irrelevant regions of the scene in real time without requiring additional training, modeling of the setup, or camera calibration. By filtering visual distractors and overlaying virtual guides during both training and inference, ARRO improves robustness to scene variations and reduces the need for additional data collection. We extensively evaluate ARRO with Diffusion Policy on a range of tabletop manipulation tasks in both simulation and real-world environments, and further demonstrate its compatibility and effectiveness with generalist robot policies, such as Octo, OpenVLA and Pi0. Across all settings in our evaluation, ARRO yields consistent performance gains, allows for selective masking to choose between different objects, and shows robustness even to challenging segmentation conditions. Videos showcasing our results are available at: https://augmented-reality-for-robots.github.io/
|
| |
| 15:00-16:30, Paper TuI2I.437 | Add to My Program |
| Tactile Object Recognition with Recurrent Neural Networks through a Perceptive Soft Gripper |
|
| Donato, Enrico | Scuola Superiore Sant'Anna |
| Pelliccia, David | Eagleprojects S.p.A |
| Hosseinzadeh, Matin | Kermanshah University of Medical Sciences, Kermanshah |
| Amiri, Mahmood | Kermanshah Univeristy of Medical Sciences |
| Falotico, Egidio | Scuola Superiore Sant'Anna |
Keywords: Soft Robot Applications, Deep Learning in Grasping and Manipulation, Force and Tactile Sensing
Abstract: Soft robot perception integrates information from distributed, multi-modal sensors, broadening their application to active interaction. Our work introduces recurrent learning models for tactile-based object recognition, demonstrating comparable performance in virtual and real-world scenarios. The work focuses on soft grippers, which facilitate adaptation to objects of varying shapes and sizes thanks to passive finger compliance. Our model successfully identifies over sixteen heterogeneous objects. Findings underscore the significance of sensory multi-modality over single modalities. We highlight how spatial distribution and sensory signal dynamics influence overall estimation accuracy, and identify the minimal set of grasps needed to achieve reliable recognition.
|
| |
| TuI2LB Late Breaking Results Session, Hall C |
Add to My Program |
| Late Breaking Results 2 |
|
| |
| |
| 15:00-16:30, Paper TuI2LB.1 | Add to My Program |
| Integrating Autonomy into Bioinspired Soft Robot – the Bionic Turtle Walker |
|
| Tauber, Falk | Universität Freiburg |
| Ruppert, Sebastian | University of Freiburg |
| Speck, Thomas | Freiburg University |
Keywords: Robotics in Under-Resourced Settings, Robotics and Automation in Life Sciences, Search and Rescue Robots
Abstract: Drawing inspiration from living nature enables scientists to develop bioinspired soft robotic systems with more efficient motion and control schemes than classical robotic systems. Their inherent compliance, owing to bodies made of flexible materials, makes soft robots ideal for human-machine interaction. With novel electronics-free pneumatic logic control systems, these robots can be built entirely soft and can cope with changing or extreme environments in which classical electronic robots would fail. Such electronics-free control systems allow the control to be integrated directly into the body of the soft robot. Still widely lacking, however, are feedback systems that enable the robot to change its behavior in response to environmental cues. In this study, we present an advanced integrated pneumatic control system that, coupled with a soft pneumatic sensor, is able to change the walking behavior of our turtle walker. This novel bioinspired soft robot, with an integrated pneumatic logic control system capable of computing sensory inputs, marks a first step towards integrating autonomy into electronics-free soft robots.
|
| |
| 15:00-16:30, Paper TuI2LB.2 | Add to My Program |
| An Efficient Learning-Based Task Planning Approach Using a Bio-Inspired Action Context-Free Grammar for Bimanual Manipulation |
|
| David, Carmona | National University of Singapore |
| Yang, Jun | National University of Singapore |
| Yu, Haoyong | National University of Singapore |
Keywords: Task and Motion Planning, Learning from Demonstration, Bimanual Manipulation
Abstract: Task and Motion Planning (TAMP) frameworks for bimanual robots are limited by the combinatorial explosion at the task planning level, which can negatively affect human-robot interaction. This work introduces BAG-Learn Planning, an efficient learning-based task planning approach that combines a Bio-Inspired Action Context-Free Grammar (BAG) with a Long-Short-Term Memory (LSTM) network to infer symbolic task plans and achieve bimanual manipulation. The proposed approach replaces costly symbolic search with efficient inference by formulating task planning as sequence prediction over grammar-compliant symbolic representations. Experimental comparisons with the classical Fast Downward task planner across three activities demonstrate significant reductions in task planning time, with millisecond-scale planning achieved for both seen and unseen goals. Additional results show robustness to increasing numbers of objects and symbolic locations, thus mitigating combinatorial explosion. BAG-Learn Planning is integrated with a Rapidly Exploring Random Tree (RRT) motion planner to form a complete TAMP framework. The latter is deployed on a physical bimanual robotic platform to achieve three household activities: pouring, opening, and passing.
|
| |
| 15:00-16:30, Paper TuI2LB.3 | Add to My Program |
| Whole-Body Balance Control of Wheeled-Bipedal Robots for Perception-Less Terrain Adaptation |
|
| Lee, Young Hun | Korea Institute of Machinery & Materials |
| Ahn, Jeongdo | Korea Institute of Machinery and Materials |
| Park, Dongil | Korea Institute of Machinery and Materials (KIMM) |
Keywords: Whole-Body Motion Planning and Control, Wheeled Robots, Optimization and Optimal Control
Abstract: In this paper, we present a whole-body control framework that allows a wheeled-bipedal robot to achieve robust locomotion across diverse environments without relying on terrain perception. The proposed approach consists of a whole-body motion planner and an optimization-based torque computation module. By considering the floating-base dynamics of the robot, the motion planner produces terrain-adaptive behaviors using the zero moment point (ZMP) to preserve balance without prior knowledge of the terrain. In addition, the torque computation module combines a linear quadratic regulator (LQR) with a quadratic programming (QP)-based controller. The LQR computes wheel torques to regulate the body angle while addressing the inherent non-minimum phase characteristics. Using these wheel torques, the QP-based controller allocates optimal joint torques to achieve the desired motion and maintain stable balance. The proposed framework is validated on a wheeled-bipedal robot, demonstrating locomotion over various terrains, including slopes and stairs, as well as robustness against external disturbances.
|
| |
| 15:00-16:30, Paper TuI2LB.4 | Add to My Program |
| GLaMP: A Grounded Language Model-Based Multi-Agent System for Long-Horizon Robotic Task Planning in Industrial Settings |
|
| Chen, Hongpeng | Hong Kong Polytechnic University |
| Navarro-Alarcon, David | The Hong Kong Polytechnic University |
| Zheng, Pai | The Hong Kong Polytechnic University |
Keywords: Intelligent and Flexible Manufacturing, Task Planning, Assembly
Abstract: This paper presents GLaMP, a grounded language model-based multi-agent framework for long-horizon robotic task planning in industrial environments. A key challenge in such tasks lies in the gap between high-level language reasoning and low-level perceptual grounding, which often leads to error accumulation during execution. GLaMP introduces a closed-loop architecture where a vision-language model extracts hierarchical task structures from manuals, a perception module grounds multimodal observations into symbolic predicates, and a large language model generates executable behavior trees. Through bidirectional feedback between perception and planning, the system continuously verifies and updates symbolic states, improving robustness in long-horizon execution. Preliminary experiments on representative industrial tasks demonstrate improved task success rates and reliability compared to existing approaches, highlighting the effectiveness of closed-loop grounding for robotic task planning.
|
| |
| 15:00-16:30, Paper TuI2LB.5 | Add to My Program |
| Side-Scan Sonar SLAM Using Ping-Level Landmark Detection in Feature-Poor Seabed Environments |
|
| Im, Jinho | Keimyung University |
| Hong, Seonghun | Keimyung University |
Keywords: Marine Robotics, SLAM, Localization
Abstract: Side-scan sonar (SSS) is a particularly attractive sensing modality for underwater simultaneous localization and mapping (SLAM), offering wide-area seabed coverage and reliable acoustic measurements over long ranges. Many existing SSS-based SLAM approaches rely on image-domain processing, which depends on sufficiently rich image features and can struggle in feature-poor or homogeneous seabed environments. Furthermore, even when image-domain features are present, range-dependent intensity variations and speckle noise inherent in SSS measurements can degrade the reliability of feature extraction and data association. This study proposes a ping-level SSS SLAM framework that directly exploits raw backscatter intensity profiles without relying on image formation. By characterizing the nominal seafloor response and identifying structurally salient deviations in the acoustic intensity profiles, reliable landmark measurements are extracted at the ping level and incorporated into a landmark-based SLAM framework. This formulation preserves the native sensing geometry of SSS measurements and enables robust landmark extraction even in feature-sparse environments. The proposed approach is validated through real-world field experiments, demonstrating improved robustness and localization accuracy in challenging seabed conditions.
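The ping-level idea, characterizing the nominal seafloor response and flagging salient deviations, can be sketched with a robust median/MAD detector over range bins (illustrative only; the paper's actual landmark extraction may differ):

```python
def detect_ping_landmarks(pings, k=4.0):
    """Flag range bins whose backscatter intensity deviates saliently from
    the nominal seafloor response. The nominal response per range bin is the
    median across pings; deviations beyond k robust standard deviations
    (via the median absolute deviation) are reported as landmark candidates."""
    n_bins = len(pings[0])
    nominal, scale = [], []
    for b in range(n_bins):
        col = sorted(p[b] for p in pings)
        med = col[len(col) // 2]
        mad = sorted(abs(v - med) for v in col)[len(col) // 2]
        nominal.append(med)
        scale.append(1.4826 * mad or 1e-9)   # MAD -> sigma under Gaussian noise
    hits = []
    for i, ping in enumerate(pings):
        for b in range(n_bins):
            if abs(ping[b] - nominal[b]) > k * scale[b]:
                hits.append((i, b))          # (ping index, range bin)
    return hits
```

Estimating the nominal response per range bin implicitly absorbs the range-dependent intensity falloff the abstract mentions, so no image formation or flat-fielding is needed before detection.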
|
| |
| 15:00-16:30, Paper TuI2LB.7 | Add to My Program |
| Stereo-Based Vision and Tactile Sensing for Robust Dual-Arm Robotic Connector Assembly |
|
| Mirto, Michele | Università Degli Studi Della Campania "Luigi Vanvitelli" |
| Caporali, Alessio | University of Bologna |
| Pirozzi, Salvatore | Università Degli Studi Della Campania "Luigi Vanvitelli" |
| Palli, Gianluca | University of Bologna |
Keywords: Intelligent and Flexible Manufacturing, Assembly, Dual Arm Manipulation
Abstract: The automation of Deformable Linear Object (DLO) manipulation remains a key challenge in industrial production. While prior works demonstrated reliable wire terminal insertion using vision and tactile sensing, they typically assume a fixed connector pose. This paper presents a dual-arm robotic system for fully autonomous connector assembly. A stereo vision system enables robust 6D pose estimation of the wire terminal, while a custom mechatronic gripper with integrated tactile sensing supports accurate insertion monitoring. In parallel, the second arm performs connector grasping. By combining complementary visual and tactile feedback across both manipulators, the system achieves the precision required for tight-tolerance insertion without fixed fixtures.
|
| |
| 15:00-16:30, Paper TuI2LB.8 | Add to My Program |
| Embodied Stability in a Minimally-Actuated Soft Robot for Autonomous Exploration |
|
| Salem, Lior | Technion |
| Vichik, Adam | Technion |
| Gat, Amir | Technion - Israel Institute of Technology |
| Or, Yizhar | Technion |
Keywords: Search and Rescue Robots, Modeling, Control, and Learning for Soft Robots, Autonomous Agents
Abstract: Soft robots offer an opportunity to embed intelligence directly into morphology, potentially reducing the need for continuous feedback regulation. We present an autonomous, minimally actuated multi-stable soft robot for exploration in confined and cluttered environments. The robot is composed of a serial chain of multi-stable elastic elements whose energy landscape encodes discrete, passively stable configurations, enabling reversible shape transformation and shape retention without sustained actuation. A single mobile pneumatic actuator triggers transitions between these stable states, producing complex three-dimensional configurations with minimal hardware complexity. Autonomy is achieved through the integration of nonlinear hybrid modeling, visual pose estimation, and sampling-based motion planning within a ROS2 framework. Rather than regulating continuous deformation, computation in our system selects and sequences mechanically admissible state transitions, while structural multi-stability provides inherent stabilization and memory. Experimental results demonstrate closed-loop navigation in cluttered environments using this distributed balance between mechanics and control.
|
| |
| 15:00-16:30, Paper TuI2LB.9 | Add to My Program |
| A Large Language Model-Based Mission Manager for Autonomous UAV Control |
|
| Cihlar, Milos | Brno University of Technology |
Keywords: AI-Based Methods, Task and Motion Planning, Aerial Systems: Applications
Abstract: Autonomous unmanned aerial vehicles (UAVs) are traditionally controlled using behavior trees or state machines, which provide deterministic execution but limited adaptability in dynamic environments. Extending these conventional systems to handle new tasks requires manual specification of additional nodes or transitions, creating a scalability challenge as mission complexity increases. This work introduces a high-level mission manager leveraging local Large Language Models (LLMs) for autonomous UAV control. The system allows operators to issue high-level commands in natural language, which the LLM interprets and decomposes into sequences of ROS 2 actions, such as takeoff, navigation, object localization, and landing, without mission-specific programming. The LLM does not directly control the UAV but selects from a constrained set of tools mapped to ROS 2 actions or services. Real-time robot state is injected into the model context, ensuring that decisions are based on actual system status and environment perception.
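The constrained tool-selection pattern the abstract describes, where the LLM may only invoke a whitelisted set of ROS 2-backed actions, can be sketched as a validation layer between model output and the flight stack. The registry and tool names below are hypothetical:

```python
def dispatch(tool_call, tools):
    """Validate an LLM-proposed tool call against a constrained registry
    before anything reaches the robot. `tool_call` is the model's parsed
    output: {"name": ..., "args": {...}}."""
    name = tool_call.get("name")
    if name not in tools:
        return False, f"unknown tool: {name!r}"
    allowed_args, handler = tools[name]
    extra = set(tool_call.get("args", {})) - set(allowed_args)
    if extra:
        return False, f"unexpected args: {sorted(extra)}"
    return True, handler(**tool_call.get("args", {}))

# Hypothetical registry: tool name -> (allowed args, ROS 2 action wrapper).
TOOLS = {
    "takeoff": (["altitude_m"],
                lambda altitude_m=2.0: f"takeoff to {altitude_m} m"),
    "land": ([], lambda: "landing"),
}
```

Rejected calls would be fed back to the model as context rather than executed, which keeps the LLM advisory and the action set bounded.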
|
| |
| 15:00-16:30, Paper TuI2LB.10 | Add to My Program |
| Hierarchical Grid-Based Sensor Pose Extraction for Demonstration Dataset Generation |
|
| Lim, Doyu | Pohang University of Science and Technology (POSTECH) |
| Park, Chaewon | POSTECH (Pohang University of Science and Technology) |
| Han, Soohee | Pohang University of Science and Technology ( POSTECH ) |
Keywords: Data Sets for Robot Learning, Reactive and Sensor-Based Planning, Learning from Demonstration
Abstract: High-quality 3D reconstruction of unknown small objects with complex surface details is important in applications such as digital preservation and cultural heritage archiving. In practice, such scanning procedures rely heavily on skilled human experts, but the high cost of expert training and the large number of objects requiring digitization make this process difficult to scale. This motivates the need to construct expert demonstration datasets as a foundation for future automated view planning. However, available scan data often contain only frame-level geometry without per-frame sensor poses. To address this issue, we propose a hierarchical grid-based method for extracting sensor poses from frame-based scan data. The proposed method progressively refines candidate poses through coarse-to-fine grid search and selects poses that effectively observe the target surface. Experimental results show an average coverage of 0.85, demonstrating the practicality of the proposed approach for expert demonstration dataset construction.
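The coarse-to-fine grid refinement can be sketched in one dimension as follows (an illustrative reduction of the hierarchical pose search described above; the function name and parameters are assumptions, and the real method scores full sensor poses by surface coverage rather than a scalar):

```python
def coarse_to_fine_search(score, lo, hi, levels=6, divisions=5):
    """Hierarchical grid search over one pose parameter: evaluate `score`
    at the centers of a coarse grid, then repeatedly shrink the grid around
    the best cell. The same pattern applies per dimension of a pose grid."""
    best = None
    for _ in range(levels):
        step = (hi - lo) / divisions
        cells = [lo + step * (i + 0.5) for i in range(divisions)]
        best = max(cells, key=score)
        lo, hi = best - step, best + step   # zoom into the winning cell
    return best
```

Each level multiplies the resolution by `divisions / 2` while keeping the number of evaluations per level constant, which is what makes the coarse-to-fine strategy cheaper than a single dense grid.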
|
| |
| 15:00-16:30, Paper TuI2LB.11 | Add to My Program |
| Manipulator-Effort-Aware MPC for Body Motion Coordination of an Underwater Walking Robot under Ocean Current Disturbances |
|
| Koo, Bonhak | University of Science and Technology |
| Jun, Bong Huan | KRISO (Korea Research Institute of Ships and Ocean Engineering) |
| Park, Daegil | Korea Research Institute of Ships & Ocean Engineering (KRISO) |
Keywords: Marine Robotics, Legged Robots
Abstract: This paper proposes a manipulator-effort-aware model predictive control (MPC) framework for coordinating body motion of a multi-legged underwater walking robot during moored mine clearance under ocean-current disturbances. The method treats the norm of manipulator joint torques as an effort-related input and uses it in an upper-layer MPC to adapt the robot’s body approach motion and posture, while lower-level whole-body control and impedance control handle body stabilization and rope grasping. Simulation studies in a ROS1 Noetic–Gazebo environment with a UUV-Simulator-based model show that, under increasing unidirectional current, conventional decoupled controllers cause manipulator torques to grow and approach saturation, whereas the proposed framework keeps torques within safe limits by generating adaptive body-motion compensation. These results indicate improved mechanical stability and reduced manipulator burden during rope-grasping interactions, though validation is currently limited to a simplified unidirectional flow without full locomotion and is sensitive to model accuracy, motivating future experiments with more complex currents, dynamic walking scenarios, and refined hydrodynamic modeling.
|
| |
| 15:00-16:30, Paper TuI2LB.12 | Add to My Program |
| Optimal Path Planning for USV-AUV Docking under Various Marine Environmental Conditions |
|
| Park, Daegil | Korea Research Institute of Ships & Ocean Engineering (KRISO) |
| Lee, Seungyeon | Korea Research Institute of Ships and Ocean Engineering |
| Park, Junwoo | University of Science and Technology (Korea Research Institute of Ships and Ocean Engineering) |
| Kim, Hyungwoo | Korea Research Institute of Ships & Ocean Engineering |
| Jun, Bong Huan | KRISO (Korea Research Institute of Ships and Ocean Engineering) |
Keywords: Marine Robotics, Planning, Scheduling and Coordination
Abstract: The escalating demand for precision in maritime missions has led to the development of collaborative heterogeneous multi-robot systems, specifically pairing Autonomous Surface Vehicles (USVs) with Autonomous Underwater Vehicles (AUVs). Autonomous docking is essential for mission persistence, allowing AUVs to use USVs for recharging and data offloading, yet achieving reliable docking is difficult because these underactuated platforms are highly susceptible to wind and current disturbances. This paper introduces a specialized simulation framework utilizing a MATLAB-based Graphical User Interface (GUI) and 6-DOF equations of motion to evaluate docking success rates in real-time by analyzing measured environmental vectors. Through a scoring framework incorporating the Continuous Ranked Probability Score (CRPS), the system identifies optimal docking headings where environmental forces are minimized or exhibit a force-offsetting effect. To ensure kinematic feasibility, the trajectory planning logic integrates minimum turning radii of USV and AUV, while temporal synchronization is maintained via Estimated Time of Arrival (ETA) calculations at each waypoint. The proposed algorithm was implemented in C++ within the ROS2 framework and validated through stationary and collaborative docking scenarios under stochastic loads. Experimental results confirm that aligning the docking axis with optimized directions allows for stable docking performance.
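For a Gaussian forecast of an environmental force component, the CRPS used in the scoring framework has a well-known closed form, sketched below (generic CRPS, not the paper's full docking-heading scoring pipeline):

```python
import math

def crps_gaussian(mu, sigma, obs):
    """Continuous Ranked Probability Score of a Gaussian forecast
    N(mu, sigma^2) against an observed value. Lower is better; it rewards
    forecasts that are both sharp and well calibrated."""
    z = (obs - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))
```

A heading whose predicted environmental loading scores low CRPS against measurements is one the planner can trust, which is how the score feeds into selecting the docking direction.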
|
| |
| 15:00-16:30, Paper TuI2LB.13 | Add to My Program |
| Experimental Validations of a Digital Twin Model for Underwater Tracked Vehicle |
|
| Han, Jong-Boo | Korea Research Institute of Ships and Ocean Engineering |
| Lee, Sangjin | University of Science and Technology |
| Lee, Yeongjun | Korea Research Institute of Ships and Ocean Engineering |
| Park, Daegil | Korea Research Institute of Ships & Ocean Engineering (KRISO) |
| Yeu, Tae-Kyeong | KRISO (Korea Research Institute of Ships & Ocean Engineering) |
Keywords: Marine Robotics
Abstract: This paper presents the experimental validation of a digital twin model for the Cyber Physics Operation System (CPOS) ROVER, a tracked underwater robotic platform designed for seabed operations. Based on a high-fidelity multibody dynamics model, the digital twin incorporates track-ground interactions and external underwater forces to simulate locomotion under ground-contact conditions. To evaluate the accuracy of the model, a series of water-tank experiments were conducted. The robot performed a round-trip trajectory while its motion was recorded using an external vision-based tracking system. The simulation results were then quantitatively compared with the experimental measurements to assess position and trajectory tracking performance. The comparison demonstrates a high degree of agreement between the digital twin and the experimental data. These findings validate the effectiveness of the proposed digital twin framework for representing tracked underwater locomotion, highlighting its potential for controller development and performance evaluation prior to field deployment.
|
| |
| 15:00-16:30, Paper TuI2LB.14 | Add to My Program |
| Development of the Saemangeum Digital Test-Bed (K-URSim) for Unmanned Underwater Vehicles |
|
| Lee, Yeongjun | Korea Research Institute of Ships and Ocean Engineering |
| Han, Jong-Boo | Korea Research Institute of Ships and Ocean Engineering |
| Kim, Sangsu | University of Science and Technology (UST), Korea Research Institute of Ships & Ocean Engineering (KRISO) |
| Kang, Hojun | KRISO |
| Kim, Kihun | KRISO |
Keywords: Marine Robotics, Computer Architecture for Robotic and Automation
Abstract: This paper presents K-URSim (KRISO Underwater Robot Simulator), a ROS2-based modular simulation platform that serves as the backbone of the Saemangeum Digital Marine Testbed for unmanned marine systems (UMS). Unlike conventional simulators, K-URSim is tightly coupled with the real inshore test site, integrating in-situ ocean data such as currents, waves, and bathymetry into the simulation loop for realistic environment reproduction and data-driven validation. The platform adopts a modular architecture (KRISO Extensions) supporting vehicle modeling, physics-based dynamics, sensing, planning, control, and external interfaces within a unified ROS2 framework. By bridging simulation and real-world experiments, K-URSim enables pre-validation of control algorithms and mission scenarios prior to deployment, reducing cost and risk. It also supports reinforcement learning-based autonomy and synthetic data generation for sim-to-real transfer. The system can also integrate with NVIDIA Omniverse for digital-physical hybrid testing and sim-to-real validation.
|
| |
| 15:00-16:30, Paper TuI2LB.15 | Add to My Program |
| Geometric Correction of Underwater Forward Looking Sonar-Based 3-D Reconstruction Via PS-Based Slope Pattern Interpretation Using AUV |
|
| Ham, Seungwon | Pohang University of Science and Technology |
| Ku, Bonchul | Pohang University of Science and Technology |
| Song, Young-woon | Pohang University of Science and Technology (POSTECH) |
| Kim, Jason | HEROLab (in Univ. POSTECH) |
| Jang, You hyun | KHNP |
| Seol, Woojin | Korea Hydro & Nuclear Power Co., Ltd |
| Yu, Son-Cheol | Pohang University of Science and Technology (POSTECH) |
Keywords: Marine Robotics, Mapping, Range Sensing
Abstract: Forward-looking sonar (FLS) enables long-range underwater sensing. In FLS-based 3-D reconstruction, falsely inclined surfaces arise from elevation ambiguity caused by the finite vertical beamwidth. Existing approaches mitigate these errors using multi-pass strategies, but they require repeated observations, which are often impractical in real-world underwater operations. To address this, we propose a pattern-informed geometric refinement framework that leverages structural patterns from profiling sonar (PS) to resolve ambiguity in FLS-based reconstruction. Within this framework, geometric patterns within ambiguity-dominated intervals are analyzed to distinguish between physically valid surfaces and falsely inclined surfaces, and selective geometric refinement is applied accordingly. Experimental results demonstrate effective suppression of falsely inclined surfaces and improved reconstruction accuracy without trajectory modifications. This provides a practical solution for reliable 3-D mapping and perception in underwater robotic applications.
|
| |
| 15:00-16:30, Paper TuI2LB.16 | Add to My Program |
| Toward Multimodal Liquid-Level Estimation for Closed-Loop Robotic Pouring |
|
| Deng, Hongyu | The Chinese University of Hong Kong |
| Chen, He | The Chinese University of Hong Kong |
Keywords: Perception for Grasping and Manipulation
Abstract: We consider the problem of real-time liquid-level estimation for closed-loop robotic pouring. To this end, we propose a fast-slow architecture where a Vision-Language Model handles high-level task reasoning and a sensor-driven fast system provides low-latency feedback. As a first instantiation of the fast system, we present RadarEye, a mmWave radar signal processing pipeline that tracks liquid level during pouring. RadarEye combines (i) AoA–ToF beamforming for liquid surface localization with (ii) a physics-informed tracker that suppresses multipath interference. In real-robot experiments, RadarEye achieves 0.35 cm median error at 0.62 ms per-update latency, outperforming vision and ultrasound baselines.
|
| |
| 15:00-16:30, Paper TuI2LB.17 | Add to My Program |
| Active Perception for Deformable Linear Objects Stiffness Estimation |
|
| Dong, Chengxiao | University of Bologna |
| Caporali, Alessio | University of Bologna |
| Lan, Hongyu | University of Bologna |
| Palli, Gianluca | University of Bologna |
Keywords: Perception for Grasping and Manipulation, Reinforcement Learning, Grasping
Abstract: Estimating the stiffness of Deformable Linear Objects (DLOs) is crucial for robust manipulation. Inferring this hidden property depends heavily on the physical interaction strategy. Through a 1D CNN-based analysis of predefined probing modes, we first demonstrate that boundary constraints and grasp locations drastically alter stiffness identifiability. While fixed-end setups yield highly informative responses, they are rarely practical in unconstrained tasks. Consequently, we move beyond manual heuristics and reframe DLO parameter identification as an active perception problem. We propose a Reinforcement Learning (RL) framework that autonomously learns informative interaction strategies for free cables. By coupling a Proximal Policy Optimization (PPO) agent with a trajectory-aware estimator, the system dynamically excites the DLO to extract stiffness from diverse, stochastic manipulation sequences. Achieving a Mean Absolute Error (MAE) of 0.0192, our approach provides a robust, active paradigm that overcomes the limitations of static probing in unconstrained environments.
|
| |
| 15:00-16:30, Paper TuI2LB.18 | Add to My Program |
| Perception of Social Robots As Communication Partners in Healthcare for Older Adults |
|
| Yamamoto, Hana | Karlsruhe Institute of Technology |
| Mayer, Carlotta Julia | Heidelberg University |
| Raithel, Charlotte | Heidelberg University |
| Buchner, Theresa | Heidelberg University |
| Werner, Christian | Heidelberg University |
| Hirata, Yasuhisa | Tohoku University |
| Eckstein, Monika | Heidelberg University |
| Mombaur, Katja | Karlsruhe Institute of Technology |
Keywords: Social HRI, Robot Companions, Acceptability and Trust
Abstract: Addressing the global caregiver shortage through socially assistive robots necessitates a deep understanding of their psychological and physiological impacts on older adults. This study addresses whether social robots can serve as effective interaction partners compared to humans, and if "positive prompts" can similarly enhance these interactions. We conducted a comparative study with 35 participants (aged 70+) to evaluate responses during both human-human and human-robot encounters, including an assessment of "positive prompts" for cognitive reappraisal. Our multi-modal analysis, integrating facial expression data, heart rate variability, and subjective questionnaires, revealed no significant differences in overall stress levels between human and robot interactions. Facial expression analysis confirmed that the robot was accepted as a valid interaction partner, while physiological data showed slightly lower heart rates during robot interactions, suggesting a more relaxed state compared to human-led sessions. These findings indicate that social robots can engage older adults without inducing psychological strain and are capable of alleviating caregiver burden by performing structured tasks, such as health-sensing surveys. Future work should address the identified "appearance-content mismatch" in robot design to facilitate even more natural and effective interactions.
|
| |
| 15:00-16:30, Paper TuI2LB.19 | Add to My Program |
| Task Correlation and Edge-Cost Estimation for Robotic Task Sequencing Problem in Torus C-Space |
|
| Yonrith, Phayuth | Chonnam National University |
| Hong, Ayoung | Chonnam National University |
Keywords: Task and Motion Planning
Abstract: Robotic manipulators are increasingly used in diverse applications, ranging from industrial automation to human-centered tasks such as grocery picking and packaging, where they are often required to perform sequences of tasks while maintaining motion optimality and collision-free operation over long horizons. This type of problem is known as the Robotic Task Sequencing Problem. Most existing works address this problem by reducing it to the Traveling Salesman Problem (TSP) within a purely Euclidean framework, neglecting the robot's inherently non-Euclidean toroidal C-space topology. This simplification limits the selection of feasible configurations and may lead to failures or to suboptimal, detoured motions. In this paper, we propose a robotic task sequencing problem solver that incorporates the robot's natural T^n topology and joint limits.
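The effect of the toroidal topology on edge costs can be seen in a minimal sketch. This is the generic wrapped metric on T^n, ignoring the joint limits that the paper additionally handles:

```python
import numpy as np

def torus_distance(q1, q2):
    """Geodesic distance on T^n: each joint difference is wrapped to
    [-pi, pi], so an edge may cross the +/-pi boundary when that is shorter.
    A purely Euclidean edge cost would miss these shortcuts."""
    d = np.mod(np.asarray(q2) - np.asarray(q1) + np.pi, 2.0 * np.pi) - np.pi
    return float(np.linalg.norm(d))
```

For two configurations near opposite sides of the +/-pi boundary, the Euclidean cost is close to 2*pi while the toroidal cost is close to zero, which changes which task sequence a TSP solver prefers.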
|
| |
| 15:00-16:30, Paper TuI2LB.20 | Add to My Program |
| Disturbance-Adaptive Differentiable MPC for Underwater Structure Inspection Using Underwater Robot |
|
| Kim, Minjong | Pohang University of Science and Technology (POSTECH) |
| Ku, Bonchul | Pohang University of Science and Technology |
| Kim, Dongsub | POSTECH (Pohang University of Science and Technology) |
| Song, Young-woon | Pohang University of Science and Technology (POSTECH) |
| Seol, Woojin | Korea Hydro & Nuclear Power Co., Ltd |
| Jang, You hyun | KHNP |
| Yu, Son-Cheol | Pohang University of Science and Technology (POSTECH) |
Keywords: Marine Robotics, Robotics in Hazardous Fields, Task and Motion Planning
Abstract: Close-proximity inspection of underwater cylindrical structures is challenging due to nonlinear vehicle dynamics, flow disturbances, and payload uncertainty. Fixed-weight MPC provides structured constraint handling but lacks adaptivity, while model-free RL is adaptive but often unstable and unsafe under disturbances. We propose Marine AC-MPC, which combines a differentiable iLQR-based MPC layer with an actor-critic framework that learns time-varying MPC cost weights online. In MarineGym, the proposed method achieves more reliable orbit tracking and higher success rates than fixed-weight MPC and PPO baselines under disturbed conditions.
|
| |
| 15:00-16:30, Paper TuI2LB.21 | Add to My Program |
| Why Cognitive Robotics Matters: Lessons from OntoAgent and LLM Deployment in HARMONIC for Safety-Critical Robot Teaming |
|
| Oruganti, Sanjay | Rensselaer Polytechnic Institute |
| Nirenburg, Sergei | Rensselaer Polytechnic Institute |
| McShane, Marjorie | Rensselaer Polytechnic Institute |
| English, Jesse | Rensselaer Polytechnic Institute |
| Roberts, Michael | Rensselaer Polytechnic Institute |
| Arndt, Christian | Rensselaer Polytechnic Institute |
| Parasuraman, Ramviyas | University of Georgia |
| Sentis, Luis | The University of Texas at Austin |
Keywords: Cognitive Control Architectures, Embodied Cognitive Science, Safety in HRI
Abstract: Robots operating alongside humans must recognize what they do not know before acting, diagnose problems from domain knowledge, and reason about action consequences. These capabilities are operational requirements, not optimization targets, and their absence produces silent and unrecoverable failures. We present a first-of-its-kind controlled comparison between OntoAgent, our content-centric cognitive architecture, and six LLMs spanning frontier and efficient tiers as drop-in replacements at the strategic layer of the same robotic system in HARMONIC. LLMs fail to verify their knowledge state before acting, even when given equivalent procedural knowledge. The deficit is architectural, not knowledge-based. Knowledge-grounded architectures must retain decision authority; LLMs contribute where their strengths apply.
|
| |
| 15:00-16:30, Paper TuI2LB.22 | Add to My Program |
| Design by Robot: Relocating Obstacles for Efficient Mobile Robot Navigation |
|
| Samarakoon Mudiyanselage, Bhagya Prasangi Samarakoon | Singapore University of Technology and Design |
| Muthugala Arachchige, Viraj Jagathpriya Muthugala | Singapore University of Technology and Design |
| Rishan Sachinthana, Wijenayaka Kankanamge | Singapore University of Technology and Design |
| Elara, Mohan Rajesh | Singapore University of Technology and Design |
| |
| 15:00-16:30, Paper TuI2LB.23 | Add to My Program |
| GPT-PDDL: Towards Executable Robot Task Planning |
|
| Lee, Chang Sik | Korea Institute of Industrial Technology |
| Cho, Hye-Kyung | Hansung University |
| You, Sujeong | Korea Institute of Industrial Technology |
Keywords: AI-Based Methods, Assembly, Visual Learning
Abstract: Given the recent significant advancements in the video understanding capabilities of Large Language Models (LLMs), there is growing interest in research that automatically generates executable robot task plans from human demonstration videos. Existing LLM-based symbolic planning approaches often rely on manually defined Planning Domain Definition Language (PDDL) domains or fixed action primitives. This paper proposes GPT-PDDL, a framework that infers step-by-step task procedures from demonstration videos and converts them into robot plans based on PDDL.
|
| |
| 15:00-16:30, Paper TuI2LB.24 | Add to My Program |
| Learning from Demonstrations Over Riemannian Manifolds Using Neural ODEs |
|
| Cuervo Espinosa, Diana Gerlid | Technical University of Munich |
| Anand, Mahathi | Technical University of Munich |
| Schoellig, Angela P. | TU Munich |
Keywords: Learning from Demonstration, Imitation Learning, Machine Learning for Robot Control
Abstract: Learning from demonstrations (LfD) is usually performed over Euclidean spaces, while the robot state, e.g. orientation, naturally evolves over curved spaces. Therefore, to ensure natural, complex motion generation, we investigate learning from demonstrations over Riemannian manifolds that are capable of encoding both position and orientation data. Here, geodesic paths provide natural motion between two arbitrary points within the manifold. We propose to numerically estimate geodesics via neural ordinary differential equations, mitigating the large computational overhead of existing approaches. Finally, these geodesics can be decoded back into the original task space before deployment on the robot. In this extended abstract, we discuss the architecture of our framework, provide some initial insights from our simulation experiments, including a comparison to other geodesic computation mechanisms, and discuss the challenges and prospects for future work.
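As background, a geodesic satisfies a second-order ODE (the geodesic equation), which is the classical object a neural ODE would approximate. A minimal numerical sketch on S^2 in spherical coordinates, not the authors' architecture:

```python
import numpy as np
from scipy.integrate import solve_ivp

def sphere_geodesic_rhs(t, y):
    """Geodesic equation on S^2 in coordinates (theta, phi):
    theta'' = sin(theta)*cos(theta)*phi'^2,
    phi''   = -2*cot(theta)*theta'*phi'."""
    th, ph, dth, dph = y
    return [dth, dph,
            np.sin(th) * np.cos(th) * dph ** 2,
            -2.0 * (np.cos(th) / np.sin(th)) * dth * dph]

# Start on the equator with unit tangential velocity: the geodesic is the
# equator itself, traversed at constant speed (theta stays at pi/2, phi = t).
sol = solve_ivp(sphere_geodesic_rhs, (0.0, 1.0),
                [np.pi / 2, 0.0, 0.0, 1.0], rtol=1e-9, atol=1e-12)
```

Solving this ODE for arbitrary boundary points (e.g. by shooting) is what becomes expensive for high-dimensional manifolds, which motivates learning the flow with a neural ODE instead.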
|
| |
| TuBT1 Award Session, Hall A2 |
Add to My Program |
| Award Finalists 2 |
|
| |
| Co-Chair: Harada, Kensuke | Osaka University |
| |
| 16:45-16:55, Paper TuBT1.1 | Add to My Program |
| HEXAR: A Hierarchical Explainability Architecture for Robots |
|
| Love, Tamlin | Institut De Robòtica I Informàtica Industrial CSIC-UPC |
| Gebellí, Ferran | PAL Robotics |
| Pramanick, Pradip | University of Naples Federico II |
| Andriella, Antonio | Institut De Robòtica I Informàtica Industrial CSIC-UPC |
| Alenyà, Guillem | Institut De Robòtica I Informàtica Industrial CSIC-UPC |
| Garrell, Anaís | Institut De Robòtica I Informàtica Industrial CSIC-UPC |
| Ros, Raquel | Artificial Intelligence Research Institute IIIA-CSIC |
| Rossi, Silvia | Universita' Di Napoli Federico II |
Keywords: Social HRI, Natural Dialog for HRI, Acceptability and Trust
Abstract: As robotic systems become increasingly complex, the need for explainable decision-making becomes critical. Existing explainability approaches in robotics typically either focus on individual modules, which can be difficult to query from the perspective of high-level behaviour, or employ monolithic approaches, which do not exploit the modularity of robotic architectures. We present HEXAR (Hierarchical EXplainability Architecture for Robots), a novel framework that provides a plug-in, hierarchical approach to generate explanations about robotic systems. HEXAR consists of specialised component explainers using diverse explanation techniques (e.g., LLM-based reasoning, causal models, feature importance) tailored to specific robot modules, orchestrated by an explainer selector that chooses the most appropriate one for a given query. We implement and evaluate HEXAR on a TIAGo robot performing assistive tasks in a home environment, comparing it against end-to-end and aggregated baseline approaches across 180 scenario-query variations. We observe that HEXAR significantly outperforms baselines in root cause identification, incorrect information exclusion, and runtime, offering a promising direction for transparent autonomous systems.
|
| |
| 16:55-17:05, Paper TuBT1.2 | Add to My Program |
| Uncertainty Comes for Free: Human-In-The-Loop Policies with Diffusion Models |
|
| He, Zhanpeng | Columbia University |
| Cao, Yifeng | Columbia University |
| Ciocarlie, Matei | Columbia University |
Keywords: Human Factors and Human-in-the-Loop, Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Human-in-the-loop robot deployment has gained significant attention in both academia and industry as a semi-autonomous paradigm that enables human operators to intervene and adjust robot behaviors at deployment time, improving success rates. However, continuous human monitoring and intervention can be highly labor-intensive and impractical when deploying a large number of robots. To address this limitation, we propose a method that allows diffusion policies to actively seek human assistance only when necessary, reducing reliance on constant human oversight. To achieve this, we leverage the generative process of diffusion policies to compute an uncertainty-based metric based on which the autonomous agent can decide to request operator assistance at deployment time, without requiring any operator interaction during training. Additionally, we show that the same method can be used for efficient data collection for fine-tuning diffusion policies in order to improve their autonomous performance. Experimental results from simulated and real-world environments demonstrate that our approach enhances policy performance during deployment for a variety of scenarios.
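The core idea above, deciding at deployment time whether to request operator assistance from an uncertainty signal, can be sketched with a generic dispersion-based proxy. The interface is hypothetical; the paper derives its metric from the diffusion generative process itself rather than from raw sample spread:

```python
import numpy as np

def needs_operator(sample_actions, threshold):
    """Request human help when the policy's sampled action trajectories
    disagree. sample_actions: callable returning an array of shape
    (K, horizon, action_dim), e.g. K denoised rollouts from a stochastic
    policy. Spread is the mean per-dimension standard deviation across K."""
    batch = np.asarray(sample_actions())
    spread = batch.std(axis=0).mean()
    return bool(spread > threshold)
```

When the sampled trajectories agree (low spread), the robot proceeds autonomously; high spread triggers an assistance request, so the operator only monitors the uncertain fraction of episodes.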
|
| |
| 17:05-17:15, Paper TuBT1.3 | Add to My Program |
| SA-VLM V2: Useful, Comprehensive, and Concise Guidance for Guide-Dog Robots Assisting the Visually Impaired |
|
| Yun, Woo-han | Electronics and Telecommunications Research Institute (ETRI) |
| Shin, JaeHo | University of Science and Technology, Daejeon, Republic of Korea |
| Seo, BeomSu | ETRI |
| Kim, Jaehong | ETRI |
| Han, ByungOk | ETRI |
Keywords: Multi-Modal Perception for HRI, Robot Companions, Social HRI
Abstract: The development of guide dog robots is expected to enhance the mobility and safety of visually impaired individuals outdoors. To assist these users in real-world navigation, walking guidance should be useful, comprehensive, and concise so that instructions are both actionable and easy to follow. While recent VLMs show promising capabilities in scene understanding, existing approaches do not address the effective delivery of guidance for visually impaired users. In this work, we propose SA-VLMv2 (Space-Aware VLM), a model designed to generate useful, comprehensive, and concise walking guidance based on ego-centric scenes and target destinations. To this end, we first derived four canonical templates for walking guidance through user evaluation with professional guide dog trainers across diverse images, providing insights into preferred guidance formats. We then collected, manually annotated, and curated a dataset of 19,945 samples aligned with these templates and trained SA-VLMv2 from the open-sourced VLM, Qwen2.5VL. Experimental results show that SA-VLMv2 outperforms state-of-the-art proprietary MLLMs (Claude 3.5 Sonnet, Gemini 2.5, GPT-4o) and the open-sourced pretrained VLM (Qwen2.5VL) in both holistic and factor-wise evaluations. SA-VLMv2 generated more concise yet informative guidance while achieving higher scores across multiple evaluation factors.
|
| |
| 17:15-17:25, Paper TuBT1.4 | Add to My Program |
| IMR-LLM: Industrial Multi-Robot Task Planning and Program Generation Using Large Language Models |
|
| Su, Xiangyu | Shenzhen University |
| Xu, Juzhan | Shenzhen University |
| van Kaick, Oliver | Carleton University |
| Xu, Kai | Institute of AI for Industries, Chinese Academy of Sciences |
| Hu, Ruizhen | Shenzhen University |
Keywords: Industrial Robots, Task and Motion Planning, Multi-Robot Systems
Abstract: In modern industrial production, multiple robots often collaborate to complete complex manufacturing tasks. Large language models (LLMs), with their strong reasoning capabilities, have shown potential in coordinating robots for simple household and manipulation tasks. However, in industrial scenarios, stricter sequential constraints and more complex dependencies within tasks present new challenges for LLMs. To address this, we propose IMR-LLM, a novel LLM-driven Industrial Multi-Robot task planning and program generation framework. Specifically, we utilize LLMs to assist in constructing disjunctive graphs and employ deterministic solving methods to obtain a feasible and efficient high-level task plan. Based on this, we use a process tree to guide LLMs to generate executable low-level programs. Additionally, we create IMR-Bench, a challenging benchmark that encompasses multi-robot industrial tasks across three levels of complexity. Experimental results indicate that our method significantly surpasses existing methods across all evaluation metrics.
|
| |
| 17:25-17:35, Paper TuBT1.5 | Add to My Program |
| ETac: A Lightweight and Efficient Tactile Simulation Framework for Learning Dexterous Manipulation |
|
| Xu, Zhe | Shanghaitech University |
| Zhao, Feiyu | ShanghaiTech University |
| Huang, Xiyan | ShanghaiTech University |
| Xiao, Chenxi | ShanghaiTech University |
Keywords: Force and Tactile Sensing, Contact Modeling, Simulation and Animation
Abstract: Tactile sensors are increasingly integrated into dexterous robotic manipulators to enhance contact perception. However, learning manipulation policies that rely on tactile sensing remains challenging, primarily due to the trade-off between fidelity and computational cost of soft-body simulations. To address this, we present ETac, a tactile simulation framework that models elastomeric soft-body interactions with both high fidelity and efficiency. ETac employs a lightweight data-driven deformation propagation model to capture soft-body contact dynamics, achieving high simulation quality and efficiency that enable large-scale policy training. When serving as the simulation backend, ETac produces surface deformation estimates comparable to FEM and demonstrates applicability for modeling real tactile sensors. Then, we showcase its capability in training a blind grasping policy that leverages large-area tactile feedback to manipulate diverse objects. Running on a single RTX 4090 GPU, ETac supports reinforcement learning across 4,096 parallel environments, achieving a total throughput of 869 FPS. The resulting policy reaches an average success rate of 84.45% across four object types, underscoring ETac's potential to make tactile-based skill learning both efficient and scalable.
|
| |
| 17:35-17:45, Paper TuBT1.6 | Add to My Program |
| Ro-To-Go! Robust Reactive Control with Signal Temporal Logic |
|
| Ilyes, Roland | University of Oxford |
| Brudermüller, Lara | University of Oxford |
| Hawes, Nick | University of Oxford |
| Lacerda, Bruno | University of Oxford |
Keywords: Formal Methods in Robotics and Automation, Hybrid Logical/Dynamical Planning and Verification, Optimization and Optimal Control
Abstract: Signal Temporal Logic robustness is a common objective for optimal robot control, but its dependence on history limits the robot's decision-making capabilities when used in model predictive control approaches. In this work, we introduce Signal Temporal Logic robustness-to-go, a new quantitative semantics for the logic that isolates the contributions of suffix trajectories. We prove its relationship to formula progression for Metric Temporal Logic, and show that the robustness-to-go depends only on the suffix trajectory and progressed formula. We implement robustness-to-go as the objective in a model predictive control algorithm and use formula progression to efficiently evaluate it online. We test the algorithm in simulation and compare it to model predictive control using other robustness measures. Our experiments show that using robustness-to-go improves performance compared to using traditional robustness.
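For context, the standard (history-dependent) quantitative semantics that robustness-to-go refines is easy to state over a discrete-time signal. A minimal sketch for the always/eventually fragment, using min/max over the time window:

```python
def rob_pred(signal, c):
    """Robustness trace of the predicate x > c: rho(t) = x(t) - c."""
    return [x - c for x in signal]

def rob_always(trace, a, b):
    """rho(G_[a,b] phi, 0) = min of rho(phi, t) over t in [a, b]."""
    return min(trace[a:b + 1])

def rob_eventually(trace, a, b):
    """rho(F_[a,b] phi, 0) = max of rho(phi, t) over t in [a, b]."""
    return max(trace[a:b + 1])
```

Because these operators fold over the entire window, robustness at time t depends on samples already in the past; isolating the contribution of the remaining (suffix) trajectory is exactly what the paper's robustness-to-go semantics, combined with formula progression, provides for receding-horizon control.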
|
| |
| 17:45-17:55, Paper TuBT1.7 | Add to My Program |
| LASER: Level-Based Asynchronous Scheduling and Execution Regime for Spatiotemporally Constrained Multi-Robot Timber Manufacturing |
|
| Huang, Zhenxiang | University of Stuttgart |
| Skoury, Lior | University of Stuttgart |
| Stark, Tim | University of Stuttgart |
| Wagner, Aaron | University of Stuttgart |
| Wagner, Hans-Jakob | University of Stuttgart |
| Wortmann, Thomas | University of Stuttgart |
| Menges, Achim | University of Stuttgart |
Keywords: Intelligent and Flexible Manufacturing, Planning, Scheduling and Coordination, Multi-Robot Systems
Abstract: Automating large-scale manufacturing in domains like timber construction requires multi-robot systems to manage tightly coupled spatiotemporal constraints, such as collision avoidance and process-driven deadlines. This paper introduces LASER (Level-based Asynchronous Scheduling and Execution Regime), a complete framework for scheduling and executing complex assembly tasks, demonstrated on a screw-press gluing application for timber slab manufacturing. Our central contribution is to integrate a barrier-based mechanism into a constraint programming (CP) scheduling formulation that partitions tasks into spatiotemporally disjoint sets, which we define as levels. This structure enables robots to execute tasks in parallel and asynchronously within a level, synchronizing only at level barriers, which guarantees collision-free operation by construction and provides robustness to timing uncertainties. To solve this formulation for large problems, we propose two specialized algorithms: an iterative temporal-relaxation approach for heterogeneous task sequences and a bi-level decomposition for homogeneous tasks that balances workload. We validate the LASER framework by fabricating a full-scale 2.4m x 6m timber slab with a two-robot system mounted on parallel linear tracks, successfully coordinating 108 subroutines and 352 screws under tight adhesive time windows. Computational studies show our method scales steadily with size compared to a monolithic approach.
|
| |
| 17:55-18:05, Paper TuBT1.8 | Add to My Program |
| LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction |
|
| Chen, Ziyu | University of Science and Technology of China |
| Zhu, Fan | University of Science and Technology of China |
| Zhu, Hui | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
| Kong, Deyi | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
| Kuang, Xinkai | University of Science and Technology of China |
| Zhang, Yujia | University of Science and Technology of China |
| Jiang, Chunmao | University of Science and Technology of China |
Keywords: Computer Vision for Automation, Mapping, Sensor Fusion
Abstract: Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.
|
| |
| 18:05-18:15, Paper TuBT1.9 | Add to My Program |
| KISS-IMU: Self-Supervised Inertial Odometry with Motion-Balanced Learning and Uncertainty-Aware Inference |
|
| Choi, Jiwon | Inha University |
| Kim, Hogyun | Inha University |
| Yang, Geonmo | Inha University |
| Lee, Juhui | Inha University |
| Cho, Younggun | Inha University |
Keywords: Field Robots, Deep Learning Methods, Localization
Abstract: Inertial measurement units (IMUs), which provide high-frequency linear acceleration and angular velocity measurements, serve as fundamental sensing modalities in robotic systems. Recent advances in deep neural networks have led to remarkable progress in inertial odometry. However, the heavy reliance on ground truth data during training fundamentally limits scalability and generalization to unseen and diverse environments. We propose KISS-IMU, a novel self-supervised inertial odometry framework that eliminates ground truth dependency by leveraging simple LiDAR-based ICP registration and pose graph optimization as a supervisory signal. Our approach embodies two key principles: keeping the IMU stable through motion-aware balanced training and keeping the IMU strong through uncertainty-driven adaptive weighting during inference. To evaluate performance across diverse motion patterns and scenarios, we conducted comprehensive experiments on various real-world platforms, including quadruped robots. Importantly, we train only the IMU network in a self-supervised manner, with LiDAR serving solely as a lightweight supervisory signal rather than requiring additional learnable processes. This design enables the framework to ensure robustness without relying on joint multimodal learning or ground truth supervision. The supplementary materials are available at https://sparolab.github.io/research/kiss_imu.
|
| |
| TuBT2 Regular Session, Hall A3 |
Add to My Program |
| Control and Optimization |
|
| |
| Chair: Franchi, Antonio | University of Twente / Sapienza University of Rome |
| Co-Chair: Li, Longqiu | Harbin Institute of Technology |
| |
| 16:45-16:55, Paper TuBT2.1 | Add to My Program |
| Closed-Loop Trajectory Optimization of Deformable Linear Objects for Dynamic Motions |
|
| Klankers, Marc Kilian | Technische Universität Braunschweig |
| Steil, Jochen J. | Technische Universität Braunschweig |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Motion Control
Abstract: Tracking dynamic endpoint trajectories of deformable linear objects (DLOs) with a robotic manipulator remains challenging due to their complex non-linear behavior. While closed-loop Model Predictive Control (MPC) can account for these non-linearities, it requires an accurate dynamic model and precise state estimation. This paper introduces a closed-loop approach for controlling a DLO's endpoint to track dynamic 2D shapes. We model the DLO as a floating-base kinematic chain and present a new perspective on learning its dynamics using a data-driven approximation of its hybrid dynamics. Based on this model, we formulate an Optimal Control Problem (OCP), which we solve within the control loop using both linear MPC and DDP. We validate our approach with simulation and hardware experiments, demonstrating its ability to track dynamic endpoint motions.
|
| |
| 16:55-17:05, Paper TuBT2.2 | Add to My Program |
| Integrated Planning and Control on Manifolds: Factor Graph Representation and Toolkit |
|
| Yang, Peiwen | The Hong Kong Polytechnic University (PolyU) |
| Wen, Weisong | Hong Kong Polytechnic University |
| Yang, Runqiu | The Hong Kong Polytechnic University |
| Zhang, Yuanyuan | The Hong Kong Polytechnic University |
| Hu, Jiahao | The Hong Kong Polytechnic University |
| Chen, Yingming | The Hong Kong Polytechnic University |
| Xiao, Naigui | The Hong Kong Polytechnic University |
| Zhao, Jiaqi | The Hong Kong Polytechnic University |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Collision Avoidance
Abstract: Model predictive control (MPC) faces significant limitations when applied to systems evolving on nonlinear manifolds, such as robotic attitude dynamics and constrained motion planning, where traditional Euclidean formulations struggle with singularities, over-parameterization, and poor convergence. To overcome these challenges, this paper introduces FactorMPC, a factor-graph based MPC toolkit that unifies system dynamics, constraints, and objectives into a modular, user-friendly, and efficient optimization structure. Our approach natively supports manifold-valued states with Gaussian uncertainties modeled in tangent spaces. By exploiting the sparsity and probabilistic structure of factor graphs, the toolkit achieves real-time performance even for high-dimensional systems with complex constraints. Velocity-extended on-manifold control barrier function (CBF)-based obstacle avoidance factors are derived for safety-critical applications. By bridging graphical models with safety-critical MPC, our work offers a scalable and geometrically consistent framework for integrated planning and control. Simulation and experimental results on a quadrotor platform demonstrate superior trajectory tracking and obstacle avoidance performance compared to baseline methods. To foster research reproducibility, we provide an open-source implementation offering plug-and-play factors.
|
| |
| 17:05-17:15, Paper TuBT2.3 | Add to My Program |
| Beyond Collision Cones: Dynamic Obstacle Avoidance for Nonholonomic Robots Via Dynamic Parabolic Control Barrier Functions |
|
| Park, Hun Kuk | University of Michigan |
| Kim, Taekyung | University of Michigan |
| Panagou, Dimitra | University of Michigan, Ann Arbor |
Keywords: Constrained Motion Planning, Nonholonomic Motion Planning, Collision Avoidance
Abstract: Control Barrier Functions (CBFs) are a powerful tool for ensuring the safety of autonomous systems, yet applying them to nonholonomic robots in cluttered, dynamic environments remains an open challenge. State-of-the-art methods often rely on collision-cone or velocity-obstacle constraints which, by only considering the angle of the relative velocity, are inherently conservative and can render the CBF-based quadratic program infeasible, particularly in dense scenarios. To address this issue, we propose a Dynamic Parabolic Control Barrier Function (DPCBF) that defines the safe set using a parabolic boundary. The parabola's vertex and curvature dynamically adapt based on both the distance to an obstacle and the magnitude of the relative velocity, creating a less restrictive safety constraint. We prove that the proposed DPCBF is valid for a kinematic bicycle model subject to input constraints. Extensive comparative simulations demonstrate that our DPCBF-based controller significantly enhances navigation success rates and QP feasibility compared to baseline methods. Our approach successfully navigates through dense environments with up to 100 dynamic obstacles, scenarios where collision cone-based methods fail due to infeasibility.
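As background for readers, the standard CBF quadratic-program safety filter that such methods build on admits a closed-form solution for a single-integrator robot and one circular obstacle. The sketch below is a generic illustration of that baseline, not the paper's DPCBF (which replaces the circular barrier with a dynamic parabolic one); all function and variable names are illustrative.

```python
import numpy as np

def cbf_safety_filter(u_nom, x, obs, r_safe, alpha=1.0):
    """Minimal CBF-QP safety filter for a single integrator (x_dot = u).
    Barrier: h(x) = ||x - obs||^2 - r_safe^2, with h >= 0 meaning safe.
    The QP  min ||u - u_nom||^2  s.t.  grad_h(x) . u >= -alpha * h(x)
    reduces to projecting u_nom onto a half-space."""
    d = x - obs
    h = d @ d - r_safe ** 2        # barrier value
    grad_h = 2.0 * d               # gradient of h w.r.t. x
    slack = grad_h @ u_nom + alpha * h  # constraint residual
    if slack >= 0.0:
        return u_nom               # nominal input already satisfies the CBF
    # project onto the constraint boundary grad_h . u = -alpha * h
    return u_nom - slack * grad_h / (grad_h @ grad_h)

# Nominal control pushes straight at the obstacle; the filter slows it.
u = cbf_safety_filter(np.array([-5.0, 0.0]), np.array([2.0, 0.0]),
                      np.array([0.0, 0.0]), r_safe=1.0)
```

The closed form exists only because there is a single linear constraint; with many obstacles (the dense scenarios the paper targets), the QP must be solved numerically, and the feasibility issues the abstract describes arise from the intersection of many such half-spaces.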
|
| |
| 17:15-17:25, Paper TuBT2.4 | Add to My Program |
| Nonlinear Predictive Control of the Continuum and Hybrid Dynamics of a Suspended Deformable Cable for Aerial Pick and Place |
|
| Rapuano, Antonio | Sapienza University of Rome |
| Shen, Yaolei | University of Twente |
| Califano, Federico | University of Twente |
| Gabellieri, Chiara | University of Twente |
| Franchi, Antonio | University of Twente / Sapienza University of Rome |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Mobile Manipulation
Abstract: This paper presents a framework for aerial manipulation of an extensible cable that combines a high-fidelity model based on partial differential equations (PDEs) with a reduced-order representation suitable for real-time control. The PDEs are discretized using a finite-difference method, and proper orthogonal decomposition is employed to extract a reduced-order model (ROM) that retains the dominant deformation modes while significantly reducing computational complexity. Based on this ROM, a nonlinear model predictive control scheme is formulated, capable of stabilizing cable oscillations and handling hybrid transitions such as payload attachment and detachment. Simulation results confirm the stability, efficiency, and robustness of the ROM, as well as the effectiveness of the controller in regulating cable dynamics under a range of operating conditions. Additional simulations illustrate the application of the ROM for trajectory planning in constrained environments, demonstrating the versatility of the proposed approach. Overall, the framework enables real-time, dynamics-aware control of unmanned aerial vehicles (UAVs) carrying suspended flexible cables.
|
| |
| 17:25-17:35, Paper TuBT2.5 | Add to My Program |
| LPV-MPC for Lateral Control in Full-Scale Autonomous Racing |
|
| Jardali, Hassan | Indiana University |
| Mohamed, Ihab S. | Indiana University Bloomington |
| Pushp, Durgakant | Indiana University Bloomington |
| Liu, Lantao | Indiana University |
Keywords: Field Robots, Control Architectures and Programming, Autonomous Vehicle Navigation
Abstract: Autonomous racing has attracted significant attention recently, presenting challenges in selecting an optimal controller that operates within the onboard system's computational limits and meets operational constraints such as limited track time and high costs. This paper introduces a Linear Parameter-Varying Model Predictive Controller (LPV-MPC) for lateral control. Implemented on an IAC AV-24, the controller achieved stable performance at speeds exceeding 160 mph (71.5 m/s). We detail the controller design, the methodology for extracting model parameters, and key system-level and implementation considerations. Additionally, we report results from our final race run, providing a comprehensive analysis of both vehicle dynamics and controller performance. A Python implementation of the framework is available at: https://tinyurl.com/LPV-MPC-acados.
|
| |
| 17:35-17:45, Paper TuBT2.6 | Add to My Program |
| Field-Superimposed Control Magnetically Driving Nanorobot Swarms with Hybrid Rotating and Gradient Fields |
|
| Li, Chan | Beihang University |
| Zeng, Zijin | Beihang University |
| Niu, Wenyan | Beihang University |
| Ye, Jingwen | Beihang University |
| Huang, Shunxiao | Beihang University |
| Chen, Zaiyang | Beihang University |
| Sun, Hongyan | Beihang University |
| Guo, Yingjian | Beihang University |
| Feng, Lin | Beihang University |
Keywords: Micro/Nano Robots
Abstract: Magnetically actuated micro/nanorobot swarms have exhibited considerable promise for targeted biomedical delivery and localized therapies, attributed to their advantages of remote manipulation and robust penetration through biological tissues. However, achieving the simultaneous enhancement of both collective structural stability and efficient propulsion under a single-mode magnetic field remains a critical challenge. This paper presents a rotational–gradient superimposed magnetic actuation strategy that enables precise superposition of a uniform rotating field and a directional gradient magnetic field, using a tri-axial electromagnetic coil system. This approach significantly enhances the motility of micro/nanorobots while preserving their collective stability. Experimental results reveal that a co-directional gradient magnetic field can increase cluster velocity by 1.5-2 times without compromising cluster stability, while a counter-directional gradient magnetic field enables effective deceleration or anchoring of the swarm. Also, this paper elucidates the impact of the gradient magnetic field on swarm stability. Moreover, this paper demonstrates the formation of the chain-like structure of micro/nanorobots, which possess axial movement capability under the superimposed gradient magnetic field. This work provides a systematic theoretical and experimental foundation for multi-field synergistic actuation of micro/nanorobot swarms, and paves new paths for their application in biomedical micromanipulation.
|
| |
| 17:45-17:55, Paper TuBT2.7 | Add to My Program |
| Goal-Oriented Control Strategies for Soft Growing Robots |
|
| Huang, Wentao | Harbin Institute of Technology |
| Li, Pengchun | Harbin Institute of Technology |
| Zhang, Ziyi | Harbin Institute of Technology |
| Wang, Zuankai | The Hong Kong Polytechnic University |
| Zhou, Dekai | Harbin Institute of Technology |
| Li, Longqiu | Harbin Institute of Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Motion and Path Planning, Reinforcement Learning
Abstract: Soft growing robots, as highly mobile pneumatic membrane robots, are limited in control performance due to their soft structure and nonlinear mechanical properties, especially under dynamic conditions. Therefore, developing reliable control strategies for these robots is essential. This study proposes a dual-thread, goal-oriented control strategy for soft growing robots that combines planning and control. By integrating graph convolutional networks with deep reinforcement learning, the global path planning method is better suited to the self-growing behaviors of soft robots, leading to improvements in both computational efficiency and accuracy compared to inverse kinematics planning methods. The motion control reduces the adverse effects of deformation errors caused by the robot's low stiffness or by disturbances in the external environment. This strategy effectively combines reinforcement learning-based global planning with a multiple closed-loop motion control system, addressing the issues of low precision and reliability under dynamic conditions. Experimental results demonstrate that the robot achieves a tracking accuracy of 11.83 mm within a 5-meter range and successfully tracks and approaches a non-cooperative dynamic target. These results highlight the significant potential of the proposed approach in applications such as target capture and dynamic manipulation.
|
| |
| 17:55-18:05, Paper TuBT2.8 | Add to My Program |
| SDPRLayers: Certifiable Backpropagation through Polynomial Optimization Problems in Robotics |
|
| Holmes, Connor | University of Toronto |
| Dümbgen, Frederike | ENS, PSL University |
| Barfoot, Timothy | University of Toronto |
Keywords: Optimization and Optimal Control, Localization, Visual Learning, Differentiable Optimization
Abstract: A recent set of techniques in the robotics community, known as certifiably correct methods, frames robotics problems as polynomial optimization problems (POPs) and applies convex, semidefinite programming (SDP) relaxations to either find or certify their global optima. In parallel, differentiable optimization allows optimization problems to be embedded into end-to-end learning frameworks and has received considerable attention in the robotics community. In this paper, we consider the ill effect of convergence to spurious local minima in the context of learning frameworks that use differentiable optimization. We present SDPRLayers, an approach that seeks to address this issue by combining convex relaxations with implicit differentiation techniques to provide certifiably correct solutions and gradients throughout the training process. We provide theoretical results that outline conditions for the correctness of these gradients and provide efficient means for their computation. Our approach is first applied to two simple-but-demonstrative simulated examples, which expose the potential pitfalls of reliance on local optimization in existing, state-of-the-art, differentiable optimization methods.
|
| |
| 18:05-18:15, Paper TuBT2.9 | Add to My Program |
| Co-Optimizing Reconfigurable Environments and Policies for Decentralized Multi-Agent Navigation |
|
| Gao, Zhan | University of Cambridge |
| Yang, Guang | University of Cambridge |
| Prorok, Amanda | University of Cambridge |
Keywords: Multi-Robot Systems, Distributed Robot Systems, Optimization and Optimal Control, Co-Design
Abstract: This work views the multi-agent system and its surrounding environment as a co-evolving system. The goal is to take agent actions and environment configurations as decision variables, and optimize both in a coordinated manner. Towards this end, we consider the problem of decentralized multi-agent navigation in reconfigurable environments. By introducing two sub-objectives of multi-agent navigation and environment optimization, we propose an agent-environment co-optimization problem and develop a coordinated algorithm that alternates between sub-objectives to search for an optimal synthesis of agent actions and obstacle configurations; ultimately, improving navigation performance. Due to the challenge of modeling the relation between agents, environment and performance, we formulate a model-free learning mechanism within the coordinated framework. A formal convergence analysis shows that our coordinated algorithm tracks the local minimum trajectory of an associated time-varying non-convex optimization problem. Experiments evaluate the benefits of co-optimization and, interestingly, indicate that optimized environments offer structural guidance that is key to de-conflicting agents.
|
| |
| TuBT3 Regular Session, Lehar 1-4 |
Add to My Program |
| Manipulation and Locomotion |
|
| |
| |
| 16:45-16:55, Paper TuBT3.1 | Add to My Program |
| Ringbot Quad: A Monocycle Robot with Four Legs for Versatile Wheel-Leg Transformable Locomotion |
|
| Gim, Kevin | University of Illinois, Urbana-Champaign |
| Kim, Joohyung | University of Illinois Urbana-Champaign |
Keywords: Intelligent Transportation Systems, Hardware-Software Integration in Robotics, Mechanism Design
Abstract: Integrating multiple locomotion modes on a single platform has become an active focus in pursuit of versatile and efficient movement. This paper introduces a novel monocycle robot with four legs, named Ringbot Quad, combining the wheeled and legged mechanisms. The Ringbot Quad is developed as a unique monocycle mechanism that replaces the traditional monocycle drivetrain with four individually actuated driving modules, each topped with an articulated leg. The four legs can be used for balance and steering in driving mode and for quadruped walking that fully supports the body with the legs. By switching between two distinct locomotion modes, it can navigate various terrains and overcome obstacles typically challenging for either mechanism alone. In this work, we present a compact-scale Ringbot Quad prototype as a proof of concept for the proposed mechanism, demonstrating the feasibility of a new type of mobile robot.
|
| |
| 16:55-17:05, Paper TuBT3.2 | Add to My Program |
| EmbodiedCoder: Parameterized Embodied Mobile Manipulation Via Modern Coding Model |
|
| Lin, Zefu | University of Chinese Academy of Sciences |
| Cui, Rongxu | Beihang University |
| Hanning, Chen | University of Chinese Academy of Science |
| Wang, Xiangyu | University of Chinese Academy of Sciences |
| Xu, Junjia | Beihang University |
| Jin, Xiaojuan | Institute of Automation, Chinese Academy of Sciences |
| Wenbo, Chen | Institute of Automation, Chinese Academy of Sciences |
| Zhou, Hui | The Chinese University of Hongkong |
| Fan, Lue | Chinese Academy of Sciences |
| Li, Wenling | Beihang University |
| Zhang, Zhaoxiang | Chinese Academy of Sciences |
Keywords: Agent-Based Systems, AI-Based Methods, AI-Enabled Robotics
Abstract: Recent advances in robot control methods, from end-to-end vision-language-action frameworks to modular systems with predefined primitives, have advanced robots’ ability to follow natural language instructions. Nonetheless, many approaches still struggle to scale to diverse environments, as they often rely on large annotated datasets and offer limited interpretability. In this work, we introduce EmbodiedCoder, a training-free framework for open-world mobile robot manipulation that leverages coding models to directly generate executable robot trajectories. By grounding high-level instructions in code, EmbodiedCoder enables flexible object geometry parameterization and task trajectory synthesis without additional data collection or fine-tuning. This coding-based paradigm provides a transparent and generalizable way to connect perception with manipulation. Experiments on real mobile robots show that EmbodiedCoder achieves robust performance across diverse long-horizon tasks and generalizes effectively to unseen objects and environments. Our results demonstrate an interpretable approach for bridging high-level reasoning and low-level control, moving beyond fixed primitives toward versatile robot intelligence. See the project page at https://embodiedcoder.github.io/EmbodiedCoder/.
|
| |
| 17:05-17:15, Paper TuBT3.3 | Add to My Program |
| CoorGrasp: Coordinated Contact Control for Adaptive Dexterous Grasping under Uncertainty |
|
| Yu, Mingrui | Tsinghua University |
| Jiang, Yongpeng | Tsinghua University |
| Jia, Yongyi | Tsinghua University |
| Ren, Yi | Huawei Technologies |
| Li, Xiang | Tsinghua University |
Keywords: Grasping, Multifingered Hands, Dexterous Manipulation
Abstract: While recent research has focused heavily on dexterous grasp pose generation, less attention has been devoted to the execution of planned grasps. Under shape and position uncertainty, open-loop execution often yields uncoordinated contacts, causing undesired in-hand object motion and even grasp failures. To address this, this paper proposes a tactile-driven model predictive controller for adaptive and delicate execution of diverse dexterous grasps. Our approach emphasizes multi-contact coordination across both approaching and grasping phases, with three key novelties: (i) coordination-aware phase separation, (ii) arm–hand coordination to compensate for position errors, and (iii) adaptive force coordination to increase contact forces in a balanced manner. An analytical model is employed to relate contact forces to robot joint motions for predictive control. Our formulation imposes no restrictions on grasp types or contact configurations and integrates seamlessly with state-of-the-art grasp pose generation methods. We validate the approach through large-scale simulations involving 15k grasps across 400 objects on three robotic hands, and real-world experiments on eight objects. Results demonstrate that our method achieves higher grasp success rates and reduced undesired object movements. Supplementary materials are available at https://ada-grasp-ctrl.github.io/.
|
| |
| 17:15-17:25, Paper TuBT3.4 | Add to My Program |
| OpenHEART: Opening Heterogeneous Articulated Objects with a Legged Manipulator |
|
| Lim, Seonghyeon | KAIST |
| Lee, Hyeonwoo | KAIST (Korea Advanced Institute of Science and Technology) |
| Lee, Seunghyun | KAIST (Korea Advanced Institute of Science and Technology) |
| Nahrendra, I Made Aswin | KRAFTON |
| Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Whole-Body Motion Planning and Control, Legged Robots, Reinforcement Learning
Abstract: Legged manipulators offer high mobility and versatile manipulation. However, robust interaction with heterogeneous articulated objects, such as doors, drawers, and cabinets, remains challenging because of the diverse articulation types of the objects and the complex dynamics of the legged robot. Existing reinforcement learning (RL)-based approaches often rely on high-dimensional sensory inputs, leading to sample inefficiency. In this paper, we propose a robust and sample-efficient framework for opening heterogeneous articulated objects with a legged manipulator. In particular, we propose Sampling-based Abstracted Feature Extraction (SAFE), which encodes handle and panel geometry into a compact low-dimensional representation, improving cross-domain generalization. Additionally, Articulation Information Estimator (ArtIEst) is introduced to adaptively mix proprioception with exteroception to estimate opening direction and range of motion for each object. The proposed framework was deployed to manipulate various heterogeneous articulated objects in simulation and real-world robot systems. Videos can be found on the project website: https://openheart-icra.github.io/OpenHEART/
|
| |
| 17:25-17:35, Paper TuBT3.5 | Add to My Program |
| Learning Location-Specific Latent Behavior Priors for Occupancy Prediction in Automated Driving |
|
| Schmidt, Julian | Mercedes-Benz AG |
| Ruoff, Mario | Mercedes-Benz AG, University of Stuttgart |
| Rist, Christoph | Daimler |
| Jordan, Julian | Mercedes-Benz AG |
Keywords: AI-Based Methods, Imitation Learning, Semantic Scene Understanding
Abstract: Performance in automated driving tasks improves significantly with the incorporation of location-specific prior knowledge. This is because agent behavior usually strongly correlates with location features. A common example is the strong tendency of vehicles to follow their lane, but less obvious interactions exist as well. To this end, high definition (HD) map information is typically collected and made available during both training and inference to act as a location prior. In this paper, we propose to aggregate location-specific information in a data-driven way. Specifically, we learn a global latent grid that acts as a behavior prior to a learned occupancy prediction model. Since the prediction loss function is directly backpropagated into the latent grid, no additional labels are required beyond the already available future agent locations. We use the large real-world Lyft Level 5 motion prediction dataset to empirically demonstrate the merit of our learned location-specific latent behavior prior. Applied to two different prediction models, our approach achieves performance comparable to or exceeding baseline models that rely on HD maps, without requiring an HD map. Additional experiments reveal that the latent behavior prior is able to distill geometric and semantic information purely from agent behavior. These results indicate that directly learning location-specific priors is a promising direction towards automated driving without costly HD maps.
|
| |
| 17:35-17:45, Paper TuBT3.6 | Add to My Program |
| WiXus: A Wheeled-Legged Robot with Wire-Driven Environmental Utilizing to Integrate Mobility and Manipulation |
|
| Inoue, Shintaro | The University of Tokyo |
| Kawaharazuka, Kento | The University of Tokyo |
| Suzuki, Temma | The University of Tokyo |
| Yuzaki, Sota | The University of Tokyo |
| Okada, Kei | The University of Tokyo |
Keywords: Tendon/Wire Mechanism, Legged Robots, Wheeled Robots
Abstract: Wheeled-legged robots, which have wheels at their feet and achieve high mobility by coordinating wheel drive and leg drive, have been developed. These robots have been developed purely as platforms specialized for locomotion. Therefore, they do not have a means to repurpose their legs for roles other than locomotion, such as object manipulation or tool utilization. In this paper, we address the problem of how to draw out the potential task-execution capability of the legs by freeing them from the roles of locomotion through external body support. To this end, we propose and develop a new robot, WiXus, which fuses a wheeled-legged mechanism with a wire-driven mechanism that utilizes the external environment. The developed WiXus demonstrates not only planar locomotion with wheeled-legged drive, but also three-dimensional mobility such as cliff climbing by coordinating wire-driven and wheeled-legged actuation. Furthermore, by suspending the body with wire-driven actuation, WiXus successfully repurposes its legs as arms to perform object manipulation (e.g., rescuing a dog (stuffed animal)) and tool utilization (e.g., harvesting an apple (mockup) with loppers). This study demonstrates that the approach of utilizing the environment with wire-driven actuation is a new design principle that extends the operational domain of wheeled-legged robots.
|
| |
| 17:45-17:55, Paper TuBT3.7 | Add to My Program |
| Colonial Architectures for Centimeter-Scale Underwater Robot Swarms |
|
| Spino, Pascal | Massachusetts Institute of Technology |
| Bäckert, Marc | MIT |
| Yin, Lianhao | MIT |
| Rus, Daniela | MIT |
Keywords: Cellular and Modular Robots, Swarm Robotics, Marine Robotics
Abstract: Micro underwater robots offer scalable and low-cost access to environments that are difficult to study with conventional vehicles, but severe communication constraints, limited onboard power, and low swimming speed restrict the capability of these miniature systems. Inspired by colonial organisms such as salps and siphonophores, this work explores physically connected swarms of small underwater robots that form larger structures with improved collective performance. We present a modular platform of centimeter-scale robots capable of three-dimensional propulsion, onboard sensing, and autonomous behavior, with magnetic interfaces that enable reversible connections into prescribed morphologies. Experiments demonstrate autonomous assembly and disassembly, quantify the propulsive benefits of chain aggregates, and show that mechanically coupled robots can distribute sensing, actuation, and control across the collective. Results indicate that certain colonial architectures can greatly improve swimming speed, locomotion efficiency, and task performance compared to individual robots, suggesting a path toward more capable centimeter-scale underwater swarms.
|
| |
| 17:55-18:05, Paper TuBT3.8 | Add to My Program |
| CLF-RL: Control Lyapunov Function Guided Reinforcement Learning |
|
| Li, Kejun | California Institute of Technology |
| Olkin, Zachary | California Institute of Technology |
| Yue, Yisong | California Institute of Technology |
| Ames, Aaron | Caltech |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning
Abstract: Reinforcement learning (RL) has shown promise in generating robust locomotion policies for bipedal robots, but often suffers from tedious reward design and sensitivity to poorly shaped objectives. In this work, we propose a structured reward shaping framework that leverages model-based trajectory generation and control Lyapunov functions (CLFs) to guide policy learning. We explore two model-based planners for generating reference trajectories: a reduced-order linear inverted pendulum (LIP) model for velocity-conditioned motion planning, and a precomputed gait library based on hybrid zero dynamics (HZD) using full-order dynamics. These planners define desired end-effector and joint trajectories, which are used to construct CLF-based rewards that penalize tracking error and encourage rapid convergence. This formulation provides meaningful intermediate rewards, and is straightforward to implement once a reference is available. Both the reference trajectories and CLF shaping are used only during training, resulting in a lightweight policy at deployment. We validate our method both in simulation and through extensive real-world experiments on a Unitree G1 robot. CLF-RL demonstrates significantly improved robustness relative to the baseline RL policy and better performance than a classic tracking reward RL formulation.
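As background, the idea of a CLF-shaped tracking reward can be illustrated with a quadratic Lyapunov function on the tracking error: penalize the error energy and any violation of the exponential decrease condition. The sketch below is a generic illustration of this shaping idea under assumed parameters, not the paper's exact reward formulation; the function name, weights, and rates are hypothetical.

```python
import numpy as np

def clf_reward(e, e_next, P, lam=2.0, dt=0.02, w_violation=0.1):
    """Generic CLF-shaped reward for tracking a reference.
    V(e) = e^T P e is the candidate CLF on the tracking error e.
    The CLF decrease condition V_dot + lam * V <= 0 is encouraged by
    penalizing its positive part, alongside the tracking error itself."""
    V = e @ P @ e
    V_next = e_next @ P @ e_next
    V_dot = (V_next - V) / dt            # finite-difference estimate
    violation = max(0.0, V_dot + lam * V)
    return -V - w_violation * violation

# An error that shrinks fast enough incurs no violation penalty.
P = np.eye(2)
r = clf_reward(np.array([1.0, 0.0]), np.array([0.9, 0.0]), P)
```

The appeal noted in the abstract is that such a term gives a meaningful intermediate signal at every step: transitions that move the error toward zero at the prescribed rate are rewarded more than those that merely drift.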
|
| |
| 18:05-18:15, Paper TuBT3.9 | Add to My Program |
| Differentiable Contact Dynamics for Stable Object Placement under Geometric Uncertainties |
|
| Li, Linfeng | National University of Singapore |
| Yang, Gang | National University of Singapore |
| Shao, Lin | National University of Singapore |
| Hsu, David | National University of Singapore |
Keywords: Manipulation Planning, Model Learning for Control, Contact Modeling
Abstract: From stacking a tower of blocks to serving a cup of coffee, stable object placement is a crucial skill for future robots. It becomes particularly challenging under geometric uncertainties, e.g., when the object pose or shape is not known accurately. This work leverages a differentiable simulation model of contact dynamics to tackle this challenge. We derive a novel gradient that relates force-torque sensor readings to geometric uncertainties, thus enabling uncertainty estimation by minimizing discrepancies between sensor data and model predictions via gradient descent. Gradient-based methods are sensitive to initialization. To mitigate this effect, we maintain a belief over multiple estimates and choose the robot action based on the current belief at each time step. In experiments on a Franka robot arm, our method achieved promising results on multiple objects under various geometric uncertainties, including the in-hand pose uncertainty of a grasped object, the object shape uncertainty, and the environment uncertainty.
|
| |
| TuBT4 Regular Session, Strauss 1-2 |
Add to My Program |
| Multi-Robot Systems |
|
| |
| Chair: Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
| |
| 16:45-16:55, Paper TuBT4.1 | Add to My Program |
| Morphogenetic Assembly and Adaptive Control for Heterogeneous Modular Robots |
|
| Meng, Chongxi | Tongji University |
| Zhao, Da | The Chinese University of Hong Kong |
| Zhao, Yifei | Tongji University |
| Zeng, Minghao | Tongji University |
| Zhou, Yanmin | Tongji University |
| Wang, Zhipeng | Tongji University |
| He, Bin | Tongji University |
Keywords: Cellular and Modular Robots, Assembly
Abstract: This paper presents a closed-loop automation framework for heterogeneous modular robots, encompassing the entire pipeline from morphological construction to adaptive control. Within this framework, a mobile manipulator manipulates heterogeneous functional modules—including structural, joint, and wheeled modules—to dynamically assemble diverse robot configurations and grant them immediate locomotion capabilities. To address the state-space explosion inherent in large-scale heterogeneous reconfiguration, we propose a hierarchical planner: the high-level planner employs a bi-directional heuristic search with type penalty terms to generate module-handling sequences, while the low-level planner utilizes A* search to compute optimal execution trajectories. This approach effectively decouples discrete configuration planning from continuous motion execution. For adaptive motion generation of unknown assembled configurations, we introduce a GPU-accelerated Annealing Variance Model Predictive Path Integral (MPPI) controller. By incorporating a multi-stage variance annealing strategy to balance global exploration and local convergence, the controller achieves configuration-agnostic, real-time motion control. Large-scale simulations demonstrate that the type penalty term is crucial for planning robustness in heterogeneous scenarios. Furthermore, the greedy heuristic generates plans with lower physical execution costs compared to the Hungarian heuristic. The proposed Annealing-variance MPPI significantly outperforms standard MPPI in both velocity tracking accuracy and control frequency, achieving real-time control at 50 Hz. The framework successfully validates the full-cycle process, including module assembly, robot merging and splitting, and dynamic motion generation.
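As background, a single update of the standard MPPI algorithm that the paper's annealing-variance controller extends can be sketched in a few lines: sample perturbed control sequences, roll them out through the dynamics, and average the perturbations with softmin weights on the accumulated cost. This is a generic sketch with illustrative names and a fixed (non-annealed) variance, not the paper's implementation.

```python
import numpy as np

def mppi_step(x0, U, dynamics, cost, n_samples=256, sigma=0.5, lam=1.0,
              rng=None):
    """One iteration of vanilla MPPI.
    U: (H, du) nominal control sequence; returns an updated sequence."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, du = U.shape
    noise = rng.normal(0.0, sigma, size=(n_samples, H, du))
    costs = np.zeros(n_samples)
    for s in range(n_samples):          # roll out each perturbed sequence
        x = x0.copy()
        for t in range(H):
            x = dynamics(x, U[t] + noise[s, t])
            costs[s] += cost(x)
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()                        # softmin weights over samples
    return U + np.tensordot(w, noise, axes=1)  # cost-weighted noise average

# Toy single integrator regulated toward the origin.
dyn = lambda x, u: x + 0.1 * u
stage_cost = lambda x: float(x @ x)
U1 = mppi_step(np.array([1.0]), np.zeros((5, 1)), dyn, stage_cost)
```

The paper's variant additionally schedules the sampling variance (`sigma` here) over multiple stages per control cycle to trade global exploration against local convergence; the single-shot version above is the common baseline it is compared against.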
|
| |
| 16:55-17:05, Paper TuBT4.2 | Add to My Program |
| An Approximate Set Membership Approach to Resilient Multi-Robot Communication |
|
| Smith, Nicholas | University of Technology Sydney |
| Chung, Jen Jen | The University of Queensland |
| Best, Graeme | University of Technology Sydney |
Keywords: Multi-Robot Systems, Networked Robots, Cooperating Robots
Abstract: Effective communication is critical for coordinating multi-robot teams, yet practical deployments often face severe bandwidth constraints and frequent message loss. This paper presents a communication protocol that leverages Bloom filters to enable efficient, approximate set membership queries in multi-robot systems. Bloom filters offer a tunable trade-off between false positive rate and memory footprint, making them well suited for bandwidth-limited communication. To mitigate the effects of false positives, we introduce a salting strategy that decorrelates Bloom filters and enables stacking: the combination of membership queries across multiple filters. These stacked results are incorporated into each robot's belief map, such that only sufficiently corroborated information influences frontier generation and exploration planning. We evaluate our proposed communication protocol in a multi-robot exploration task, where robots share information about their observed cells to enable efficient coverage. Our results demonstrate that compared to exact methods, our Bloom filter-based protocol reduces communication cost by up to 6 times while maintaining team exploration performance, even under severe communication dropouts.
|
| |
| 17:05-17:15, Paper TuBT4.3 | Add to My Program |
| CLOT: Multi-Robot Motion Planning Via Collaborative Optimal Transport under Signal Temporal Logic Tasks |
|
| Zhang, Ying | Peking University |
| Zhang, Yunyi | Peking University |
| Le, An Thai | Vin University |
| Guo, Meng | Peking University |
Keywords: Multi-Robot Systems, Formal Methods in Robotics and Automation, Motion and Path Planning
Abstract: Multi-robot systems often need to navigate through obstacle-cluttered environments while performing complex tasks. Ensuring collision-free trajectories among the robots and with the obstacles is essential for overall safety, along with additional requirements such as dynamic feasibility, relative formation, connectivity maintenance and temporal tasks. Existing work mostly focuses on the design of analytical controllers that encapsulate all these constraints, which often suffer from undesired local minima due to conflicting non-convex objectives. This work proposes a novel motion planning scheme for multi-robot systems under various safety and high-level tasks, specified as signal temporal logic (STL) formulas over collective states such as collision avoidance, relative formation and connectivity maintenance. A gradient-free method called collaborative optimal transport (CLOT) is proposed that optimizes batches of system-wide smooth trajectories over highly nonlinear costs handled through the zero-order Sinkhorn-Knopp step. Via parallel computation on GPUs, it is shown to significantly improve the scalability from a few robots to over 100 robots, with an average planning time of a few seconds. Lastly, its applicability is extensively demonstrated both in simulation and hardware, over complex environments and high-level temporal tasks.
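The zero-order Sinkhorn-Knopp step mentioned above can be illustrated in isolation: alternating row and column scaling turns an exponentiated cost matrix into a near doubly-stochastic transport plan without any gradients. This is a generic sketch of the Sinkhorn-Knopp procedure, not the paper's full CLOT trajectory-batch machinery; the regularization strength and iteration count are assumed values.

```python
import math

def sinkhorn(cost, n_iters=200, eps=0.1):
    """Sinkhorn-Knopp on a square cost matrix: exponentiate -cost/eps,
    then alternately normalize rows and columns so the result approaches
    a doubly-stochastic transport plan (uniform marginals)."""
    n = len(cost)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    for _ in range(n_iters):
        # Row normalization: each row sums to 1.
        K = [[v / sum(row) for v in row] for row in K]
        # Column normalization: each column sums to 1.
        col = [sum(K[i][j] for i in range(n)) for j in range(n)]
        K = [[K[i][j] / col[j] for j in range(n)] for i in range(n)]
    return K
```

Low-cost pairings receive most of the transport mass, which is how a zero-order step can steer a batch of candidate trajectories toward low-cost assignments.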
|
| |
| 17:15-17:25, Paper TuBT4.4 | Add to My Program |
| Distributed First-Order and Second-Order Adaptive Hybrid Optimization for Multi-Robot Learning |
|
| Zhang, Yilun | Swansea University |
| Xie, Xianghua | Swansea University |
| Zhang, Lu | Loughborough University London |
Keywords: Optimization and Optimal Control, Multi-Robot Systems, Deep Learning Methods
Abstract: We present a distributed first-order and second-order adaptive hybrid optimization algorithm (DAHO) for multi-robot systems. A team of robots collaboratively trains a shared deep neural network using only local data while exchanging model updates via peer-to-peer robot communication. Raw data never leaves the device, which preserves privacy and conserves communication bandwidth. The method blends a second-order limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method with an alternating direction method of multipliers (ADMM) based first-order method to obtain both the fast convergence of second-order methods and the robustness of first-order schemes. An automatic switching policy, guided by a convergence analysis rooted in trust region theory, selects the update type at each round. A soft switch mechanism derived from the same analysis mitigates oscillations during mode changes. Compared with four single-method baselines that range from first-order to second-order optimization, the proposed hybrid approach achieves faster convergence, superior accuracy, and near-centralized performance on robotics-related deep learning tasks.
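The soft switch mechanism can be sketched generically: a trust-region ratio (actual vs. predicted loss reduction) drives a blend weight between the first-order and second-order steps, with a linear ramp instead of a hard threshold to avoid oscillation. The thresholds and linear ramp here are illustrative assumptions, not the paper's analysis.

```python
def switch_weight(rho, low=0.25, high=0.75):
    """Soft switch driven by the trust-region ratio
    rho = actual reduction / predicted reduction.
    Returns the blend weight on the second-order step, in [0, 1]."""
    if rho <= low:
        return 0.0  # model untrustworthy: take a pure first-order step
    if rho >= high:
        return 1.0  # model accurate: take a pure second-order step
    return (rho - low) / (high - low)  # linear ramp mitigates oscillation

def blended_step(first_order_step, second_order_step, rho):
    """Convex combination of the two candidate parameter updates."""
    w = switch_weight(rho)
    return [(1 - w) * f + w * s
            for f, s in zip(first_order_step, second_order_step)]
```

A hard switch at a single threshold can chatter between modes when rho hovers near it; the ramp makes the transition continuous, which is the role the abstract attributes to the soft switch.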
|
| |
| 17:25-17:35, Paper TuBT4.5 | Add to My Program |
| UDON: Uncertainty-Weighted Distributed Optimization for Multi-Robot Neural Implicit Mapping under Extreme Communication Constraints |
|
| Zhao, Hongrui | University of Illinois Urbana-Champaign |
| Zhou, Xunlan | Nanjing University |
| Ivanovic, Boris | NVIDIA |
| Mehr, Negar | University of California Berkeley |
Keywords: Multi-Robot Systems, Mapping, Multi-Robot SLAM
Abstract: Multi-robot mapping with neural implicit representations enables the compact reconstruction of complex environments. However, it demands robustness against communication challenges like packet loss and limited bandwidth. While prior works have introduced various mechanisms to mitigate communication disruptions, performance degradation still occurs under extremely low communication success rates. This paper presents UDON, a real-time multi-agent neural implicit mapping framework that introduces a novel uncertainty-weighted distributed optimization to achieve high-quality mapping under severe communication deterioration. The uncertainty weighting prioritizes more reliable portions of the map, while the distributed optimization isolates and penalizes mapping disagreement between individual pairs of communicating agents. We conduct extensive experiments on standard benchmark datasets and real-world robot hardware. We demonstrate that UDON significantly outperforms existing baselines, maintaining high-fidelity reconstructions and consistent scene representations even under extreme communication degradation (as low as 1% success rate). The code can be found at https://iconlab.negarmehr.com/UDON/
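The pairwise, uncertainty-weighted disagreement penalty described above can be sketched as a per-pair loss: each parameter's disagreement is weighted by the inverse of the two agents' combined uncertainty, so reliable map regions dominate the consensus term. The exact weighting form is an assumption for illustration, not UDON's actual formulation.

```python
def pairwise_disagreement_loss(params_a, params_b,
                               uncertainty_a, uncertainty_b, eps=1e-6):
    """Uncertainty-weighted penalty on disagreement between one pair of
    communicating agents' map parameters: low-uncertainty (reliable)
    entries are weighted more heavily; eps guards against division by
    zero."""
    loss = 0.0
    for pa, pb, ua, ub in zip(params_a, params_b,
                              uncertainty_a, uncertainty_b):
        w = 1.0 / (ua + ub + eps)  # inverse-uncertainty weight
        loss += w * (pa - pb) ** 2
    return loss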
|
| |
| 17:35-17:45, Paper TuBT4.6 | Add to My Program |
| RhoMorph: Rhombus-Shaped Modular Robots for Stable, Medium-Independent Reconfiguration Motion |
|
| Gu, Jie | Fudan University |
| Sun, Yirui | Fudan University |
| Xia, Zhihao | Fudan University |
| Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
| Tian, Chunxu | Fudan University |
| Zhang, Dan | The Hong Kong Polytechnic University |
Keywords: Cellular and Modular Robots
Abstract: In this paper, we present RhoMorph, a novel deformable planar lattice modular self-reconfigurable robot (MSRR) with a rhombus-shaped module. Each module consists of a parallelogram skeleton with a single centrally mounted actuator that enables folding and unfolding along its diagonal. The core design philosophy is to achieve essential MSRR functionalities such as morphing, docking, and locomotion with minimal control complexity. This enables a continuous and stable reconfiguration process that is independent of the surrounding medium, allowing the system to reliably form various configurations in diverse environments. To leverage the unique kinematics of RhoMorph, we introduce morph-pivoting, a novel motion primitive for reconfiguration that differs from those of advanced MSRR systems, and propose a strategy for its continuous execution. Finally, a series of physical experiments validate the module's stable reconfiguration ability, as well as its positional and docking accuracy.
|
| |
| 17:45-17:55, Paper TuBT4.7 | Add to My Program |
| Decentralized Reinforcement Learning for Multi-Agent Multi-Resource Allocation Via Dynamic Cluster Agreements |
|
| Marino, Antonio | University of Cambridge |
| Restrepo, Esteban | CNRS |
| Pacchierotti, Claudio | Centre National De La Recherche Scientifique (CNRS) |
| Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Keywords: Multi-Robot Systems, Reinforcement Learning, Swarm Robotics
Abstract: This paper addresses the challenge of allocating heterogeneous resources among multiple agents in a decentralized manner. Our proposed method, Liquid-Graph-Time Clustering-IPPO (LGTC-IPPO), builds upon Independent Proximal Policy Optimization (IPPO) by integrating dynamic cluster consensus, a mechanism that allows agents to form and adapt local sub-teams based on resource demands. This decentralized coordination strategy reduces reliance on global information and enhances scalability. We evaluate LGTC-IPPO against standard multi-agent reinforcement learning baselines and a centralized expert solution across a range of team sizes and resource distributions. Experimental results demonstrate that LGTC-IPPO achieves more stable rewards, better coordination, and robust performance even as the number of agents or resource types increases. Additionally, we illustrate how dynamic clustering enables agents to reallocate resources efficiently in scenarios with discharging resources.
|
| |
| 17:55-18:05, Paper TuBT4.8 | Add to My Program |
| Cooperative Payload Estimation by a Team of Mocobots |
|
| Zhang, Haoxuan | Northwestern University |
| Liu, C. Lin | Northwestern University |
| Elwin, Matthew | Northwestern University |
| Freeman, Randy | Northwestern Univ |
| Lynch, Kevin | Northwestern University |
Keywords: Multi-Robot Systems, Kinematics
Abstract: For high-performance autonomous manipulation of a payload by a mobile manipulator team, or for collaborative manipulation with a human, robots should be able to discover where other robots are attached to the payload, as well as the payload's mass and inertial properties. In this paper, we describe a method for the robots to autonomously discover this information. The robots cooperatively manipulate the payload, and the twist, twist derivative, and wrench data at their grasp frames are used to estimate the transformation matrices between the grasp frames, the location of the payload's center of mass, and the payload's inertia matrix. The method is validated experimentally with a team of three mobile cobots, or mocobots.
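The estimation principle can be illustrated with a deliberately simplified one-axis version: given matched net-force/acceleration samples, mass follows from a least-squares fit of F = m a, and a quasi-static center-of-mass offset follows from tau = d F. This toy sketch ignores gravity compensation, full 3D twists, and the inertia matrix that the paper's method actually estimates.

```python
def estimate_mass(accels, forces):
    """Least-squares mass estimate from matched net-force/acceleration
    samples along one axis (F = m a): m = sum(F*a) / sum(a*a)."""
    num = sum(f * a for f, a in zip(forces, accels))
    den = sum(a * a for a in accels)
    return num / den

def estimate_com_offset(torques, forces):
    """Least-squares quasi-static COM offset about one grasp axis
    (tau = d * F): d = sum(tau*F) / sum(F*F).  Sign convention assumed."""
    num = sum(t * f for t, f in zip(torques, forces))
    den = sum(f * f for f in forces)
    return num / den
```

In the paper's setting the same idea is applied to full twist/wrench data at every grasp frame, which additionally yields the grasp-to-grasp transforms and the payload inertia matrix.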
|
| |
| 18:05-18:15, Paper TuBT4.9 | Add to My Program |
| P3GASUS: Pre-Planned Path Execution Graphs for Multi-Agent Systems at Ultra-Large Scale |
|
| Duhan, Tanishq Harish | National University of Singapore |
| He, Chengyang | National University Singapore |
| Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Distributed Robot Systems, Software Architecture for Robotic and Automation
Abstract: Executing pre-planned paths in multi-agent systems is challenging, as a lack of synchronization can lead to collisions or live-/deadlocks, while enforcing strict synchronization may cause a widespread team delay in reaching goals. An Action Dependency Graph (ADG) solves this problem by identifying and synchronizing only the necessary robots during execution by post-processing all agents’ paths into a static directed graph with actions as nodes and edges representing the execution precedence order between actions. However, ADG's creation phase currently requires an exhaustive search of the action space that inflates both computation and communication (O(R^2 T^2), where R is the number of robots and T is the maximum path length). This makes ADG the bottleneck for current state-of-the-art MAPF planners, which can solve for up to 10000 agents, and lifelong MAPF, which needs constant replanning during execution. Moreover, this biquadratic scaling limits the extension of ADG to continuous space scenarios, where high-frequency updates typically result in long path lengths. Addressing these limitations, in this work, we propose three improved execution graphs to cater to different needs and scenarios: SAGE, which adds edges based on the sequence in which robots visit a position; MAGE, which takes the execution graph of SAGE as input and prunes its redundant edges, reducing communication burden during execution; and FORTED, which combines both approaches at a reduced complexity of O(RT), making it the overall best in discrete scenarios while trading off scalability to continuous-space scenarios. All three methods achieve 300- to 8000-fold speedups over ADG, with MAGE and FORTED reducing communication overhead by more than 14 times. By integr
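The SAGE idea, adding precedence edges based on the sequence in which robots visit a position, can be sketched directly: bucket every (robot, timestep) visit by cell, sort each cell's visits by time, and link consecutive visitors. This is an interpretation of the one-line description in the abstract, not the paper's actual construction; edge semantics and tie-breaking are assumptions. One pass over all R*T visits builds the buckets, and each cell's visit list is typically short, so the work stays near O(RT).

```python
def sage_edges(paths):
    """Sequence-based precedence edges for pre-planned paths.

    `paths[r]` is robot r's cell sequence, indexed by timestep.
    Returns edges ((r1, t1), (r2, t2)): robot r2 may execute its action
    at t2 on a cell only after robot r1 has completed its action at t1
    on the same cell."""
    visits = {}
    for r, path in enumerate(paths):
        for t, cell in enumerate(path):
            visits.setdefault(cell, []).append((t, r))
    edges = []
    for cell, vs in visits.items():
        vs.sort()  # order this cell's visits by timestep
        for (t1, r1), (t2, r2) in zip(vs, vs[1:]):
            if r1 != r2:  # a robot never needs to wait on itself
                edges.append(((r1, t1), (r2, t2)))
    return edges
```

For example, if robot 0 follows A→B and robot 1 follows B→C, only cell B is shared, yielding a single edge: robot 0 must wait for robot 1 to clear B before entering it.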
|
| |