Last updated on May 6, 2026. This conference program is tentative and subject to change.
Technical Program for Thursday, June 4, 2026
| ThI1I Interactive Session, Hall C |
| Interactive Session 5 |
| 09:00-10:30, Paper ThI1I.1 |
| Efficient Training Data Collection for Distance Sensor Arrays through Data Correction and Augmentation Approaches |
| Amagai, Sogo | The University of Tokyo |
| Warisawa, Shin'ichi | The University of Tokyo |
| Fukui, Rui | The University of Tokyo |
Keywords: Data Sets for Robot Learning, Transfer Learning
Abstract: Several machine learning (ML)-based measurement systems have been proposed to estimate difficult-to-measure quantities from the values of distance sensor arrays. However, variations in sensor output characteristics (OCs) can lead to degradation in the estimation accuracy when transferring training data acquired from the original acquisition sensors to new target sensors. Moreover, acquiring training data from target sensors is time and labor intensive. We propose two methods to convert previously collected training data to reflect different OCs, enabling their repeated use. For evaluation, we use a device that estimates the relative position and orientation of vehicles based on the values of distance sensor arrays. The correction approach for the training data based on the OC data reduces the root-mean-square error (RMSE) by up to 23% compared with transferring training data. The augmentation approach transforms the training data into data that include different OCs using a mapping function constructed from a small batch of training data. Furthermore, a method for collecting a small batch of training data to achieve a higher OC conversion accuracy is demonstrated. The RMSE is reduced by up to 58% by the proposed method compared with transferring training data. The results of this study demonstrate the feasibility of the practical applications of ML-based measurement systems using distance sensor arrays, which may facilitate the development of simple and fast calibration methods.
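For context on the kind of output-characteristic (OC) conversion described above, here is a minimal sketch assuming a simple per-sensor polynomial mapping fitted on a small calibration batch; the mapping form, degree, and readings are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def fit_oc_mapping(source_readings, target_readings, degree=2):
    """Fit a per-sensor polynomial mapping source-sensor outputs onto the
    output characteristics of the new target sensor."""
    return np.polyfit(source_readings, target_readings, degree)

def convert_training_data(training_data, coeffs):
    """Convert previously collected training data so it reflects the
    target sensor's output characteristics, enabling its reuse."""
    return np.polyval(coeffs, training_data)

# Small calibration batch: the same distances observed by both sensors.
src = np.array([0.12, 0.35, 0.61, 0.88, 1.20])   # original sensor outputs
tgt = np.array([0.10, 0.33, 0.57, 0.82, 1.11])   # target sensor outputs
coeffs = fit_oc_mapping(src, tgt)                 # map source OC -> target OC
converted = convert_training_data(np.linspace(0.1, 1.2, 50), coeffs)
```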
| 09:00-10:30, Paper ThI1I.2 |
| InstantPose: Zero-Shot Instance-Level 6D Pose Estimation from a Single View |
| Di Felice, Francesco | Mechanical Intelligence Institute, Sant'Anna School of Advanced Studies |
| Remus, Alberto | Sant'Anna School of Advanced Studies |
| Gasperini, Stefano | Technical University of Munich |
| Busam, Benjamin | Technical University of Munich |
| Ott, Lionel | ETH Zurich |
| Thalhammer, Stefan | TU Wien |
| Tombari, Federico | Technische Universität München |
| Avizzano, Carlo Alberto | Scuola Superiore Sant'Anna |
Keywords: RGB-D Perception, Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Object pose estimation using visual data is crucial for robotic interaction with the environment. Many existing instance-level methods are restricted by their requirements for 3D CAD models or multiple object views, which limits their flexibility and generalizability. Overcoming this limitation is critical to enhance the adaptability of pose estimation systems. In this work, a novel pipeline that leverages recent advances in reconstruction techniques is presented to address these challenges. To this end, Large Reconstruction Models (LRM) represent an advanced neural architecture capable of generating 3D object models from a limited set of views. Nevertheless, the resulting 3D models often lack relevant geometric and texture details due to insufficient input information. This research presents InstantPose, an innovative zero-shot instance-level pose estimation method that, building upon LRM, can determine the pose of unseen objects using as little as a single RGB-D query image. Extensive experiments demonstrate that InstantPose achieves remarkable performance in object pose estimation on the YCB-V dataset compared to methods that rely on a geometrically perfect object model. Furthermore, the 6D pose provided through the presented approach facilitates successful object grasping, highlighting its practical utility in robotic manipulation tasks.
| 09:00-10:30, Paper ThI1I.3 |
| Enhancing Robot Learning through Cognitive Reasoning Trajectory Optimization under Unknown Dynamics |
| Dong, Qingwei | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Wu, Tingting | China Mobile Research Institute |
| Zeng, Peng | Shenyang Institute of Automation Chinese Academy of Sciences |
| Zang, Chuanzhi | Shenyang University of Technology |
| Wan, Guangxi | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Cui, Shijie | Shenyang Institute of Automation Chinese Academy of Sciences |
| 09:00-10:30, Paper ThI1I.4 |
| Towards Optimizing a Convex Cover of Collision-Free Space for Trajectory Generation |
| Wu, Yuwei | University of Pennsylvania |
| Spasojevic, Igor | University of California, Riverside |
| Chaudhari, Pratik | University of Pennsylvania |
| Kumar, Vijay | University of Pennsylvania |
Keywords: Collision Avoidance, Computational Geometry, Aerial Systems: Applications
Abstract: We propose an online iterative algorithm that optimizes a convex cover under-approximating the collision-free space in order to delineate Safe Flight Corridors (SFC) for autonomous navigation. The convex cover consists of a set of polytopes such that the union of the polytopes represents obstacle-free space, allowing us to find trajectories for robots that lie within the convex cover. In order to find the SFC that facilitates trajectory optimization, we iteratively find overlapping polytopes of maximum volumes that include specified waypoints initialized by a geometric or kinematic planner. Constraints at waypoints appear in two alternating stages of a joint optimization problem, which is solved by a method inspired by the Alternating Direction Method of Multipliers (ADMM) with partially distributed variables. We validate the effectiveness of our proposed algorithm using a range of parameterized environments and show its applications for two-stage motion planning.
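For reference, the textbook scaled-form ADMM iteration that such alternating schemes are inspired by splits an objective f(x) + g(z) subject to Ax + Bz = c into three updates (this is the generic form, not the paper's exact formulation):

```latex
\begin{aligned}
x^{k+1} &= \arg\min_x \; f(x) + \tfrac{\rho}{2}\,\|Ax + Bz^k - c + u^k\|_2^2,\\
z^{k+1} &= \arg\min_z \; g(z) + \tfrac{\rho}{2}\,\|Ax^{k+1} + Bz - c + u^k\|_2^2,\\
u^{k+1} &= u^k + Ax^{k+1} + Bz^{k+1} - c.
\end{aligned}
```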
| 09:00-10:30, Paper ThI1I.5 |
| Embedded Hierarchical MPC for Autonomous Navigation |
| Benders, Dennis | Delft University of Technology |
| Köhler, Johannes | Imperial College London |
| Niesten, Thijs | Technical University Delft |
| Babuška, Robert | Delft University of Technology |
| Alonso-Mora, Javier | Delft University of Technology |
| Ferranti, Laura | Delft University of Technology |
Keywords: Optimization and Optimal Control, Motion and Path Planning, Collision Avoidance, Autonomous Embedded Robotics
Abstract: To efficiently deploy robotic systems in society, mobile robots need to autonomously and safely move through complex environments. Nonlinear model predictive control (MPC) methods provide a natural way to find a dynamically feasible trajectory through the environment without colliding with nearby obstacles. However, the limited computation power available on typical embedded robotic systems, such as quadrotors, poses a challenge to running MPC in real-time, including its most expensive tasks: constraints generation and optimization. To address this problem, we propose a novel hierarchical MPC scheme that consists of a planning and a tracking layer. The planner constructs a trajectory with a long prediction horizon at a slow rate, while the tracker ensures trajectory tracking at a relatively fast rate. We prove that the proposed framework avoids collisions and is recursively feasible. Furthermore, we demonstrate its effectiveness in simulations and lab experiments with a quadrotor that needs to reach a goal position in a complex static environment. The code is efficiently implemented on the quadrotor's embedded computer to ensure real-time feasibility. Compared to a state-of-the-art single-layer MPC formulation, this allows us to increase the planning horizon by a factor of 5, which results in significantly better performance.
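A minimal sketch of the two-rate structure such a hierarchical scheme implies, with a long-horizon planner running at a slow rate and a tracker at a fast rate; the periods and callables are illustrative placeholders, not the authors' code:

```python
import time

PLANNER_PERIOD = 0.5   # slow rate (s): long-horizon trajectory planning
TRACKER_PERIOD = 0.02  # fast rate (s): trajectory tracking

def control_loop(plan_traj, track_step, get_state, steps=1000):
    """Interleave a slow planning layer with a fast tracking layer."""
    reference = plan_traj(get_state())           # initial long-horizon plan
    last_plan = time.monotonic()
    for _ in range(steps):
        now = time.monotonic()
        if now - last_plan >= PLANNER_PERIOD:
            reference = plan_traj(get_state())   # replan at the slow rate
            last_plan = now
        track_step(get_state(), reference)       # track at the fast rate
        time.sleep(TRACKER_PERIOD)
```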
| 09:00-10:30, Paper ThI1I.6 |
| Enabling Embodied Human-Robot Co-Learning: Requirements, Method, and Test with Handover Task |
| van Zoelen, Emma M. | TNO, Delft University of Technology |
| Veldman-Loopik, Hugo | Delft University of Technology |
| van den Bosch, Karel | TNO |
| Neerincx, Mark | TNO, Delft University of Technology |
| Abbink, David A. | Delft University of Technology |
| Peternel, Luka | Delft University of Technology |
Keywords: Human-Robot Collaboration, Physical Human-Robot Interaction, Reinforcement Learning
Abstract: Despite a large body of research on robot learning, it has not yet been thoroughly studied how collaborating humans and robots learn reciprocally. In such situations, both humans and robots continuously learn about each other and the task through interaction. This paper addresses the research question: “How can human-robot co-learning be facilitated in physically embodied collaborative tasks?”. First, we derived five requirements for successful human-robot co-learning from the literature: shared goal, synchrony, interdependence, adaptability, and transparency. Based on these requirements, we designed a collaborative human-robot handover task and a robot Q-learning method. In an evaluation with six human participants, co-learning was indeed found to emerge in the handover task. In particular, for three of the human-robot dyads, our designed setup proved to facilitate co-learning in a way that met all five requirements. The task and robot learning method presented in this paper demonstrate how human-robot co-learning can be enabled in physically embodied tasks.
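For context, the tabular Q-learning rule underlying such a robot learner is standard; a minimal sketch (the state/action encodings and hyperparameters are placeholders, not the paper's design):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
Q = defaultdict(float)   # Q[(state, action)] -> value estimate

def choose_action(state, actions):
    """Epsilon-greedy action selection."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, actions):
    """Standard one-step Q-learning update."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                   - Q[(state, action)])
```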
| 09:00-10:30, Paper ThI1I.7 |
| How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments |
| Hwang, Ji-Hoon | Seoul National University |
| Kim, Daeyoung | Seoul National University |
| Yoon, Hyung-Suk | Seoul National University |
| Kim, Dong-Wook | Seoul National University |
| Seo, Seung-Woo | Seoul National University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Object Detection, Segmentation and Categorization
Abstract: Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach to handling distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.
| 09:00-10:30, Paper ThI1I.8 |
| QP-Based Inner-Loop Control for Constraint-Safe and Robust Trajectory Tracking for Aerial Robots |
| Balandi, Lorenzo | Inria Rennes |
| Robuffo Giordano, Paolo | IRISA, CNRS UMR 6074 |
| Tognon, Marco | Inria Rennes |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Optimization and Optimal Control
Abstract: Accurate trajectory tracking is crucial in aerial robotics. Optimal control methods such as Nonlinear Model Predictive Control (NMPC) are able to track trajectories exploiting the full nonlinear dynamics while respecting constraints. However, the model-based nature of NMPC makes it sensitive to mismatches between the nominal and real models. A common workaround to mitigate the effects of model uncertainties is to implement an inner-loop controller that robustifies the NMPC outer loop. However, this inner loop is usually based on purely feedback-based controllers such as PID or Incremental Nonlinear Dynamic Inversion (INDI), which cannot take into account constraints (such as limited actuation) or optimization criteria. In contrast, in this work we propose an optimization-based inner-loop controller inspired by Time Delay Control (TDC) that, thanks to a Quadratic Program (QP) formulation, is able to respect constraints and can thus preserve stability in the presence of input saturation and model mismatches. Furthermore, thanks to the use of acceleration feedback, the proposed inner loop does not require knowledge of the inertial parameters, which makes it even more robust against model uncertainties. The overall architecture is validated on a fully-actuated hexarotor under model mismatches and aggressive trajectories. The experiments clearly show that our QP-based inner loop improves the NMPC tracking performance while preserving stability in conditions where a non-optimal (and more classical) inner-loop controller would fail.
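A minimal sketch of the kind of QP such an inner loop could solve: track the outer-loop command as closely as possible subject to actuator limits (cvxpy is used purely for illustration; the dimensions and bounds are placeholders, not the paper's formulation):

```python
import cvxpy as cp
import numpy as np

def qp_inner_loop(u_des, u_min, u_max):
    """Find the feasible input closest to the outer-loop (NMPC) command."""
    u = cp.Variable(u_des.shape[0])
    objective = cp.Minimize(cp.sum_squares(u - u_des))
    constraints = [u >= u_min, u <= u_max]      # limited actuation
    cp.Problem(objective, constraints).solve()
    return u.value

# Outer-loop command partly exceeds the actuator limits; the QP saturates
# it in a least-squares-optimal way instead of clipping componentwise.
u_cmd = qp_inner_loop(np.array([4.0, -7.5, 2.0]),
                      u_min=np.full(3, -5.0), u_max=np.full(3, 5.0))
```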
| 09:00-10:30, Paper ThI1I.9 |
| Overcoming Explicit Environment Representations with Geometric Fabrics |
| Spahn, Max | TU Delft |
| Bakker, Saray | Delft University of Technology |
| Alonso-Mora, Javier | Delft University of Technology |
Keywords: Reactive and Sensor-Based Planning, Collision Avoidance, Motion Control
Abstract: Deployment of robots in dynamic environments requires reactive trajectory generation. While optimization-based methods such as Model Predictive Control focus on constraint verification, Geometric Fabrics offer a computationally efficient way to generate trajectories that include all avoidance behaviors if the environment can be represented as a set of object primitives. Obtaining such a representation from sensor data is challenging, especially in dynamic environments. In this paper, we integrate implicit environment representations, such as Signed Distance Fields and Free Space Decomposition, into the framework of Geometric Fabrics. In the process, we derive how numerical gradients can be integrated into the push and pull operations in Geometric Fabrics. Our experiments reveal that both ground robots and robotic manipulators can be controlled using these implicit representations. Moreover, we show that, unlike the explicit representation, implicit representations can be used in the presence of dynamic obstacles without further considerations. Finally, we demonstrate our methods in the real world, showing the applicability of our approach in practice.
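A minimal sketch of numerical gradients of an implicit representation, here central finite differences of a signed distance field queried at a point (the SDF callable is a toy stand-in):

```python
import numpy as np

def numerical_gradient(sdf, x, eps=1e-4):
    """Central-difference gradient of a signed distance field at point x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (sdf(x + e) - sdf(x - e)) / (2.0 * eps)
    return grad

# Example: SDF of a sphere of radius 0.5 centred at the origin.
sphere_sdf = lambda p: np.linalg.norm(p) - 0.5
g = numerical_gradient(sphere_sdf, np.array([1.0, 0.0, 0.0]))  # ~[1, 0, 0]
```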
| 09:00-10:30, Paper ThI1I.10 |
| OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding |
| Deng, Yinan | Beijing Institute of Technology |
| Wang, Jiahui | Beijing Institute of Technology |
| Zhao, Jingyu | Beijing Institute of Technology |
| Dou, Jianyu | Beijing Institute of Technology |
| Yang, Yi | Beijing Institute of Technology |
| Yue, Yufeng | Beijing Institute of Technology |
| 09:00-10:30, Paper ThI1I.11 |
| Motion before Action: Diffusing Object Motion As Manipulation Condition |
| Su, Yue | Xidian University |
| Zhan, Xinyu | Shanghai Jiao Tong University |
| Fang, Hongjie | Shanghai Jiao Tong University |
| Li, Yong-Lu | Shanghai Jiao Tong University |
| Lu, Cewu | ShangHai Jiao Tong University |
| Yang, Lixin | Shanghai Jiao Tong University |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA, a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks. Project page: https://selen-suyue.github.io/MBApage/
| 09:00-10:30, Paper ThI1I.12 |
| HybridTrack: A Hybrid Approach for Robust Multi-Object Tracking |
| Di Bella, Leandro | Vrije Universiteit Brussel |
| Lyu, Yangxintong | Vrije Universiteit Brussel |
| Cornelis, Bruno | Vrije Universiteit Brussel |
| Munteanu, Adrian | Vrije Universiteit Brussel |
Keywords: Intelligent Transportation Systems, Visual Tracking, Motion Control
Abstract: The evolution of Advanced Driver Assistance Systems (ADAS) has increased the need for robust and generalizable algorithms for multi-object tracking. Traditional statistical model-based tracking methods rely on predefined motion models and assumptions about system noise distributions. Although computationally efficient, they often lack adaptability to varying traffic scenarios and require extensive manual design and parameter tuning. To address these issues, we propose a novel 3D multi-object tracking approach for vehicles, HybridTrack, which integrates a data-driven Kalman Filter (KF) within a tracking-by-detection paradigm. In particular, it learns the transition residual and Kalman gain directly from data, which eliminates the need for manual motion and stochastic parameter modeling. Validated on the real-world KITTI dataset, HybridTrack achieves 82.72% HOTA accuracy, significantly outperforming state-of-the-art methods. We also evaluate our method under different configurations, achieving the fastest processing speed of 112 FPS. Consequently, HybridTrack eliminates the dependency on scene-specific designs while improving performance and maintaining real-time efficiency.
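A minimal sketch of a Kalman filter step in the spirit the abstract describes, where the transition residual and gain come from learned functions instead of hand-tuned noise models; the two callables are hypothetical stand-ins for the paper's networks:

```python
import numpy as np

def hybrid_kf_step(x, z, F, H, residual_net, gain_net):
    """One tracking step with a learned transition residual and Kalman gain."""
    x_pred = F @ x + residual_net(x)     # learned correction to the motion model
    innovation = z - H @ x_pred          # measurement residual
    K = gain_net(x_pred, innovation)     # learned gain replaces Q/R tuning
    return x_pred + K @ innovation

# Toy constant-velocity example with trivial stand-in "networks".
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
residual_net = lambda x: np.zeros_like(x)          # placeholder network
gain_net = lambda x, v: np.array([[0.6], [0.3]])   # placeholder network
x_next = hybrid_kf_step(np.array([0.0, 1.0]), np.array([1.2]), F, H,
                        residual_net, gain_net)
```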
| 09:00-10:30, Paper ThI1I.13 |
| ForestVO: Enhancing Visual Odometry in Forest Environments through ForestGlue |
| Pritchard, Thomas | Imperial College London |
| Ijaz, Saifullah | Imperial College London |
| Clark, Ronald | University of Oxford |
| Kocer, Basaran Bahadir | University of Bristol |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation
Abstract: Recent advancements in visual odometry systems have improved autonomous navigation, yet challenges persist in complex environments like forests, where dense foliage, variable lighting, and repetitive textures compromise the accuracy of feature correspondences. To address these challenges, we introduce ForestGlue. ForestGlue enhances the SuperPoint feature detector through four configurations - grayscale, RGB, RGB-D, and stereo-vision inputs - optimised for various sensing modalities. For feature matching, we employ LightGlue or SuperGlue, both of which have been retrained using synthetic forest data. ForestGlue achieves comparable pose estimation accuracy to baseline LightGlue and SuperGlue models, yet requires only 512 keypoints, just 25% of the 2048 keypoints used by baseline models, to achieve an LO-RANSAC AUC score of 0.745 at a 10 degree threshold. With only a quarter of the keypoints required, ForestGlue has the potential to reduce computational overhead whilst remaining effective in dynamic forest environments, making it a promising candidate for real-time deployment on resource-constrained platforms such as drones or mobile robotic platforms. By combining ForestGlue with a novel transformer-based pose estimation model, we propose ForestVO, which estimates relative camera poses using the 2D pixel coordinates of matched features between frames. On challenging TartanAir forest sequences, ForestVO achieves an average relative pose error (RPE) of 1.09 m and a kitti_score of 2.33%, outperforming direct-based methods such as DSO in dynamic scenes, while maintaining competitive performance with TartanVO despite being a significantly lighter model trained on only 10% of the dataset. This work establishes an end-to-end deep learning pipeline tailored for visual odometry in forested environments, leveraging forest-specific training data to optimise feature correspondence and pose estimation for improved accuracy and robustness in autonomous navigation systems.
| 09:00-10:30, Paper ThI1I.14 |
| Do You Know the Way? Human-In-The-Loop Understanding for Fast Traversability Estimation in Mobile Robotics |
| Schreiber, Andre | University of Illinois Urbana-Champaign |
| Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Keywords: Field Robots, Vision-Based Navigation, Deep Learning for Visual Perception
Abstract: The increasing use of robots in unstructured environments necessitates the development of effective perception and navigation strategies to enable field robots to successfully perform their tasks. In particular, it is key for such robots to understand where in their environment they can and cannot travel---a task known as traversability estimation. However, existing geometric approaches to traversability estimation may fail to capture nuanced representations of traversability, whereas vision-based approaches typically either involve manually annotating a large number of images or require robot experience. In addition, existing methods can struggle to address domain shifts as they typically do not learn during deployment. To this end, we propose a human-in-the-loop (HiL) method for traversability estimation that prompts a human for annotations as needed. Our method uses a foundation model to enable rapid learning on new annotations and to provide accurate predictions even when trained on a small number of quickly-provided HiL annotations. We extensively validate our method in simulation and on real-world data, and demonstrate that it can provide state-of-the-art traversability prediction performance.
| 09:00-10:30, Paper ThI1I.15 |
| A Wearable Isokinetic Training Robot for Enhanced Bedside Knee Rehabilitation |
| Feng, Yanggang | Beihang University |
| Hu, Xingyu | Beihang University |
| Li, Yuebing | Beihang University |
| Ma, Ke | Beihang University |
| Ren, Jiaxin | Beihang University |
| Zhou, Zhihao | Peking University |
| Yuan, Fuzhen | Peking University Third Hospital |
| Huang, Yan | Beijing Institute of Technology |
| Wang, Liu | University of Science and Technology of China |
| Wang, Qining | Peking University |
| Zhang, Wuxiang | Beihang University |
| Ding, Xilun | Beijing University of Aeronautics and Astronautics |
Keywords: Rehabilitation Robotics, Medical Robots and Systems, Human-Centered Robotics, Energy Regeneration
Abstract: Knee pain is prevalent in over 20% of the population, limiting the mobility of those affected. In turn, isokinetic dynamometers and robots have been used to facilitate rehabilitation for those still capable of ambulation. However, only a few wearable robots are capable of delivering isokinetic training for bedridden patients. Here, we developed a wearable robot that provides bedside isokinetic training by utilizing a variable stiffness actuator and dynamic energy regeneration. The efficacy of this device was validated in a study involving six subjects with debilitating knee injuries. During two courses of rehabilitation over a total of three weeks, the average peak torque, average torque, and average work produced by their affected knees increased significantly by 81.0%, 101.4%, and 117.6%, respectively. Furthermore, the device’s energy regeneration features were found capable of extending its operating time to 198 days under normal usage, representing a 57.8% increase over the same device without regeneration. These results suggest potential methodologies for delivering isokinetic joint rehabilitation to bedridden patients in areas with limited infrastructure.
| 09:00-10:30, Paper ThI1I.16 |
| VarWrist: An Anthropomorphic Soft Wrist with Variable Stiffness |
| Zhang, Chaozhou | Xi'an Jiaotong University |
| Li, Min | Xi'an Jiaotong University |
| Yang, Zhanshuo | Xi'an Jiaotong University |
| Kong, Xiangrui | Xi'an Jiaotong University |
| Luo, Jiayi | Hunan Readore Technology Co., Ltd |
| Liu, Yushen | Hunan Readore Technology Co., Ltd |
| Fu, Jian | Hunan Readore Technology Co., Ltd |
| Xu, Guanghua | School of Mechanical Engineering, Xi'an Jiaotong University |
| Luo, Shan | King's College London |
Keywords: Grippers and Other End-Effectors, Soft Sensors and Actuators, Hydraulic/Pneumatic Actuators
Abstract: Robotic wrists play a crucial role in enhancing the dexterity and stability of robotic end-effectors. Existing rigid robotic wrists tend to be complex and lack flexibility, while soft robotic wrists often struggle with limited load-bearing capacity and lower accuracy. Human wrists feature multi-degrees of freedom and variable stiffness, which help human hands to accomplish daily tasks. This study presents an innovative anthropomorphic soft robotic wrist, VarWrist, equipped with a fiber jamming variable stiffness module, enabling stiffness adjustment through vacuuming. VarWrist consists of three parallel bellows, utilizing a positive-negative pneumatic actuation strategy to mimic human wrist motion. In addition, the trajectory equation of the rotation center was fitted through modeling. We developed a prototype of VarWrist and assessed its performance. Results indicate that the soft wrist surpasses the motion range of human wrists, achieving flexion (81.9°), extension (78.5°), ulnar deviation (70.5°), and radial deviation (70.5°). The bending motion trajectory showed a 73% increase in similarity to human motion compared to fixed-axis rotation, with VarWrist exhibiting a significant range of variable stiffness (resting state: 206%, working state: 155%). Demonstration experiments confirm that this wrist facilitates a dexterous hand in completing grasping tasks that would be unattainable by the hand alone.
| 09:00-10:30, Paper ThI1I.17 |
| BOSS: Benchmark for Observation Space Shift in Long-Horizon Task |
| Yang, Yue | The University of North Carolina at Chapel Hill |
| Zhao, Linfeng | Northeastern University |
| Ding, Mingyu | University of North Carolina at Chapel Hill |
| Bertasius, Gedas | UNC Chapel Hill |
| Szafir, Daniel J. | University of North Carolina at Chapel Hill |
Keywords: Imitation Learning, AI-Based Methods, Data Sets for Robot Learning
Abstract: Robotics has long sought to develop robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway for achieving this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To understand OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: "Single Predicate Shift," "Accumulated Predicate Shift," and "Skill Chaining," each designed to assess a different aspect of OSS's negative effect. We evaluated several recent popular IL algorithms on BOSS, including three Behavioral Cloning methods and the Visual Language Action model OpenVLA. Even on the simplest challenge, we observed average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate three potential solutions, including using frozen robotics-specific vision encoders, switching to 3D pointcloud-based inputs, and applying data augmentation to expand visual diversity. Our results show that none of these approaches are sufficient to resolve OSS. The project page is: https://boss-benchmark.github.io/
| 09:00-10:30, Paper ThI1I.18 |
| What's the Deal with Robot Comedy? Pinpointing the Impact of Post-Joke Repartee in a Robotic Comedian's Performance |
| Robinson, Ayan | Oregon State University |
| Woods, Sarah | Oregon State University |
| Shippy, Madison | Oregon State University |
| Walcott, DeAndre | Oregon State University |
| Fitter, Naomi T. | Oregon State University |
Keywords: Art and Entertainment Robotics, Social HRI, Robot Companions
Abstract: The rise in prevalence of AI-enabled technologies (from voice assistants to social robots) has not yet been accompanied by an analogous mastery of computer-mediated humor. Although humans often use jokes to repair interactions and navigate uncomfortable scenarios, social robots in similar roles typically fall short at reading the room and adapting behavior according to sensed social contexts and reactions. We pursued two studies to gain clearer evidence about adaptive robot joking's influence (compared to hardcoded repartee or no robot banter). The first study (N = 48, between-subjects design) examined in-person one-on-one human-robot interactions across the three conditions. The results indicated that adaptive repartee by robots tended to increase perceived warmth, competence, comfort, social closeness feelings, and humorousness, and that human behavioral responses varied significantly between conditions, with any repartee leading to significant gains over no repartee. The second study used an online video-based survey with a within-subjects design (N = 99) to examine the same conditions. This follow-up effort showed significant gains in perceived competence and anthropomorphism for any type of repartee, although this banter also made the robot more discomforting. Our work can help practitioners who are interested in applying playful banter to enhance robot charm and success.
| 09:00-10:30, Paper ThI1I.19 |
| I2D-LocX: An Efficient, Precise and Robust Method for Camera Localization in LiDAR Maps |
| Yu, Huai | Wuhan University |
| Zhu, Xubo | Wuhan University |
| Han, Shu | Wuhan University |
| Yang, Wen | Wuhan University |
| Xia, Gui-Song | Wuhan University |
Keywords: Localization, SLAM, Sensor Fusion
Abstract: Camera localization within LiDAR maps has gained significant attention due to its potential for accurate positioning with low-cost and lightweight sensors compared to LiDAR-based systems. However, existing methods often prioritize localization accuracy, sometimes compromising efficiency, which can limit their suitability for real-time applications. To address these issues, we propose I2D-LocX, a lightweight monocular camera localization framework with three branches, establishing pixel-level and feature-level constraints to enhance localization performance without increasing model complexity. Specifically, the main branch generates a flow map to represent pixel-point displacements. One auxiliary branch shares the same input as the main branch and employs an additional decoder to evaluate the confidence of the flow map. The other auxiliary branch leverages a zero-flow generated from the displacement-free input to guide feature matching, thereby enhancing localization robustness. Notably, both auxiliary branches share parameters with the main branch and are omitted during inference, ensuring computational efficiency. Extensive experiments on benchmark datasets, including KITTI-Odometry, Argoverse, Waymo, and nuScenes, show that I2D-LocX can achieve centimeter-level localization accuracy with an inference time of about 37 milliseconds, greatly improving the localization performance for real-world applications.
| 09:00-10:30, Paper ThI1I.20 |
| Modeling and Reinforcement Learning-Based Control of Simultaneous Positive and Negative Pressure Generation in Pneumatic Systems |
| Park, Sang Hyeon | Sungkyunkwan University |
| Doh, Myeongyun | Sungkyunkwan University |
| Park, Chanyong | Department of Mechanical Engineering, Sungkyunkwan University |
| Luong, Tuan | Sungkyunkwan University |
| Choi, Hyouk Ryeol | Sungkyunkwan University |
| Koo, Ja Choon | Sungkyunkwan University |
| Rodrigue, Hugo | Sungkyunkwan University |
| Moon, Hyungpil | Sungkyunkwan University |
Keywords: Reinforcement Learning, Modeling, Control, and Learning for Soft Robots
Abstract: In soft robotics, actuators using both positive and negative pressures are notable for their high payload-to-weight ratios and wide operating ranges, but they require separate power sources. A single-pump system generating dual pressures presents a promising solution, though addressing pressure fluctuations due to coupled dynamics remains a challenge. In this work, we propose a reinforcement learning (RL)-based controller capable of tracking both pressures over a wide range. To facilitate RL training, we built a simulator that models not only airflow dynamics but also the pump’s kinematics and the electromagnetic behavior of pneumatic components. Our controller employs Model-Predicted Observation (MPObs) to predict future input effects and mitigate nonlinearities, and uses a Conditioning for Action Policy Smoothness (CAPS)-based action smoothing to reduce abrupt input changes. Experimental results show that the proposed RL controller achieves root-mean-square errors (RMSEs) of 0.6935 kPa (positive) and 0.2646 kPa (negative), outperforming the Disturbance Observer (DOB)-based approach. Ablation studies confirm the synergistic effect of MPObs and CAPS, underscoring their importance in control. Furthermore, robustness tests with external loads from 0 to 20 kg demonstrate a maximum RMSE of 0.7906 kPa (positive) and 0.1186 kPa (negative), indicating strong robustness. This study verifies that our proposed RL-based controller overcomes the nonlinear challenges of pneumatic power sources and highlights its potential for future stand-alone systems in field applications.
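For context, the CAPS regularization referenced above augments the policy objective with temporal and spatial smoothness terms; a minimal numpy sketch following the published CAPS formulation (weights, noise scale, and the policy callable are placeholders):

```python
import numpy as np

def caps_penalty(policy, s_t, s_next, lam_t=1.0, lam_s=0.5, sigma=0.05):
    """Penalize abrupt input changes: a temporal term between actions at
    consecutive states, and a spatial term between actions at a state and
    at a nearby perturbed state."""
    a_t, a_next = policy(s_t), policy(s_next)
    s_tilde = s_t + np.random.normal(0.0, sigma, size=s_t.shape)
    temporal = np.linalg.norm(a_t - a_next)          # smooth over time
    spatial = np.linalg.norm(a_t - policy(s_tilde))  # smooth over states
    return lam_t * temporal + lam_s * spatial

# Toy linear "policy" for illustration.
policy = lambda s: 0.8 * s
loss = caps_penalty(policy, np.array([0.2, 0.1]), np.array([0.25, 0.12]))
```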
| 09:00-10:30, Paper ThI1I.21 |
| MV3D: Multi-View 3D Reconstruction of Objects Using Forward-Looking Sonar |
| Jaber, Nael | DFKI |
| Wehbe, Bilal | German Research Center for Artificial Intelligence |
| Christensen, Leif | DFKI |
| Kirchner, Frank | University of Bremen |
Keywords: Marine Robotics, Deep Learning Methods, Deep Learning for Visual Perception
Abstract: This work proposes a method for learning features from a batch of 2D sonar images to predict a multi-view point cloud for achieving a dense 3D reconstruction. In comparison to vision-based sensors, acoustics are considered a reliable sensing modality in underwater environments. The output of sonars is a 2D image, which is unable to represent the scanned scene in all three dimensions. Estimation of this missing information, known as the elevation angle, is the key to performing 3D reconstruction from acoustic images. One approach is to predict a depth map from the 2D sonar image and transform it into a point cloud. In this paper, this idea is further improved by learning features from a batch of 2D acoustic images and predicting multiple depth maps of the scanned object, covering it from different viewpoints. For training the deep learning model, and due to the lack of datasets from real environments, data was generated synthetically. To reduce the simulation-to-real gap, a Cycle-GAN was trained on real images to transfer the realistic style onto the synthetically generated images. The conducted experiments in simulation showed that the proposed method is able to perform dense 3D reconstruction. The approach was then further tested in a real environment using an underwater vehicle, which accurately reconstructed the scanned objects in 3D, achieving an average chamfer distance error of 0.06 meters compared to a laser-scanned ground truth.
| 09:00-10:30, Paper ThI1I.22 |
| Anticipating Degradation: A Predictive Approach to Fault Tolerance in Robot Swarms |
| O'Keeffe, James | University of York |
Keywords: Swarm Robotics, Multi-Robot Systems, Failure Detection and Recovery
Abstract: An active approach to fault tolerance is essential for robot swarms to achieve long-term autonomy. Previous efforts have focused on responding to spontaneous electro-mechanical faults and failures. However, many faults occur gradually over time. This work argues that the principles of predictive maintenance, in which potential faults are resolved before they hinder the operation of the swarm, offer a promising means of achieving long-term fault tolerance. This is a novel approach to swarm fault tolerance, which is shown to give comparable or improved performance relative to a reactive approach in almost all cases tested.
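A minimal sketch of the predictive-maintenance idea applied to a swarm member: extrapolate a gradually degrading health signal and schedule repair before it crosses a failure threshold (the metric, rates, and thresholds are illustrative only, not the paper's method):

```python
import numpy as np

def predict_failure_time(times, health, threshold=0.2):
    """Linearly extrapolate a degrading health signal to the time it will
    cross the failure threshold; returns None if there is no downward trend."""
    slope, intercept = np.polyfit(times, health, 1)
    if slope >= 0:
        return None
    return (threshold - intercept) / slope

times = np.array([0.0, 10.0, 20.0, 30.0])
health = np.array([1.0, 0.8, 0.6, 0.4])          # gradual degradation
t_fail = predict_failure_time(times, health)     # extrapolated: t = 40.0
if t_fail is not None and t_fail - times[-1] < 15.0:
    print("schedule maintenance before the fault hinders the swarm")
```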
| 09:00-10:30, Paper ThI1I.23 |
| SoMaSLAM: 2D Graph SLAM for Sparse Range Sensing with Soft Manhattan World Constraints |
| Han, Jeahn | Gwangju Institute of Science and Technology |
| Hu, Zichao | University of Texas at Austin |
| Yang, Seonmo | Gwangju Institute of Science and Technology (GIST) |
| Kim, Minji | Gwangju Institute of Science and Technology |
| Kim, Pyojin | Gwangju Institute of Science and Technology (GIST) |
Keywords: SLAM, Range Sensing, Aerial Systems: Perception and Autonomy
Abstract: We propose a graph SLAM algorithm for sparse range sensing that incorporates a soft Manhattan world utilizing landmark-landmark constraints. Sparse range sensing is necessary for tiny robots that do not have the luxury of using heavy and expensive sensors. Existing SLAM methods dealing with sparse range sensing lack accuracy and accumulate drift error over time due to limited access to data points. Algorithms that cover this flaw using structural regularities, such as the Manhattan world (MW), have shortcomings when mapping real-world environments that do not coincide with the rules. We propose SoMaSLAM, a 2D graph SLAM designed for tiny drones with sparse range sensing. Our approach effectively maps sparse range data without enforcing strict structural regularities and maintains an adaptive graph. We implement the MW assumption as soft constraints, which we refer to as a soft Manhattan world. We propose novel soft landmark-landmark constraints to incorporate the soft MW into graph SLAM. Through extensive evaluation, we demonstrate that our proposed SoMaSLAM method improves localization accuracy on diverse datasets and is flexible enough to be used in the real world. We plan to release our code and dataset on our project page https://SoMaSLAM.github.io/.
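One illustrative reading of a soft Manhattan-world constraint is a down-weighted residual pulling the relative orientation of two wall landmarks toward the nearest multiple of 90 degrees; a minimal sketch (this is our interpretation for illustration, not the authors' exact factor):

```python
import numpy as np

def soft_mw_residual(theta_i, theta_j, weight=0.5):
    """Residual pulling two landmark orientations toward a Manhattan
    relation (relative angle near a multiple of 90 degrees)."""
    rel = theta_i - theta_j
    nearest = np.round(rel / (np.pi / 2)) * (np.pi / 2)
    return weight * (rel - nearest)   # soft: scaled, not enforced exactly

# Two walls at 2.0 deg and -88.5 deg: relative angle 90.5 deg, so the
# residual gently pulls them toward a perfect 90 deg relation.
r = soft_mw_residual(np.deg2rad(2.0), np.deg2rad(-88.5))
```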
| 09:00-10:30, Paper ThI1I.24 |
| Image-Based Roadmaps for Vision-Only Planning and Control of Robotic Manipulators |
| Chatterjee, Sreejani | Worcester Polytechnic Institute |
| Gandhi, Abhinav | University of Twente |
| Calli, Berk | Worcester Polytechnic Institute |
| Chamzas, Constantinos | Worcester Polytechnic Institute |
Keywords: Motion and Path Planning, Collision Avoidance, Integrated Planning and Control
Abstract: This work presents a motion planning framework for robotic manipulators that computes collision-free paths directly in image space. The generated paths can then be tracked using vision-based control, eliminating the need for an explicit robot model or proprioceptive sensing. At the core of our approach is the construction of a roadmap entirely in image space. To achieve this, we explicitly define sampling, nearest-neighbor selection, and collision checking based on visual features rather than geometric models. We first collect a set of image-space samples by moving the robot within its workspace, capturing keypoints along its body at different configurations. These samples serve as nodes in the roadmap, which we construct using either learned or predefined distance metrics. At runtime, the roadmap generates collision-free paths directly in image space, removing the need for a robot model or joint encoders. We validate our approach through an experimental study in which a robotic arm follows planned paths using an adaptive vision-based control scheme to avoid obstacles. The results show that paths generated with the learned-distance roadmap achieved 100% success in control convergence, whereas the predefined image-space distance roadmap enabled faster transient responses but had a lower success rate in convergence.
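A minimal sketch of constructing such a roadmap in image space: nodes are stacked keypoint coordinates captured at visited configurations, and edges connect k nearest neighbours under a pluggable distance metric and collision check (both given trivial stand-ins here in place of the learned versions):

```python
import numpy as np

def build_image_space_roadmap(samples, k=5, metric=None, edge_free=None):
    """samples: (N, 2*K) array of K image keypoints per configuration."""
    metric = metric or (lambda a, b: np.linalg.norm(a - b))
    edge_free = edge_free or (lambda a, b: True)   # stand-in collision check
    edges = []
    for i, s in enumerate(samples):
        dists = [(metric(s, t), j) for j, t in enumerate(samples) if j != i]
        for _, j in sorted(dists)[:k]:             # k nearest neighbours
            if edge_free(s, samples[j]):
                edges.append((i, j))
    return edges

samples = np.random.rand(50, 8)   # 50 configurations, 4 keypoints each
roadmap = build_image_space_roadmap(samples)
```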
| 09:00-10:30, Paper ThI1I.25 |
| Haptic Stiffness Perception Using Hand Exoskeletons in Tactile Robotic Telemanipulation |
| Giudici, Gabriele | University College London |
| Coppola, Claudio | Humanoid AI |
| Althoefer, Kaspar | Queen Mary University of London |
| Farkhatdinov, Ildar | King's College London |
| Jamone, Lorenzo | University College London |
Keywords: Haptics and Haptic Interfaces, Telerobotics and Teleoperation, Perception for Grasping and Manipulation
Abstract: Robotic telemanipulation—the human-guided manipulation of remote objects—plays a pivotal role in several applications, from healthcare to operations in harsh environments. While visual feedback from cameras can provide valuable information to the human operator, haptic feedback is essential for accessing specific object properties that are difficult to perceive through vision, such as stiffness. For the first time, we present a participant study demonstrating that operators can perceive the stiffness of remote objects during real-world telemanipulation with a dexterous robotic hand, when haptic feedback is generated from tactile sensing fingertips. Participants were tasked with squeezing soft objects by teleoperating a robotic hand, using two methods of haptic feedback: one based solely on the measured contact force, while the second also includes the squeezing displacement between the leader and follower devices. Our results demonstrate that operators are indeed capable of discriminating objects of different stiffness, relying on haptic feedback alone and without any visual feedback. Additionally, our findings suggest that the displacement feedback component may enhance discrimination with objects of similar stiffness.
| 09:00-10:30, Paper ThI1I.26 |
| A Modular Residual Learning Framework to Enhance Model-Based Approach for Robust Locomotion |
| Kim, Min-Gyu | KAIST |
| Kang, Dongyun | Korea Advanced Institute of Science and Technology |
| Kim, Hajun | Korea Advanced Institute of Science and Technology |
| Park, Hae-Won | Korea Advanced Institute of Science and Technology |
Keywords: Legged Robots, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: This paper presents a novel approach that combines the advantages of both model-based and learning-based frameworks to achieve robust locomotion. The residual modules are integrated with each corresponding part of the model-based framework, a footstep planner and dynamic model designed using heuristics, to complement performance degradation caused by a model mismatch. By utilizing a modular structure and selecting the appropriate learning-based method for each residual module, our framework demonstrates improved control performance in environments with high uncertainty, while also achieving higher learning efficiency compared to baseline methods. Moreover, we observed that our proposed methodology not only enhances control performance but also provides additional benefits, such as making nominal controllers more robust to parameter tuning. To investigate the feasibility of our framework, we demonstrated residual modules combined with model predictive control in a real quadrupedal robot. Despite uncertainties beyond the simulation, the robot successfully maintains balance and tracks the commanded velocity.
| 09:00-10:30, Paper ThI1I.27 |
| Design, Modeling, and Experimental Characterization of a Rod-Driven Continuum Robot with Asymmetric Joints for Active Chest Catheters |
| Lari, Mohammadmehdi | University of Rome Tor Vergata |
| Russo, Matteo | University of Rome Tor Vergata |
Keywords: Mechanism Design, Tendon/Wire Mechanism, Soft Robot Materials and Design
Abstract: Thoracostomy involves draining fluid from the pleural cavity using chest tubes. This medical intervention is currently performed manually by inserting a hollow flexible tube, risking damage to vital organs, including the lungs, diaphragm, spleen, and mediastinum, due to the lack of control over the tube’s path inside the patient’s body. Inspired by snake-like structures, continuum robots are particularly well-suited to address the challenges encountered during thoracostomy. Taking advantage of their slender shape, they can nest inside the tubes and guide them from within without requiring further incision. However, available continuum robots are not suitable for this application due to geometrical and payload requirements. In this paper, a novel design is presented, leveraging a multi-backbone structure with asymmetrical rolling joints to enhance payload capacity and dexterity while maintaining the slender shape of the robot. A static modeling approach is proposed to estimate the configuration of the robot given the force applied to it, including the effects of friction and gravity, which are often neglected for these robots. Two prototypes were 3D-printed, allowing for after-use disposal due to their cost-effectiveness, thereby preventing cross-contamination. Stiffness and position error were evaluated for the prototypes, demonstrating a modeling accuracy of 2.25%.
| 09:00-10:30, Paper ThI1I.28 |
| Localized Coverage Planning for a Heat Transfer Tube Inspection Robot |
| Li, Jiawei | Harbin Engineering University |
| Liu, Zhaojin | Harbin Engineering University |
| Li, Yuxiao | Harbin Engineering University |
| Li, Yuanyue | Harbin Engineering University |
| Huang, Yimin | Harbin Engineering University |
| Wang, Gang | Harbin Engineering University |
Keywords: Task and Motion Planning, Industrial Robots, Robotics and Automation in Construction
Abstract: The heat transfer tubes of the steam generator are critical components of the nuclear power system and require regular inspection to ensure safety. The SG-Climbot, a quadruped heat transfer tube inspection robot, is equipped with a guiding device capable of simultaneously aligning with and inspecting two heat transfer tubes. Furthermore, the guiding device must execute hundreds of pose configuration transformations to complete a localized coverage inspection, presenting challenges to the robot’s efficient autonomous planning. This letter presents a planning framework for the SG-Climbot’s localized coverage inspection task. The framework consists of four planning levels: pair planning, position and orientation planning for the guiding device, inspection sequence planning, and time-optimal trajectory planning. A maximum matching algorithm suitable for robotic arms equipped with dual execution devices is proposed, achieving optimal pairing of the heat transfer tubes and reducing inspection time by over 48 minutes (an 18.32% improvement). In addition, we analyze the impact of various Traveling Salesman Problem (TSP) solving algorithms on sequence planning problems that require reaching numerous nodes within short operation times, reducing the arm operating time by 33.20 s (a 6.99% improvement). Finally, the effectiveness of the proposed planning algorithm was validated through simulations and experiments.
| 09:00-10:30, Paper ThI1I.29 |
| Tidiness Score-Guided Monte Carlo Tree Search for Visual Tabletop Rearrangement |
| Kee, Hogun | Seoul National University |
| Oh, Wooseok | Seoul National University |
| Kang, Minjae | Seoul National University (SNU) |
| Ahn, Hyemin | POSTECH |
| Oh, Songhwai | Seoul National University |
Keywords: Manipulation Planning, Data Sets for Robot Learning, Deep Learning Methods
Abstract: In this paper, we present the tidiness score-guided Monte Carlo tree search (TSMCTS), a novel framework designed to address the tabletop tidying up problem using only an RGB-D camera. We address two major problems in tabletop tidying up: (1) the lack of public datasets and benchmarks, and (2) the difficulty of specifying the goal configuration of unseen objects. We tackle the former by presenting the tabletop tidying up (TTU) dataset, a structured dataset collected in simulation. Using this dataset, we train a vision-based discriminator capable of predicting the tidiness score. This discriminator can consistently evaluate the degree of tidiness across unseen configurations, including real-world scenes. To address the second problem, we employ Monte Carlo tree search (MCTS) to find tidying trajectories without specifying explicit goals. Instead of providing specific goals, we demonstrate that our MCTS-based planner can find diverse tidied configurations using the tidiness score as guidance. Consequently, we propose TSMCTS, which integrates a tidiness discriminator with an MCTS-based tidying planner to find optimal tidied arrangements. TSMCTS has successfully demonstrated its capability across various environments, including coffee tables, dining tables, office desks, and bathrooms. The TTU dataset and code will be publicly available.
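For context, the UCB1 selection rule at the heart of MCTS balances exploitation and exploration; a minimal sketch with the tidiness score standing in for the rollout value (the node bookkeeping is illustrative, not the paper's implementation):

```python
import math

def ucb1_select(children, c=1.4):
    """Pick the child maximizing value estimate plus exploration bonus."""
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")                     # expand unvisited first
        mean_tidiness = ch["value"] / ch["visits"]  # discriminator-guided value
        return mean_tidiness + c * math.sqrt(math.log(total) / ch["visits"])
    return max(children, key=score)

children = [{"visits": 3, "value": 2.1}, {"visits": 1, "value": 0.9},
            {"visits": 0, "value": 0.0}]
best = ucb1_select(children)   # the unvisited child is selected first
```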
| 09:00-10:30, Paper ThI1I.30 |
| A High-Payload Robotic Hopper Powered by Bidirectional Thrusters |
| Li, Song | City University of Hong Kong |
| Bai, Songnan | University of Hong Kong |
| Jia, Ruihan | City University of Hong Kong |
| Cai, Yixi | KTH Royal Institute of Technology |
| Ding, Runze | City University of Hong Kong |
| Shi, Yu | City University of Hong Kong |
| Zhang, Fu | University of Hong Kong |
| Chirarattananon, Pakpong | City University of Hong Kong |
Keywords: Aerial Systems: Mechanics and Control, Legged Robots, Dynamics, Hybrid Locomotion
Abstract: Mobile robots have revolutionized various fields, offering solutions for manipulation, environmental monitoring, and exploration. However, payload capacity remains a limitation. This paper presents a novel thrust-based robotic hopper capable of carrying payloads up to 9 times its own weight while maintaining agile mobility over less structured terrain. The 220-gram robot carries up to 2 kg while hopping, a capability that bridges the gap between high-payload ground robots and agile aerial platforms. Key advancements that enable this high-payload capacity include the integration of bidirectional thrusters, allowing for both upward and downward thrust generation to enhance energy management while hopping. Additionally, we present a refined model of the dynamics that accounts for heavy payload conditions, particularly for large jumps. To address the increased computational demands, we employ a neural network compression technique, ensuring real-time onboard control. The robot's capabilities are demonstrated through a series of experiments, including leaping over a high obstacle, executing sharp turns with large steps, and performing simple autonomous navigation while carrying a 730 g LiDAR payload. This showcases the robot's potential for applications such as mobile sensing and mapping in challenging environments.
| 09:00-10:30, Paper ThI1I.31 |
| Differentiable Motion Manifold Primitives for Reactive Motion Generation under Kinodynamic Constraints |
| Lee, Yonghyeon | Massachusetts Institute of Technology |
Keywords: Representation Learning, Learning from Demonstration
Abstract: Real-time motion generation -- which is essential for achieving reactive and adaptive behavior -- under kinodynamic constraints for high-dimensional systems is a crucial yet challenging problem. We address this with a two-step approach: offline learning of a lower-dimensional trajectory manifold of task-relevant, constraint-satisfying trajectories, followed by rapid online search within this manifold. Extending the discrete-time Motion Manifold Primitives (MMP) framework, we propose Differentiable Motion Manifold Primitives (DMMP), a novel neural network architecture that encodes and generates continuous-time, differentiable trajectories, trained using data collected offline through trajectory optimizations, with a strategy that ensures constraint satisfaction -- absent in existing methods. Experiments on dynamic throwing with a 7-DoF robot arm demonstrate that DMMP outperforms prior methods in planning speed, task success, and constraint satisfaction.
| 09:00-10:30, Paper ThI1I.32 |
| MonoKey: Keypoint-Based Monocular 3D Object Detection Using Prior Guidance for Occlusion Robustness |
| Cho, Yeon Woo | Chonnam National University |
| Cheon, Jung Woo | Chonnam National University |
| Yoon, Jae Hyun | Chonnam National University |
| Yoo, Seok Bong | Chonnam National University |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Automation, AI-Based Methods
Abstract: Monocular 3D object detection has gained attention for its cost-efficiency and simpler setup compared to multi-sensor systems. In this task, accurate depth estimation is crucial for precise object localization, yet extracting sufficient depth cues from a single image remains inherently challenging. Moreover, when occlusions occur, structural cues become limited, making precise object localization even more difficult. To address these problems, we propose MonoKey, a keypoint-based monocular 3D object detection method robust to occlusion. MonoKey leverages 2D keypoints due to their suitability for recovering occluded regions. The occlusion-robust 2D keypoint detection approach estimates keypoints and reconstructs occluded ones by using prior information. The frequency-based global-local depth predictor estimates 3D cues using fast Fourier convolution to incorporate both global and local context. These 3D cues and keypoints are then fused in a 3D detection decoder. Additionally, relational graph refinement adjusts initial bounding boxes for improved localization. Experimental results indicate that MonoKey outperforms the existing monocular 3D object detection methods. The source code is available at https://anonymous.4open.science/r/MonoKey-B72B.
| 09:00-10:30, Paper ThI1I.33 |
| FDSPC: Fast and Direct Smooth Motion Planning Via Continuous Curvature Integration |
| Chen, Zong | Huazhong University of Science and Technology |
| Shao, Haoluo | Huazhong University of Science and Technology |
| Liu, Ben | Huazhong University of Science and Technology |
| Qiao, Siyuan | Huazhong University of Science and Technology |
| Zhou, Yu | Huazhong University of Science and Technology |
| Li, Yiqun | Huazhong University of Science and Technology |
Keywords: Motion and Path Planning, Task and Motion Planning, Simulation and Animation
Abstract: In recent decades, mobile robot motion planning has seen significant advancements. Both search-based and sampling-based methods have demonstrated capabilities to find feasible solutions in complex scenarios. Mainstream path planning algorithms divide the map into occupied and free spaces, considering only planar movement and ignoring the ability of mobile robots to traverse obstacles in the z-direction. Additionally, the paths generated often have numerous bends, requiring additional smoothing post-processing. In this work, a fast and direct motion planning method based on continuous curvature integration is proposed that takes into account the robot's obstacle-crossing ability under different parameter settings. This method directly generates smooth paths with pseudo-constant velocity and limited curvature, and performs curvature-based speed planning in complex 2.5-D terrain-based environments (taking into account the ups and downs of the terrain), eliminating the subsequent path smoothing process and enabling the robot to track the generated path directly. The proposed method is also compared with existing approaches in terms of solution time, path length, memory usage, and smoothness under multiple scenarios. It is vastly superior to the average performance of state-of-the-art (SOTA) methods, especially in terms of the self-defined S_2 smoothness (mean angle of steering). Furthermore, simulations and experiments are conducted on our self-designed wheel-legged robot with 2.5-D traversability. These results demonstrate the effectiveness and superiority of the proposed approach in several representative environments. The implementation of this work is available at https://github.com/SkelonChan/GPCC_curvature_planning.
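For context, integrating a continuous curvature profile along arc length yields a smooth, curvature-bounded planar path; a minimal sketch using the standard planar kinematics (the curvature profile and step size are illustrative, not the paper's formulation):

```python
import numpy as np

def integrate_curvature(kappa, s_max, ds=0.01, x0=0.0, y0=0.0, theta0=0.0):
    """Integrate heading theta'(s) = kappa(s) and position along arc
    length s; a continuous kappa yields a smooth, curvature-bounded path."""
    xs, ys, theta = [x0], [y0], theta0
    for s in np.arange(0.0, s_max, ds):
        theta += kappa(s) * ds
        xs.append(xs[-1] + np.cos(theta) * ds)
        ys.append(ys[-1] + np.sin(theta) * ds)
    return np.array(xs), np.array(ys)

# Clothoid-like segment: curvature ramps linearly, staying within a limit.
kappa = lambda s: np.clip(0.4 * s, 0.0, 1.0)
xs, ys = integrate_curvature(kappa, s_max=5.0)
```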
|
| |
| 09:00-10:30, Paper ThI1I.34 | Add to My Program |
| DCL-Sparse: Distributed Relative Localization in Sparse Graphs |
|
| Sagale, Atharva | University of Georgia |
| Kargar Tasooji, Tohid | University of Georgia |
| Parasuraman, Ramviyas | University of Georgia |
Keywords: Multi-Robot Systems, Sensor Networks, Networked Robots
Abstract: This paper presents a novel approach to range-based distributed cooperative localization (DCL) for robot swarms in GPS-denied environments that relies solely on inter-robot range measurements. It specifically addresses the limitations of current methods in noisy and sparse settings, where the geometric non-rigidity of the sensing graph creates flipping (suboptimal) effects in the localization outcomes. We propose a robust multilayered localization framework (DCL-Sparse) that utilizes distributed 1-hop shadow edges (S1-Edge) to address the non-rigidity problem and improve localization convergence in sparse and noisy sensing graphs. Our approach leverages the advantages of distributed localization methods, enhancing scalability and adaptability in large robot networks. We establish theoretical conditions for the new S1-Edge that ensure solutions exist even in the presence of noise, thereby validating the effectiveness of the new shadow-edge localization. Extensive simulation and real-world experiments confirm the superior performance of our method compared to state-of-the-art techniques, resulting in a reduction of up to 93% in the DCL localization error. These experiments demonstrate substantial improvements in localization accuracy and robustness to sparse graphs. DCL-Sparse increases the localizability of large multi-robot and sensor networks, offering a powerful tool for high-performance and reliable operations in challenging large-scale environments.
|
| |
| 09:00-10:30, Paper ThI1I.35 | Add to My Program |
| Previous Knowledge Utilization in Online Anytime Belief Space Planning |
|
| Novitsky, Michael | Technion - Israel Institute of Technology |
| Barenboim, Moran | Technion - Israel Institute of Technology |
| Indelman, Vadim | Technion - Israel Institute of Technology |
Keywords: Planning under Uncertainty, Autonomous Agents
Abstract: Online planning under uncertainty remains a critical challenge in robotics and autonomous systems. While tree search techniques are commonly employed to construct partial future trajectories within computational constraints, most existing methods for continuous spaces discard information from previous planning sessions. This study presents a novel, computationally efficient approach that leverages historical planning data in current decision-making processes. We provide theoretical foundations for our information reuse strategy and introduce an algorithm based on Monte Carlo Tree Search (MCTS) that implements this approach. Experimental results demonstrate that our method significantly reduces computation time while maintaining high performance levels. Our findings suggest that integrating historical planning information can substantially improve the efficiency of online decision-making in uncertain environments, paving the way for more responsive and adaptive autonomous systems.
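The reuse idea can be illustrated in its simplest form: keep the subtree of the previous search that corresponds to the executed action and the received observation, rather than starting from an empty tree. The sketch below uses a hypothetical node layout and omits the paper's theoretical machinery for continuous spaces.

```python
class Node:
    """Minimal MCTS node; the (action, observation) -> child layout is an
    assumption for illustration, not the authors' data structure."""
    def __init__(self):
        self.children = {}   # (action, observation) -> Node
        self.visits = 0
        self.value = 0.0

def reuse_subtree(root, action, observation):
    """Previous-knowledge reuse at its simplest: promote the child reached
    by the executed action and received observation to be the new root,
    preserving its visit counts and value estimates."""
    child = root.children.get((action, observation))
    return child if child is not None else Node()  # fall back to a fresh tree
```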
|
| |
| 09:00-10:30, Paper ThI1I.36 | Add to My Program |
| Innovative Design of Multi-Functional Supernumerary Robotic Limbs with Ellipsoid Workspace Optimization |
|
| Huo, Jun | Huazhong University of Science and Technology |
| Huang, Jian | Huazhong University of Science and Technology |
| Zuo, Jie | Wuhan University of Technology |
| Yang, Bo | Huazhong University of Science and Technology |
| Fu, Zhongzheng | Huazhong University of Science and Technology |
| Li, Xi | Huazhong University of Science and Technology |
| Mohammed, Samer | University of Paris Est Créteil - (UPEC) |
Keywords: Wearable Robots, Mechanism Design, Optimization and Optimal Control, Rehabilitation Robotics
Abstract: Supernumerary robotic limbs (SRL) offer substantial potential both in the rehabilitation of hemiplegic patients and in the enhancement of functional capabilities for healthy individuals. Designing a general-purpose SRL device is inherently challenging, particularly when developing a unified theoretical framework that meets the diverse functional requirements of both upper and lower limbs. In this paper, we propose a multi-objective optimization (MOO) design theory that integrates grasping workspace similarity, walking workspace similarity, bracing force for sit-to-stand (STS) movements, and overall mass and inertia. To facilitate rapid and stable convergence of the model to high-dimensional irregular Pareto fronts, we introduce a multi-subpopulation correction firefly algorithm. The optimized solution is used to redesign the prototype so that it meets the specified experimental requirements. Six healthy participants and two hemiplegic patients took part in real-world experiments. Compared to the pre-optimization results, the average grasp success rate improved by 7.2%, while muscle activity during walking and STS tasks decreased by an average of 12.7% and 25.1%, respectively, following the optimization.
|
| |
| 09:00-10:30, Paper ThI1I.37 | Add to My Program |
| Estimating Trust in Human-Robot Collaboration through Behavioral Indicators and Explainability |
|
| Campagna, Giulio | Aalborg University |
| Lagomarsino, Marta | Istituto Italiano Di Tecnologia |
| Lorenzini, Marta | Istituto Italiano Di Tecnologia |
| Chrysostomou, Dimitrios | Aalborg University |
| Rehm, Matthias | Aalborg University |
| Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Human Factors and Human-in-the-Loop, Acceptability and Trust, Human-Robot Collaboration
Abstract: Industry 5.0 focuses on human-centric collaboration between humans and robots, prioritizing safety, comfort, and trust. This study introduces a data-driven framework to assess trust using behavioral indicators. The framework employs a Preference-Based Optimization algorithm to generate trust-enhancing trajectories based on operator feedback. This feedback serves as ground truth for training machine learning models to predict trust levels from behavioral indicators. The framework was tested in a chemical industry scenario where a robot assisted a human operator in mixing chemicals. Machine learning models classified trust with over 80% accuracy, with the Voting Classifier achieving 84.07% accuracy and an AUC-ROC score of 0.90. These findings underscore the effectiveness of data-driven methods in assessing trust within human-robot collaboration, emphasizing the valuable role behavioral indicators play in predicting the dynamics of human trust.
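As a rough illustration of the classification stage, a soft voting classifier of the kind reported can be assembled in scikit-learn; the features and labels below are synthetic stand-ins, and the base estimators are assumptions rather than the study's exact models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-ins for behavioral indicators (e.g., operator distance,
# hesitation time) and binary trust labels derived from operator feedback.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("svc", SVC(probability=True, random_state=0))],
    voting="soft",  # average class probabilities across the base models
).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
print(f"accuracy={accuracy_score(y_te, (proba > 0.5).astype(int)):.2f}, "
      f"AUC-ROC={roc_auc_score(y_te, proba):.2f}")
```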
|
| |
| 09:00-10:30, Paper ThI1I.38 | Add to My Program |
| HADEC - High-Response Artificial Muscle Actuator Using Dimethyl Ether Combustion |
|
| Mori, Kengo | Chuo University |
| Tsurumi, Koya | Chuo University |
| Sawahashi, Ryunosuke | Chuo University |
| Enjo, Ryuto | Chuo University |
| Nakamura, Taro | Chuo University |
| Okui, Manabu | Chuo University |
Keywords: Soft Robot Applications, Hydraulic/Pneumatic Actuators, Human-Robot Collaboration
Abstract: This paper introduces a high-response artificial muscle actuator using dimethyl ether combustion (HADEC), a novel method to enhance the responsiveness and force output of pneumatic actuators. The HADEC system integrates a McKibben-type artificial muscle filled with a combustible mixture of dimethyl ether (DME) and air, which is ignited to generate rapid fluid pressure through combustion. This approach achieves force, displacement, and response speeds comparable to those of biological muscles while maintaining the simplicity and low-cost structure of McKibben-type actuators. The system provides instantaneous force generation without the need for complex mechanisms such as latches or brakes. DME, an environmentally friendly fuel, ensures minimal emissions. Experimental results validate the effectiveness of HADEC in improving responsiveness, and the findings indicate superior force generation, faster response times, and higher-frequency operability compared to conventional pneumatic actuators. Further, the paper discusses the potential for repeated actuation and highlights the benefits of HADEC in various robotic applications that require rapid and significant force.
|
| |
| 09:00-10:30, Paper ThI1I.39 | Add to My Program |
| Active Learning Design: Modeling Force Output for Axisymmetric Soft Pneumatic Actuators |
|
| Campbell, Gregory | Lafayette College |
| Muhaxheri, Gentian | Syracuse University |
| Guilhoto, Leonardo Ferreira | University of Pennsylvania |
| Santangelo, Christian | Syracuse University |
| Perdikaris, Paris | University of Pennsylvania |
| Pikul, James | University of Wisconsin-Madison |
| Yim, Mark | University of Pennsylvania |
Keywords: Soft Robot Materials and Design, Hydraulic/Pneumatic Actuators
Abstract: Soft pneumatic actuators (SPA) made from elastomeric materials can provide large strain and large force. The behavior of locally strain-restricted hyperelastic materials under inflation has been investigated thoroughly for shape reconfiguration, but requires further investigation for trajectories involving external force. In this work we model force-pressure-height relationships for a concentrically strain-limited class of soft pneumatic actuators and demonstrate the use of this model to design SPA response for object lifting. We predict relationships under different loadings by solving energy minimization equations and verify this theory by using an automated test rig to collect rich data for n=22 Ecoflex 00-30 membranes. We collect data using an active learning pipeline to efficiently model the design space. We show that this learned model outperforms the theory-based model and a naive regression. We use our model to optimize membrane design for different lift tasks and compare this performance to other designs. These contributions represent a step towards understanding the natural response for this class of actuator and embodying intelligent lifts in a single-pressure input actuator system.
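The active learning loop itself, fit a surrogate, query the most uncertain design point, and refit, can be sketched briefly. The snippet below assumes a 1-D design variable and a hypothetical measurement function; the paper's pipeline and models are richer.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def measure_lift(h):
    """Hypothetical test-rig measurement of lift force at membrane design h."""
    return np.sin(3.0 * h) + 0.05 * np.random.default_rng(int(1e3 * h)).normal()

# Uncertainty-driven querying over a 1-D design variable.
candidates = np.linspace(0.0, 2.0, 200).reshape(-1, 1)
X, y = [[0.0]], [measure_lift(0.0)]
gp = GaussianProcessRegressor()
for _ in range(10):
    gp.fit(np.array(X), np.array(y))
    _, std = gp.predict(candidates, return_std=True)
    query = candidates[np.argmax(std)]      # most uncertain design point
    X.append(query.tolist())
    y.append(measure_lift(query[0]))
print("queried designs:", np.round(np.ravel(X), 2))
```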
|
| |
| 09:00-10:30, Paper ThI1I.40 | Add to My Program |
| Line-Search Filter Differential Dynamic Programming for Optimal Control with Nonlinear Equality Constraints |
|
| Xu, Mingda | EPFL |
| Gould, Stephen | Australian National University |
| Shames, Iman | The University of Melbourne |
Keywords: Optimization and Optimal Control, Constrained Motion Planning
Abstract: We present FilterDDP, a differential dynamic programming algorithm for solving discrete-time, optimal control problems (OCPs) with nonlinear equality constraints. Unlike prior methods based on merit functions or the augmented Lagrangian class of algorithms, FilterDDP uses a step filter in conjunction with a line search to handle equality constraints. We identify two important design choices for the step filter criteria which lead to robust numerical performance: 1) we use the Lagrangian instead of the cost in the step acceptance criterion and, 2) in the backward pass, we perturb the value function Hessian. Both choices are rigorously justified, for 2) in particular by a formal proof of local quadratic convergence. In addition to providing a primal-dual interior point extension for handling OCPs with both equality and inequality constraints, we validate FilterDDP on three contact implicit trajectory optimisation problems which arise in robotics.
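The filter acceptance test at the heart of such methods can be sketched in a few lines: a trial step is kept only if, against every stored filter pair, it sufficiently improves either the constraint violation or, per the paper's first design choice, the Lagrangian. The margins and data layout below are illustrative assumptions.

```python
def acceptable(filter_pairs, theta_new, lagr_new, gamma=1e-5):
    """Step-filter acceptance in the classic Fletcher-Leyffer sense, with the
    Lagrangian (rather than the cost) as the second filter entry. A trial
    point is rejected if some stored pair (constraint violation, Lagrangian)
    dominates it within the margins."""
    for theta_j, lagr_j in filter_pairs:
        if theta_new >= (1 - gamma) * theta_j and lagr_new >= lagr_j - gamma * theta_j:
            return False  # dominated by an existing filter entry
    return True

# Typical line-search use: shrink the step until the filter accepts it.
filter_pairs = [(1.0, 5.0), (0.3, 7.0)]
print(acceptable(filter_pairs, theta_new=0.25, lagr_new=6.0))  # True
```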
|
| |
| 09:00-10:30, Paper ThI1I.41 | Add to My Program |
| Grasp, Slide, Roll: Comparative Analysis of Contact Modes for Tactile-Based Shape Reconstruction |
|
| Kim, Chung Hee | Carnegie Mellon University |
| Kamtikar, Shivani Kiran | University of Illinois at Urbana-Champaign |
| Brady, Tye | Amazon |
| Padir, Taskin | Northeastern University |
| Migdal, Joshua | Amazon Robotics |
Keywords: Multifingered Hands, Force and Tactile Sensing
Abstract: Tactile sensing allows robots to gather detailed geometric information about objects through physical interaction, complementing vision-based approaches. However, efficiently acquiring useful tactile data remains challenging due to the time-consuming nature of physical contact and the need to strategically choose contact locations that maximize information gain while minimizing physical interactions. This paper studies how different contact modes affect object shape reconstruction using a tactile-enabled dexterous gripper. We compare three contact interaction modes: grasp-releasing, sliding induced by finger-grazing, and palm-rolling. These contact modes are combined with an information-theoretic exploration framework that guides subsequent sampling locations using a shape completion model. Our results show that the improved tactile sensing efficiency of finger-grazing and palm-rolling translates into faster convergence in shape reconstruction, requiring 34% fewer physical interactions while improving reconstruction accuracy by 55%. We validate our approach using a UR5e robot arm equipped with an Inspire-Robots Dexterous Hand, showing robust performance across primitive object geometries.
|
| |
| 09:00-10:30, Paper ThI1I.42 | Add to My Program |
| StereoMamba: Real-Time and Robust Intraoperative Stereo Disparity Estimation Via Long-Range Spatial Dependencies |
|
| Wang, Xu | University College London |
| Xu, Jialang | University College London |
| Zhang, Shuai | University College London |
| Huang, Baoru | Imperial College London |
| Stoyanov, Danail | University College London |
| Mazomenos, Evangelos | University College London |
Keywords: Deep Learning for Visual Perception, Computer Vision for Medical Robotics, Computer Vision for Automation
Abstract: Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state of the art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance, with an EPE of 2.64 px and a depth MAE of 2.55 mm, and the second-best Bad2 (41.49%) and Bad3 (26.99%) scores, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution (1280×1024) images, striking the optimal balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated by warping left images with the predicted disparity maps, against the actual right images, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
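The right-image synthesis check described at the end is a standard disparity-warping step. A minimal sketch, assuming rectified images and nearest-pixel sampling (real pipelines typically use bilinear sampling before SSIM/PSNR scoring):

```python
import numpy as np

def synthesize_right(left, disparity):
    """Warp a rectified left image into the right view using a predicted
    disparity map (for right pixel x, content comes from left pixel x + d)."""
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + disparity).astype(int), 0, w - 1)
    return left[ys, src_x]

left = np.random.rand(8, 16)
disp = np.full((8, 16), 2.0)                        # constant 2 px disparity
right_syn = synthesize_right(left, disp)
print(np.allclose(right_syn[:, :-2], left[:, 2:]))  # shifted content: True
```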
|
| |
| 09:00-10:30, Paper ThI1I.43 | Add to My Program |
| Using the Chebyshev Basis for Energy Optimal Motion Profiles in Path-Constrained Applications |
|
| Van Oosterwyck, Nick | University of Antwerp |
| De Laet, Robbe | University of Antwerp |
| Scalera, Lorenzo | University of Udine |
| Cuyt, Annie | University of Antwerp |
| Gasparetto, Alessandro | University of Udine |
| Derammelaere, Stijn | University of Antwerp, Faculty of Applied Engineering |
Keywords: Constrained Motion Planning, Motion and Path Planning, Task and Motion Planning
Abstract: Motion profile optimization is a powerful technique that reduces the energy consumption of robotic systems by changing the temporal profile of the joint position setpoints. However, despite the extensive exploration of these techniques for robotic systems following constrained paths, many existing methodologies rely on complex optimization processes or a large number of design parameters. This paper introduces a novel approach that leverages the Chebyshev basis to optimize the motion along a fixed geometric path, achieving a measured torque reduction of 15% while requiring only a limited number of design parameters. By employing the Chebyshev basis, the formulation leads to a smooth objective function and enables the definition of linear inequality constraints that accurately enclose the feasible design space. This unique combination of features not only simplifies the optimization problem but also enhances the probability of locating the global optimum, as illustrated in particular for the two-dimensional case. The methodology is established in a generic and model-independent manner, setting a promising direction for future research in motion profile optimization for constrained-path robotic systems.
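Representing the motion profile in the Chebyshev basis, so that the path coordinate and its derivatives follow from series evaluation and differentiation, can be sketched with numpy; the coefficients and total time below are illustrative, not an optimized profile.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Path coordinate s(tau) on tau in [-1, 1] as a short Chebyshev series,
# s(tau) = sum_k c_k T_k(tau); velocity comes from exact series
# differentiation plus the chain rule tau = 2t/T - 1.
T_total = 2.0                               # motion time [s], assumed
coeffs = np.array([0.5, 0.5, 0.0, -0.02])   # illustrative coefficients c_k

tau = np.linspace(-1.0, 1.0, 5)
s = C.chebval(tau, coeffs)                  # position profile s(tau)
ds = C.chebval(tau, C.chebder(coeffs))      # ds/dtau via series derivative
v = ds * (2.0 / T_total)                    # ds/dt = ds/dtau * dtau/dt
print(np.round(s, 3), np.round(v, 3))
```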
|
| |
| 09:00-10:30, Paper ThI1I.44 | Add to My Program |
| Autonomous Exploration with Terrestrial-Aerial Bimodal Vehicles |
|
| Gao, Yuman | Zhejiang University, Huzhou Institute of Zhejiang University |
| Zhang, Ruibin | Zhejiang University, Huzhou Institute of Zhejiang University |
| Lai, Tiancheng | Zhejiang University, Huzhou Institute of Zhejiang University |
| Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
| Xu, Chao | Zhejiang University, Huzhou Institute of Zhejiang University |
| Gao, Fei | Zhejiang University, Huzhou Institute of Zhejiang University |
Keywords: Aerial Systems: Applications, Task and Motion Planning, Aerial Systems: Perception and Autonomy
Abstract: Terrestrial-aerial bimodal vehicles, which integrate the high mobility of aerial robots with the long endurance of ground robots, offer significant potential for autonomous exploration. Given the inherent energy and time constraints in practical exploration tasks, we present a hierarchical framework for the bimodal vehicle to utilize its flexible locomotion modalities for exploration. Beginning with extracting environmental information to identify informative regions, we generate a set of potential bimodal viewpoints. To adaptively manage energy and time constraints, we introduce an extended Monte Carlo Tree Search approach that strategically optimizes both modality selection and viewpoint sequencing. Combined with an improved bimodal vehicle motion planner, we present a complete bimodal energy- and time-aware exploration system. Extensive simulations and deployment on a customized real-world platform demonstrate the effectiveness of our system.
|
| |
| 09:00-10:30, Paper ThI1I.45 | Add to My Program |
| MemClaw-RAG: Memory-Driven Navigation and Adaptive Locomotion for Wheeled-Legged Robots in Dynamic Environments |
|
| Li, Mingyi | Beijing Institute of Technology School |
| Zhang, Shubo | Beijing University of Posts and Telecommunications |
| Gao, Chunle | Beijing Institute of Technology School |
| Yang, Kaixin | Beijing Institute of Technology School |
| Li, Ying | Beijing Institute of Technology School |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Wheeled Robots
Abstract: Object-Goal Navigation in dynamic environments remains challenging because many existing approaches rely primarily on reactive mapping and lack the ability to retain historical experience or establish structured memory associations. To address this limitation, we introduce MemClaw-RAG, an embodied multimodal framework. MemClaw-RAG includes three main components: (1) a Memory Graph Retrieval (MGR) module that leverages multimodal knowledge graphs to support semantic association and target retrieval; (2) a SelfClaw cognitive module that manages skill scheduling and task execution through memory-aware reasoning; and (3) a Hybrid Adaptive Locomotion Policy (HALP) based on deep reinforcement learning that enables efficient locomotion for wheeled-legged robots across different terrain conditions. On Habitat benchmarks, MemClaw-RAG achieves a Success Rate (SR) of 0.81 and a Success-weighted Path Length (SPL) of 0.51 on the Gibson and HM3D datasets. In the more challenging multi-layer environments of MP3D, the proposed method achieves an SR of 0.76 and an SPL of 0.48, outperforming several representative memory-based and end-to-end navigation approaches. Real-world deployment on a Unitree wheeled-legged robot further demonstrates the practicality of the system, achieving an average per-step inference latency of 55 ms on a Jetson Orin platform while maintaining stable navigation behavior in dynamic indoor environments.
|
| |
| 09:00-10:30, Paper ThI1I.46 | Add to My Program |
| Memory-Maze: Scenario Driven Visual Language Navigation Benchmark for Guiding Blind People |
|
| Kuribayashi, Masaki | Waseda University, Miraikan - The National Museum of Emerging Science and Innovation |
| Uehara, Kohei | Miraikan |
| Wang, Allan | Miraikan |
| Sato, Daisuke | Carnegie Mellon University |
| Ribeiro, Renato Alexandre | Miraikan, The National Museum of Emerging Science and Innovation |
| Chu, Simon | Carnegie Mellon University |
| Morishima, Shigeo | Waseda University |
|
|
| |
| 09:00-10:30, Paper ThI1I.47 | Add to My Program |
| Dynamic Targeting of Satellite Observations Using Supplemental Geostationary Satellite Data and Hierarchical Planning |
|
| Kangaslahti, Akseli | Jet Propulsion Laboratory |
| Zilberstein, Itai | Jet Propulsion Laboratory |
| Candela, Alberto | Jet Propulsion Laboratory, California Institute of Technology |
| Chien, Steve | Jet Propulsion Laboratory |
Keywords: Space Robotics and Automation, Planning, Scheduling and Coordination
Abstract: The Dynamic Targeting (DT) mission concept is an approach to satellite observation in which a lookahead sensor gathers information about the upcoming environment and uses this information to intelligently plan observations. Previous work has shown that DT has the potential to increase the science return across several applications. However, DT mission concepts must address challenges such as the limited spatial extent of onboard lookahead data and instrument mobility, data throughput, and onboard computation constraints. In this work, we show how the performance of DT systems can be improved by using supplementary data streamed from geostationary satellites that provide lookahead information up to 35 minutes ahead of time rather than the 1 minute latency from an onboard lookahead sensor. While there is a greater volume of geostationary data, the search space for observation planning explodes exponentially with the size of the horizon. To address this, we introduce a hierarchical planning approach in which the geostationary data is used to plan a long-term observation blueprint in polynomial time, then the onboard lookahead data is leveraged to refine that plan over short-term horizons. We compare the performance of our approach to that of traditional DT planners relying on onboard lookahead data across four different problem instances: three cloud avoidance variations and a storm hunting scenario. We show that our hierarchical planner outperforms the traditional DT planners by up to 41% and examine the features of the scenarios that affect the performance of our approach. We demonstrate that incorporating geostationary satellite data is most effective for dynamic problem instances in which the targets of interest are sparsely distributed throughout the overflight.
|
| |
| 09:00-10:30, Paper ThI1I.48 | Add to My Program |
| PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting |
|
| Xu, Yihong | Valeo.ai |
| Yin, Yuan | Valeo.ai |
| Zablocki, Eloi | Valeo.ai |
| Vu, Tuan-Hung | Valeo.ai |
| Boulch, Alexandre | Valeo.ai |
| Cord, Matthieu | Sorbonne Université, Valeo.ai |
Keywords: Autonomous Vehicle Navigation, Computer Vision for Automation, Vision-Based Navigation
Abstract: Accurately predicting how agents move in dynamic scenes is essential for safe autonomous driving. State-of-the-art motion forecasting models rely on datasets with manually annotated or post-processed trajectories. However, building these datasets is costly, generally manual, hard to scale, and lacks reproducibility. They also introduce domain gaps that limit generalization across environments. We introduce PPT (Pretraining with Pseudo-labeled Trajectories), a simple and scalable pretraining framework that uses unprocessed and diverse trajectories automatically generated by off-the-shelf 3D detectors and trackers. Unlike data annotation pipelines aiming for clean, single-label annotations, PPT is a pretraining framework embracing off-the-shelf trajectories as useful signals for learning robust representations. With optional finetuning on a small amount of labeled data, models pretrained with PPT achieve strong performance across standard benchmarks, particularly in low-data regimes and in cross-domain, end-to-end, and multi-class settings. PPT is easy to implement and improves generalization in motion forecasting.
|
| |
| 09:00-10:30, Paper ThI1I.49 | Add to My Program |
| DroneKey++: A Size Prior-Free Method and New Benchmark for Drone 3D Pose Estimation from Sequential Images |
|
| Hwang, Seo-Bin | Chonnam National University |
| Cho, Yeong-Jun | Chonnam National University |
Keywords: Surveillance Robotic Systems, Deep Learning for Visual Perception, Data Sets for Robotic Vision
Abstract: Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34° and MedAE 17.1° for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is available at [link].
|
| |
| 09:00-10:30, Paper ThI1I.50 | Add to My Program |
| Adaptive Gain Nonlinear Observer for External Wrench Estimation in Human-UAV Physical Interaction |
|
| Naser, Hussein N. | Carleton University, University of Thi-Qar |
| Hashim, Hashim A. | Carleton University |
| Ahmadi, Mojtaba | Carleton University |
Keywords: Physical Human-Robot Interaction, Human-Robot Collaboration, Aerial Systems: Applications
Abstract: This paper presents an Adaptive Gain Nonlinear Observer (AGNO) for estimating the external interaction wrench (forces and torques) in human-UAV physical interaction for assistive payload transportation. The proposed AGNO uses the full nonlinear dynamic model to achieve an accurate and robust wrench estimation without relying on dedicated force-torque sensors. A key feature of this approach is the explicit consideration of the non-constant inertia matrix, which is essential for aerial systems with asymmetric mass distribution or shifting payloads. A comprehensive dynamic model of a cooperative transportation system composed of two quadrotors and a shared payload is derived, and the stability of the observer is rigorously established using Lyapunov-based analysis. Simulation results validate the effectiveness of the proposed observer in enabling intuitive and safe human-UAV interaction. Comparative evaluations demonstrate that the proposed AGNO outperforms an Extended Kalman Filter (EKF) in terms of estimation root mean square errors (RMSE), particularly for torque estimation under nonlinear interaction conditions. This approach reduces system weight and cost by eliminating additional sensing hardware, enhancing practical feasibility.
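While the paper's AGNO formulation is not reproduced here, the general shape of a momentum-based external force observer for the translational dynamics, with a stand-in innovation-scaled gain in place of the actual adaptive gain law, looks roughly as follows.

```python
import numpy as np

class MomentumWrenchObserver:
    """Momentum-based external force observer for multirotor translational
    dynamics, m * dv/dt = f_cmd + m*g + f_ext, assuming the vehicle starts
    at rest so the momentum integral can be initialized at zero. The
    innovation-scaled gain is a stand-in for the paper's adaptive gain law."""
    def __init__(self, mass: float, dt: float, k0: float = 5.0):
        self.m, self.dt, self.k0 = mass, dt, k0
        self.integral = np.zeros(3)   # integral of the modeled momentum rate
        self.f_hat = np.zeros(3)      # current external force estimate

    def update(self, v: np.ndarray, f_cmd: np.ndarray) -> np.ndarray:
        g = np.array([0.0, 0.0, -9.81])
        self.integral += (f_cmd + self.m * g + self.f_hat) * self.dt
        residual = self.m * v - self.integral                    # momentum innovation
        k = self.k0 * (1.0 + np.tanh(np.linalg.norm(residual)))  # adaptive gain
        self.f_hat = k * residual
        return self.f_hat

obs = MomentumWrenchObserver(mass=1.5, dt=0.002)
print(obs.update(v=np.array([0.0, 0.0, 0.01]), f_cmd=np.array([0.0, 0.0, 14.72])))
```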
|
| |
| 09:00-10:30, Paper ThI1I.51 | Add to My Program |
| Learning to Drift with Individual Wheel Drive: Maneuvering Autonomous Vehicle at the Handling Limits |
|
| Zhou, Yihan | Tsinghua University |
| Lu, Yiwen | Tsinghua University |
| Yang, Bo | Tsinghua University |
| Li, Jiayun | Tsinghua University |
| Mo, Yilin | Tsinghua University |
Keywords: Motion Control, Reinforcement Learning
Abstract: Drifting, characterized by controlled vehicle motion at high sideslip angles, is crucial for safely handling emergency scenarios at the friction limits. While recent reinforcement learning approaches show promise for drifting control, they struggle with the significant simulation-to-reality gap, as policies that perform well in simulation often fail when transferred to physical systems. In this paper, we present a reinforcement learning framework with GPU-accelerated parallel simulation and systematic domain randomization that effectively bridges the gap. The proposed approach is validated on both simulation and a custom-designed and open-sourced 1/10 scale Individual Wheel Drive (IWD) RC car platform featuring independent wheel speed control. Experiments across various scenarios from steady-state circular drifting to direction transitions and variable-curvature path following demonstrate that our approach achieves precise trajectory tracking while maintaining controlled sideslip angles throughout complex maneuvers in both simulated and real-world environments.
|
| |
| 09:00-10:30, Paper ThI1I.52 | Add to My Program |
| GP3: A 3D Geometry-Aware Policy with Multi-View Images for Robotic Manipulation |
|
| Qian, Quanhao | Alibaba Damo Academy |
| Zhao, Guoyang | Tongji University |
| Zhang, Gongjie | Alibaba Damo Academy |
| Wang, Jiuniu | Alibaba Damo Academy |
| Gao, Junlong | Alibaba Damo Academy |
| Zhao, Deli | Alibaba Damo Academy |
| Xu, Ran | Alibaba Damo Academy |
Keywords: Deep Learning for Visual Perception, Visual Servoing, Imitation Learning
Abstract: Effective robotic manipulation relies on a precise understanding of 3D scene geometry, and one of the most straightforward ways to acquire such geometry is through multi-view observations. Motivated by this, we present GP3—a 3D geometry-aware robotic manipulation policy that leverages multi-view input. GP3 employs a spatial encoder to infer dense spatial features from RGB observations, which enable the estimation of depth and camera parameters, leading to a compact yet expressive 3D scene representation tailored for manipulation. This representation is fused with language instructions and translated into continuous actions via a lightweight policy head. We further introduce G-FiLM, which applies language-conditioned FiLM only to cross-view global attention. Comprehensive experiments demonstrate that GP3 consistently outperforms state-of-the-art methods on simulated benchmarks. Furthermore, GP3 transfers effectively to real-world robots in depth-challenging scenes with only minimal fine-tuning. These results highlight GP3 as a practical, sensor-agnostic solution for geometry-aware robotic manipulation.
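FiLM conditioning, the mechanism behind G-FiLM, maps a language embedding to per-channel scale and shift parameters applied to visual features. A minimal PyTorch sketch with assumed dimensions (the paper applies it only to cross-view global attention):

```python
import torch
import torch.nn as nn

class LanguageFiLM(nn.Module):
    """Feature-wise linear modulation (FiLM): a language embedding is mapped
    to per-channel scale (gamma) and shift (beta) applied to visual features.
    Dimensions and placement here are illustrative assumptions."""
    def __init__(self, lang_dim: int, channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * channels)

    def forward(self, feats: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(lang).chunk(2, dim=-1)
        # Broadcast the (B, C) modulation over the spatial dims of (B, C, H, W).
        return gamma[..., None, None] * feats + beta[..., None, None]

film = LanguageFiLM(lang_dim=512, channels=256)
out = film(torch.randn(2, 256, 16, 16), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 256, 16, 16])
```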
|
| |
| 09:00-10:30, Paper ThI1I.54 | Add to My Program |
| TopoNav: Topological Graphs As a Key Enabler for Advanced Object Navigation |
|
| Liu, Peiran | Hong Kong University of Science and Technology (GuangZhou) |
| Zhang, Qiang | The Hong Kong University of Science and Technology (Guangzhou) |
| Peng, Daojie | The Hong Kong University of Science and Technology Guangzhou (HKUSTGZ) |
| Zhang, Lingfeng | The Hong Kong University of Science and Technology (Guangzhou) |
| Qin, Yihao | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhou, Hang | The Hong Kong University of Science and Technology (Guangzhou) |
| Ma, Jun | The Hong Kong University of Science and Technology |
| Xu, Renjing | The Hong Kong University of Science and Technology (Guangzhou) |
| Ji, Yiding | Hong Kong University of Science and Technology (Guangzhou) |
|
|
| |
| 09:00-10:30, Paper ThI1I.55 | Add to My Program |
| Generative Predictive Control: Flow Matching Policies for Dynamic, Difficult-To-Demonstrate Tasks |
|
| Kurtz, Vincent | DePaul University |
| Burdick, Joel | California Institute of Technology |
Keywords: Machine Learning for Robot Control, Simulation and Animation, Optimization and Optimal Control
Abstract: Generative control policies have recently unlocked major progress in robotics. These methods produce action sequences via diffusion or flow matching, with training data provided by demonstrations. But existing methods come with two key limitations: they require expert demonstrations, which can be difficult or costly to obtain, and they are limited to relatively slow, quasi-static tasks. In this paper, we leverage a tight connection between sampling-based predictive control and generative modeling to address these issues. In particular, we introduce generative predictive control, a supervised learning framework for tasks with fast dynamics that are easy to simulate but difficult to demonstrate. We show how trained flow-matching policies can be warm-started at inference time, maintaining temporal consistency and enabling high-frequency feedback. We believe that generative predictive control offers a complementary approach to existing behavior cloning methods, and hope that it will pave the way toward generalist policies that extend beyond quasi-static demonstration-oriented tasks.
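The warm-starting trick can be sketched generically: rather than integrating the learned flow from pure noise, re-noise the previous action plan to an intermediate time and integrate only the remaining interval. The snippet below assumes a trained velocity field with a `velocity_field(x, t)` signature and does not reproduce the paper's exact scheme.

```python
import torch

def warm_start_rollout(velocity_field, prev_actions, t0=0.6, steps=8):
    """Warm-started flow-matching inference: partially re-noise the previous
    action sequence to time t0 in [0, 1), then Euler-integrate the
    probability-flow ODE from t0 to 1 instead of starting at t=0."""
    noise = torch.randn_like(prev_actions)
    x = (1.0 - t0) * noise + t0 * prev_actions   # partially re-noised plan
    dt = (1.0 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + velocity_field(x, torch.full(x.shape[:1], t)) * dt
        t += dt
    return x

# Usage with a placeholder velocity field over a horizon of 16 2-D actions:
v = lambda x, t: -x                              # dummy stand-in model
plan = warm_start_rollout(v, prev_actions=torch.zeros(1, 16, 2))
print(plan.shape)  # torch.Size([1, 16, 2])
```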
|
| |
| 09:00-10:30, Paper ThI1I.56 | Add to My Program |
| Switchable Neural Teleoperation |
|
| Ye, Jianglong | UC San Diego |
| Jing, Changwei | UC San Diego |
| Chen, Kezhou | UC San Diego |
| Wang, Keyi | UC San Diego |
| Yi, Sha | UC San Diego |
| Zou, Xueyan | UC San Diego |
| Wang, Xiaolong | UC San Diego |
Keywords: Telerobotics and Teleoperation, Human-Robot Collaboration, Dexterous Manipulation
Abstract: Collecting demonstrations through human teleoperation is an effective approach for learning complex manipulation skills. However, challenges such as morphology gaps, control latency, and limited feedback make high-quality data collection costly and inefficient. In this paper, we introduce Neural Teleoperation, a shared-autonomy system that integrates human guidance with a robust grasping policy using a learning-based policy switcher. This hybrid framework allows users to focus on high-level planning while delegating fine-grained control to an autonomous policy when needed. Our system supports both immersive VR devices and lightweight 6-DoF controllers, making dexterous hand teleoperation more accessible. Real-world experiments across six manipulation tasks show that Neural Teleop increases success rates and reduces demonstration collection time compared to state-of-the-art baselines.
|
| |
| 09:00-10:30, Paper ThI1I.57 | Add to My Program |
| Design of a Single-Input, Five-Output Differential Actuation Unit for Underactuated Hands |
|
| Scuderoni, Hugo | Université De Toulon |
| Perini, Alessandro | University of Rome Tor Vergata |
| Russo, Matteo | University of Rome Tor Vergata |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Product Design, Development and Prototyping
Abstract: Robotic hands, prosthetics, and hand exoskeletons struggle to replicate the natural dexterity of human hands: the mechanical intelligence of our muscles can hardly be replicated with rigid actuators, while soft mechanisms compromise precision. Underactuated hand mechanisms represent a trade-off between these extremes. However, single-motor solutions, while robust and compact, generally actuate a maximum of four fingers or exhibit critical differences in force transmission between the fingers. Here, we propose a design for a single-input, five-output differential gearbox that delivers balanced transmission thanks to a unique asymmetrical layout. This feature enables adaptive grasp control through mechanical intelligence alone, providing the user with a reliable, safe, and lightweight solution for tendon-driven hand mechanisms. A preliminary 3D-printed prototype is presented to demonstrate the concept.
|
| |
| 09:00-10:30, Paper ThI1I.58 | Add to My Program |
| RoboMatch: A Unified Mobile-Manipulation Teleoperation Platform with Auto-Matching Network Architecture for Long-Horizon Tasks |
|
| Liu, Hanyu | Jiangnan University |
| Ma, Yunsheng | Jiangnan University |
| Huang, Jiaxin | Jiangnan University |
| Ren, Keqiang | Jiangnan University |
| Wen, Jiayi | Jiangnan University |
| Zheng, Yilin | Jiangnan University |
| Luan, Haoru | Jiangnan University |
| Wan, Baishu | Jiangnan University |
| Li, Pan | Jiangnan University |
| Hou, Jiejun | Jiangnan University |
| Wang, Zhihua | Jiangnan University |
| Song, Zhigong | Jiangnan University |
Keywords: AI-Enabled Robotics, Imitation Learning, Engineering for Robotic Systems
Abstract: This paper presents RoboMatch, a novel unified teleoperation platform for mobile manipulation with an auto-matching network architecture, designed to tackle long-horizon tasks in dynamic environments. Our system enhances teleoperation performance, data collection efficiency, task accuracy, and operational stability. The core of RoboMatch is a cockpit-style control interface that enables synchronous operation of the mobile base and dual arms, significantly improving control precision and data collection. Moreover, we introduce the Proprioceptive-Visual Enhanced Diffusion Policy (PVE-DP), which leverages Discrete Wavelet Transform (DWT) for multi-scale visual feature extraction and integrates high-precision IMUs at the end-effector to enrich proprioceptive feedback, substantially boosting fine manipulation performance. Furthermore, we propose an Auto-Matching Network (AMN) architecture that decomposes long-horizon tasks into logical sequences and dynamically assigns lightweight pre-trained models for distributed inference. Experimental results demonstrate that our approach improves data collection efficiency by over 20%, increases task success rates by 20–30% with PVE-DP, and enhances long-horizon inference performance by approximately 40% with AMN, offering a robust solution for complex manipulation tasks. Project website: https://robomatch.github.io
|
| |
| 09:00-10:30, Paper ThI1I.59 | Add to My Program |
| Online Pareto-Optimal Decision-Making for Complex Tasks Using Active Inference |
|
| Amorese, Peter | University of Colorado Boulder |
| Wakayama, Shohei | University of Colorado Boulder |
| Ahmed, Nisar | University of Colorado Boulder |
| Lahijanian, Morteza | University of Colorado Boulder |
Keywords: Formal Methods in Robotics and Automation, Learning and Adaptive Systems, Probability and Statistical Methods, Active Inference
Abstract: When a robot autonomously performs a complex task, it frequently must balance competing objectives while maintaining safety. This becomes more difficult in uncertain environments with stochastic outcomes. Enhancing transparency in the robot’s behavior and aligning with user preferences are also crucial. This paper introduces a novel framework for multi-objective reinforcement learning that ensures safe task execution, optimizes trade-offs between objectives, and adheres to user preferences. The framework has two main layers: a multi-objective task planner and a high-level selector. The planning layer generates a set of optimal trade-off plans that guarantee satisfaction of a temporal logic task. The selector uses active inference to decide which generated plan best complies with user preferences and aids learning. Operating iteratively, the framework updates a parameterized learning model based on collected data. Case studies and benchmarks on both manipulation and mobile robots show that our framework outperforms other methods and (i) learns multiple optimal trade-offs, (ii) adheres to a user preference, and (iii) allows the user to adjust the balance between (i) and (ii).
|
| |
| 09:00-10:30, Paper ThI1I.60 | Add to My Program |
| ASCENT: Transformer-Based Aircraft Trajectory Prediction in Non-Towered Terminal Airspace |
|
| Prutsch, Alexander | Graz University of Technology |
| Schinagl, David | Graz University of Technology |
| Possegger, Horst | Graz University of Technology |
Keywords: Aerial Systems: Perception and Autonomy, Intelligent Transportation Systems, Aerial Systems: Applications
Abstract: Accurate trajectory prediction can improve General Aviation safety in non-towered terminal airspace, where high traffic density increases accident risk. We present ASCENT, a lightweight transformer-based model for multimodal 3D aircraft trajectory forecasting, which integrates domain-aware 3D coordinate normalization and parameterized predictions. ASCENT employs a transformer-based motion encoder and a query-based decoder, enabling the generation of diverse maneuver hypotheses with low latency. Experiments on the TrajAir and TartanAviation datasets demonstrate that our model outperforms prior baselines, as the encoder effectively captures motion dynamics and the decoder aligns with structured aircraft traffic patterns. Furthermore, ablation studies confirm the contributions of the decoder design, coordinate-frame modeling, and parameterized outputs. These results establish ASCENT as an effective approach for real-time aircraft trajectory prediction in non-towered terminal airspace.
|
| |
| 09:00-10:30, Paper ThI1I.61 | Add to My Program |
| Quadrotor Navigation Using Reinforcement Learning with Privileged Information |
|
| Lee, Jonathan | Carnegie Mellon University |
| Rathod, Abhishek | Carnegie Mellon University |
| Goel, Kshitij | Carnegie Mellon University |
| Stecklein, John | Carnegie Mellon University |
| Tabib, Wennie | Carnegie Mellon University |
Keywords: Field Robots, Aerial Systems: Applications
Abstract: This paper presents a reinforcement learning-based quadrotor navigation method that leverages efficient differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles. Prior learning-based methods perform well in scenes that exhibit narrow obstacles, but struggle when the goal location is blocked by large walls or terrain. In contrast, the proposed method utilizes time-of-arrival (ToA) maps as privileged information and a yaw alignment loss to guide the robot around large obstacles. The policy is evaluated in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends. Our approach achieves an 86% success rate and outperforms baseline strategies by 34%. We deploy the policy onboard a custom quadrotor in outdoor cluttered environments both during the day and night. The policy is validated across 20 flights, covering 589 meters without collisions at speeds up to 4 m/s.
|
| |
| 09:00-10:30, Paper ThI1I.62 | Add to My Program |
| Prior-Constrained Explorative Guidance for Generalization in Diffusion Motion Planning |
|
| Kim, Sunhwi | Ulsan National Institute of Science and Technology |
| Kim, Junsu | Ulsan National Institute of Science and Technology |
| Baek, Seungjae | Ulsan National Institute of Science and Technology |
| Shin, Jaechan | Ulsan National Institute of Science and Technology |
| Lee, Jungeun | UNIST |
| Lee, Seongjae | Ulsan National Institute of Science and Technology (UNIST) |
| Joo, Kyungdon | UNIST |
| Jeon, Jeong hwan | Ulsan National Institute of Science and Technology |
Keywords: Motion and Path Planning, Integrated Planning and Learning, Deep Learning Methods
Abstract: Diffusion-based planners have achieved generalization comparable to classical planners by leveraging inference-time optimization through guidance. However, their limited ability to capture environmental variations often constrains their responsiveness in unseen settings. In addition, the diversity-consistency trade-off inherent in guidance has remained unresolved. In this work, we propose Prior-Constrained Explorative Guidance (PCEG), a novel approach that gathers environmental information through local exploration and prevents guided samples from converging prematurely to similar solutions by leveraging a trajectory prior. The collected information is included in the guidance via stochastic gradient estimation, while a succinct parameter scheduling strategy enables latent optimization driven by environmental signals without significant computational overhead. Furthermore, during the mode-seeking stages of the reverse diffusion process, we employ a Gaussian Process (GP) to enforce dynamics-informed priors, effectively constraining the exploration region of each sample and thereby enhancing solution diversity. Across diverse benchmarks including 7-degree-of-freedom (7-DoF) robot-arm manipulation, PCEG substantially improves the success rate by up to 30 percentage points compared to competitive diffusion planners without compromising trajectory quality, even in scenarios involving unseen obstacles. Real-world experiments further validate these findings, showcasing the generation of smooth, collision-free trajectories in novel environments. The project page is available at https://rml-unist.github.io/PCEG/.
|
| |
| 09:00-10:30, Paper ThI1I.63 | Add to My Program |
| DynaFlow: Dynamics-Embedded Flow Matching for Physically Consistent Motion Generation from State-Only Demonstrations |
|
| Lee, Sowoo | KAIST |
| Kang, Dongyun | Korea Advanced Institute of Science and Technology |
| Park, Jaehyun | Korea Advanced Institute of Science & Technology (KAIST) |
| Park, Hae-Won | Korea Advanced Institute of Science and Technology |
Keywords: Learning from Demonstration, Imitation Learning
Abstract: This paper introduces DynaFlow, a novel framework that embeds a differentiable simulator directly into a flow matching model. By generating trajectories in the action space and mapping them to dynamically feasible state trajectories via the simulator, DynaFlow ensures all outputs are physically consistent by construction. This end-to-end differentiable architecture enables training on state-only demonstrations, allowing the model to simultaneously generate physically consistent state trajectories while inferring the underlying action sequences required to produce them. We demonstrate the effectiveness of our approach through quantitative evaluations and showcase its real-world applicability by deploying the generated actions onto a physical Go1 quadruped robot. The robot successfully reproduces the diverse gaits present in the dataset, executes long-horizon motions in open-loop control, and translates infeasible kinematic demonstrations into dynamically executable, stylistic behaviors. These hardware experiments validate that DynaFlow produces deployable, highly effective motions on real-world hardware from state-only demonstrations, effectively bridging the gap between kinematic data and real-world execution.
|
| |
| 09:00-10:30, Paper ThI1I.64 | Add to My Program |
| PegasusFlow: Parallel Rolling-Denoising Score Sampling for Robot Diffusion Planner Flow Matching |
|
| Ye, Lei | Harbin Institute of Technology |
| Gao, Haibo | Harbin Institute of Technology |
| Xu, Peng | Harbin Institute of Technology |
| Zhang, Zhelin | Harbin Institute of Technology |
| Zhang, Wei | Harbin Institute of Technology |
| Shan, Junqi | Harbin Institute of Technology |
| Zhang, Ao | Harbin Institute of Technology |
| Zhou, Ruyi | Harbin Institute of Technology |
| Deng, Zongquan | Harbin Institute of Technology |
| Ding, Liang | Harbin Institute of Technology |
Keywords: Integrated Planning and Control, Machine Learning for Robot Control, Simulation and Animation
Abstract: Diffusion models offer powerful generative capabilities for robot trajectory planning, yet their practical deployment on robots is hindered by a critical bottleneck: reliance on imitation learning from expert demonstrations. This paradigm is often impractical for specialized robots where data is scarce and creates an inefficient, theoretically suboptimal training pipeline. To overcome this, we introduce PegasusFlow, a hierarchical rolling-denoising framework that enables direct and parallel sampling of trajectory score gradients from environmental interaction, completely bypassing the need for expert data. Our core innovation is a novel sampling algorithm, Weighted Basis Function Optimization (WBFO), which leverages spline basis representations to achieve superior sample efficiency and faster convergence compared to traditional methods like MPPI. The framework is embedded within a scalable, asynchronous parallel simulation architecture that supports massively parallel rollouts for efficient data collection. Extensive experiments on trajectory optimization and robotic navigation tasks demonstrate that our approach, particularly Action-Value WBFO (AVWBFO) combined with a reinforcement learning warm-start, significantly outperforms baselines. In a challenging barrier-crossing task, our method achieved a 100% success rate and was 18% faster than the next-best method, validating its effectiveness for complex terrain locomotion planning. https://masteryip.github.io/pegasusflow.github.io/
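Although WBFO itself is not reproduced here, its key ingredient, performing an MPPI-style softmax-weighted update on spline basis coefficients instead of raw waypoints, can be sketched as follows; the basis, weighting, and toy cost are illustrative assumptions.

```python
import numpy as np

def weighted_basis_update(coeffs, basis, cost_fn, n_samples=64,
                          sigma=0.3, lam=1.0, rng=np.random.default_rng(0)):
    """MPPI-style update on basis coefficients: perturb the coefficients,
    roll out the resulting trajectories, and take a softmax-weighted average
    of the perturbations. A generic stand-in for the WBFO idea; the paper's
    exact weighting and rolling-denoising loop are not reproduced."""
    eps = rng.normal(scale=sigma, size=(n_samples,) + coeffs.shape)
    trajs = (coeffs + eps) @ basis            # (n_samples, horizon)
    costs = np.array([cost_fn(tr) for tr in trajs])
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return coeffs + np.tensordot(w, eps, axes=1)

# Toy setup: 4 polynomial basis functions over a 20-step horizon.
s = np.linspace(0, 1, 20)
basis = np.stack([np.ones_like(s), s, s**2, s**3])   # (4, horizon)
coeffs = np.zeros(4)
for _ in range(30):
    coeffs = weighted_basis_update(coeffs, basis,
                                   cost_fn=lambda tr: np.sum((tr - 1.0) ** 2))
print(np.round(coeffs @ basis, 2))   # trajectory pulled toward the target 1.0
```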
|
| |
| 09:00-10:30, Paper ThI1I.65 | Add to My Program |
| CoBEVMoE: Heterogeneity-Aware Feature Fusion with Dynamic Mixture-Of-Experts for Collaborative Perception |
|
| Kong, Lingzhao | Hunan University |
| Lin, Jiacheng | Hunan University |
| Li, Siyu | Hunan University |
| Luo, Kai | Hunan University |
| Li, Zhiyong | HUNAN UNIVERSITY |
| Yang, Kailun | Hunan University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Multi-Robot Systems
Abstract: Collaborative perception aims to extend sensing coverage and improve perception accuracy by sharing information among multiple agents. However, due to differences in viewpoints and spatial positions, agents often acquire heterogeneous observations. Existing intermediate fusion methods primarily focus on aligning similar features, often overlooking the perceptual diversity among agents. To address this limitation, we propose CoBEVMoE, a novel collaborative perception framework that operates in the Bird’s Eye View (BEV) space and incorporates a Dynamic Mixture-of-Experts (DMoE) architecture. In DMoE, each expert is dynamically generated based on the input features of a specific agent, enabling it to extract distinctive and reliable cues while attending to shared semantics. This design allows the fusion process to explicitly model both feature similarity and heterogeneity across agents. Furthermore, we introduce a Dynamic Expert Metric Loss (DEML) to enhance inter-expert diversity and improve the discriminability of the fused representation. Extensive experiments on the OPV2V and DAIR-V2X-C datasets demonstrate that CoBEVMoE achieves state-of-the-art performance. Specifically, it improves the IoU for Camera-based BEV segmentation by +1.5% on OPV2V and the AP@0.5 for LiDAR-based 3D object detection by 3.0% on DAIR-V2X-C, verifying the effectiveness of expert-based heterogeneous feature modeling in multi-agent collaborative perception. The source code will be made available.
|
| |
| 09:00-10:30, Paper ThI1I.66 | Add to My Program |
| RGA-Net: A Vision Enhancement Framework for Robotic Surgical Systems Using Reciprocal Attention Mechanisms |
|
| Li, Quanjun | Guangdong University of Technology |
| Li, Weixuan | Guangdong University of Technology |
| Xia, Han | Guangdong University of Technology |
| Zhou, Junhua | Guangdong University of Technology |
| Pun, Chi Man | University of Macau |
| Xuhang, Chen | University of Macau |
Keywords: Deep Learning for Visual Perception, Visual Learning, Computer Vision for Automation
Abstract: Robotic surgical systems rely heavily on high-quality visual feedback for precise teleoperation; yet, surgical smoke from energy-based devices significantly degrades endoscopic video feeds, compromising the human-robot interface and surgical outcomes. This paper presents RGA-Net (Reciprocal Gating and Attention-fusion Network), a novel deep learning framework specifically designed for smoke removal in robotic surgery workflows. Our approach addresses the unique challenges of surgical smoke, including its dense, non-homogeneous distribution and complex light scattering, through a hierarchical encoder-decoder architecture featuring two key innovations: (1) a Dual-Stream Hybrid Attention (DHA) module that combines shifted window attention with frequency-domain processing to capture both local surgical details and global illumination changes, and (2) an Axis-Decomposed Attention (ADA) module that efficiently processes multi-scale features through factorized attention mechanisms. These components are connected via reciprocal cross-gating blocks that enable bidirectional feature modulation between encoder and decoder pathways. Extensive experiments on the DesmokeData and LSD3K surgical datasets demonstrate that RGA-Net achieves superior performance in restoring visual clarity suitable for robotic surgery integration. Our method enhances the surgeon-robot interface by providing consistently clear visualization, laying a technical foundation for alleviating surgeons' cognitive burden, optimizing operative workflows, and reducing iatrogenic injury risks in minimally invasive procedures. These practical benefits could be further validated through future clinical trials involving surgeon usability assessments. The proposed framework represents a significant step toward more reliable and safer robotic surgical systems through computational vision enhancement.
|
| |
| 09:00-10:30, Paper ThI1I.67 | Add to My Program |
| Collaborative Planning with Concurrent Synchronization for Operationally Constrained UAV-UGV Teams |
|
| Deng, Zihao | University of Massachusetts Amherst |
| Li, Qianhuang | University of Massachusetts Amherst |
| Gao, Peng | North Carolina State University |
| Wigness, Maggie | U.S. Army Research Laboratory |
| Rogers III, John G. | DEVCOM Army Research Laboratory |
| Kim, Donghyun | University of Massachusetts Amherst |
| Zhang, Hao | University of Massachusetts Amherst |
Keywords: Multi-Robot Systems
Abstract: Collaborative planning under operational constraints is an essential capability for heterogeneous robot teams tackling complex large-scale real-world tasks. Unmanned Aerial Vehicles (UAVs) offer rapid environmental coverage, but flight time is often limited by energy constraints, whereas Unmanned Ground Vehicles (UGVs) have greater energy capacity to support long-duration missions, but movement is constrained by traversable terrain. Individually, neither can complete tasks such as environmental monitoring. Effective UAV-UGV collaboration therefore requires energy-constrained multi-UAV task planning, traversability-constrained multi-UGV path planning, and crucially, synchronized concurrent co-planning to ensure timely in-mission recharging. To enable these capabilities, we propose Collaborative Planning with Concurrent Synchronization (CoPCS), a learning-based approach that integrates a heterogeneous graph transformer for operationally constrained task encoding with a transformer decoder for joint, synchronized co-planning that enables UAVs and UGVs to act concurrently in a coordinated manner. CoPCS is trained end-to-end under a unified imitation learning paradigm. We conducted extensive experiments to evaluate CoPCS in both robotic simulations and physical robot teams. Experimental results demonstrate that our method provides the novel multi-robot capability of synchronized concurrent co-planning and substantially improves team performance. More details of this work are available on the project website: https://hcrlab.gitlab.io/project/CoPCS.
|
| |
| 09:00-10:30, Paper ThI1I.68 | Add to My Program |
| Acoustic Sensing for Universal Jamming Grippers |
|
| Weber, Lion-Constantin | Technische Universität Berlin |
| Wienert, Theodor Marius | Technische Universität Berlin |
| Splettstößer, Martin | Technische Universität Berlin |
| Koenig, Alexander | Technische Universität Berlin |
| Brock, Oliver | Technische Universität Berlin |
Keywords: Soft Sensors and Actuators, Perception for Grasping and Manipulation, Force and Tactile Sensing
Abstract: Universal jamming grippers excel at grasping unknown objects due to their compliant bodies. Traditional tactile sensors can compromise this compliance, reducing grasping performance. We present acoustic sensing as a form of morphological sensing, where the gripper's soft body itself becomes the sensor. A speaker and microphone are placed inside the gripper cavity, away from the deformable membrane, fully preserving compliance. Sound propagates through the gripper and object, encoding object properties, which are then reconstructed via machine learning. Our sensor achieves high spatial resolution in sensing object size (2.6 mm error) and orientation (0.6 deg error), remains robust to external noise levels of 80 dBA, and discriminates object materials (up to 100% accuracy) and 16 everyday objects (85.6% accuracy). We validate the sensor in a realistic tactile object sorting task, achieving 53 minutes of uninterrupted grasping and sensing, confirming the preserved grasping performance. Finally, we demonstrate that disentangled acoustic representations can be learned, improving robustness to irrelevant acoustic variations.
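A generic version of the sensing pipeline, excite the cavity, reduce the recorded response to a band spectrum, and classify, can be sketched briefly; the excitation signal, features, and data below are stand-ins, not the paper's learned representation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def band_features(recording, n_bands=32):
    """Reduce a microphone recording of the excitation signal to a
    log-magnitude band spectrum; a generic stand-in for the paper's
    learned acoustic representation."""
    mag = np.abs(np.fft.rfft(recording))
    bands = np.array_split(mag, n_bands)
    return np.log1p([b.mean() for b in bands])

# Stand-in data: recordings whose resonance shifts with the grasped object.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000                     # 1 s at 16 kHz
X, y = [], []
for label, f0 in enumerate([400, 800]):          # two hypothetical classes
    for _ in range(40):
        sig = np.sin(2 * np.pi * (f0 + rng.normal(scale=20)) * t)
        X.append(band_features(sig + 0.1 * rng.normal(size=t.size)))
        y.append(label)
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```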
|
| |
| 09:00-10:30, Paper ThI1I.69 | Add to My Program |
| Spatio-Temporal Consistent Semantic Mapping for Robotics Fruit Growth Monitoring |
|
| Lobefaro, Luca | University of Bonn |
| Sodano, Matteo | Photogrammetry and Robotics Lab, University of Bonn |
| Fusaro, Daniel | Department of Information Engineering (DEI), University of Padova |
| Magistri, Federico | University of Bonn |
| Malladi, Meher Venkata Ramakrishna | University of Bonn |
| Guadagnino, Tiziano | University of Bonn |
| Pretto, Alberto | University of Padova |
| Stachniss, Cyrill | University of Bonn |
|
|
| |
| 09:00-10:30, Paper ThI1I.70 | Add to My Program |
| HOGraspFlow: Taxonomy-Aware Hand-Object Retargeting for Multi-Modal SE(3) Grasp Generation |
|
| Shi, Yitian | Karlsruhe Institute of Technology |
| Guo, Zicheng | Karlsruher Institut Für Technologie |
| Wolf, Rosa Petra | Karlsruhe Institute of Technology |
| Welte, Edgar | Karlsruhe Institute of Technology (KIT) |
| Rayyes, Rania | Karlsruhe Institute for Technology (KIT) |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Grasping
Abstract: We introduce Hand-Object (HO)GraspFlow, an affordance-centric approach that retargets a single RGB image with hand-object interaction (HOI) into multi-modal executable parallel-jaw grasps without explicit geometric priors on target objects. Building on existing learning-based hand reconstruction and a vision foundation model, we synthesize SE(3) grasp poses with denoising flow matching (FM), conditioned on three complementary cues: RGB foundation features as visual semantics, HOI contact reconstruction, and a taxonomy-aware prior on grasp types. Our approach demonstrates high fidelity in grasp synthesis without explicit HOI contact input or object geometry, while maintaining strong contact and taxonomy recognition. A controlled comparison shows that HOGraspFlow consistently outperforms diffusion-based variants (HOGraspDiff), achieving high distributional fidelity and more stable optimization in SE(3). We demonstrate reliable, object-agnostic grasp synthesis from human demonstrations in real-world experiments, where an average success rate of over 83% is achieved. Code: https://github.com/YitianShi/HOGraspFlow
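As a rough illustration of denoising flow matching for pose generation, the sketch below uses a rectified-flow-style objective on a Euclidean stand-in for SE(3); the paper's actual method operates on SE(3) with HOI-derived conditioning. All network shapes, dimensions, and names are assumptions.

```python
# Minimal conditional flow-matching sketch (Euclidean stand-in for the paper's
# SE(3) formulation; a real implementation would parameterize rotations on the
# Lie algebra). The network, conditioning vector, and dimensions are assumptions.
import torch
import torch.nn as nn

pose_dim, cond_dim = 9, 128   # e.g. translation + 6D rotation; HOI/semantic cues
vel_net = nn.Sequential(
    nn.Linear(pose_dim + cond_dim + 1, 256), nn.SiLU(),
    nn.Linear(256, pose_dim),
)

def fm_loss(x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Rectified-flow style loss: regress the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], 1)                  # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                      # linear interpolant
    v_target = x1 - x0
    v_pred = vel_net(torch.cat([xt, cond, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

grasps = torch.randn(32, pose_dim)                  # stand-in grasp poses
cues = torch.randn(32, cond_dim)                    # stand-in fused conditioning
print("FM loss:", fm_loss(grasps, cues).item())
```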
|
| |
| 09:00-10:30, Paper ThI1I.71 | Add to My Program |
| Planning-Guided Diffusion Policy Learning for Contact-Rich Bimanual Object Reorientation |
|
| Li, Xuanlin | University of California San Diego |
| Zhao, Tong | Massachusetts Institute of Technology |
| Ai, Bo | University of California San Diego |
| Zhu, Xinghao | University of California, Berkeley |
| Wang, Jiuguang | Boston Dynamics AI Institute |
| Pang, Tao | Boston Dynamics AI Institute |
| Fang, Kuan | Cornell University |
Keywords: Bimanual Manipulation, Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Contact-rich bimanual manipulation involves precise coordination of two arms to change object states through strategically selected contacts and motions. Due to the inherent complexity of these tasks, acquiring sufficient demonstration data and training policies that generalize to unseen scenarios remains a largely unresolved challenge. Building on recent advances in planning through contacts, we introduce Planning-Guided Diffusion Policy Learning (LIDE), an approach that effectively learns to solve contact-rich bimanual manipulation tasks by leveraging model-based motion planners to generate demonstration data in high-fidelity physics simulation. Through efficient planning in randomized environments, our approach generates large-scale and high-quality synthetic motion trajectories for tasks involving diverse objects and transformations. We then train a task-conditioned diffusion policy via behavior cloning using these demonstrations. To reduce the sim-to-real gap, we propose a set of designs in feature extraction, action prediction, and data augmentation that enable learning robust prediction of smooth action sequences and generalization to unseen scenarios. Through experiments in both simulation and the real world, we demonstrate that our approach can enable a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties.
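A minimal sketch of the training step such planning-guided policy learning implies: behavior cloning of planner-generated action chunks with a DDPM-style noise-regression loss. The denoiser, horizon, and conditioning format below are assumptions, not the paper's architecture.

```python
# Sketch of the behavior-cloning step for a task-conditioned diffusion policy
# trained on planner-generated trajectories (DDPM-style noise regression over
# action chunks). Network shapes and the conditioning format are assumptions.
import torch
import torch.nn as nn

horizon, act_dim, obs_dim = 16, 14, 64    # bimanual action chunk + observation
denoiser = nn.Sequential(
    nn.Linear(horizon * act_dim + obs_dim + 1, 512), nn.Mish(),
    nn.Linear(512, horizon * act_dim),
)
T = 100
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1 - betas, dim=0)

def ddpm_loss(actions: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
    """Predict the noise added to a planner demonstration's action chunk."""
    b = actions.shape[0]
    a = actions.reshape(b, -1)
    t = torch.randint(0, T, (b,))
    ab = alphas_bar[t].unsqueeze(-1)
    eps = torch.randn_like(a)
    noisy = ab.sqrt() * a + (1 - ab).sqrt() * eps
    inp = torch.cat([noisy, obs, t.float().unsqueeze(-1) / T], dim=-1)
    return ((denoiser(inp) - eps) ** 2).mean()

demo_actions = torch.randn(8, horizon, act_dim)   # from the model-based planner
demo_obs = torch.randn(8, obs_dim)
print("BC diffusion loss:", ddpm_loss(demo_actions, demo_obs).item())
```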
|
| |
| 09:00-10:30, Paper ThI1I.72 | Add to My Program |
| Not Throwing Away My Shot: Planning Ahead with Dual Subgoals in Long-Horizon Robot Manipulation Tasks |
|
| Chen, Longrui | University of Leeds |
| Huang, Yanlong | University of Leeds |
| Dogar, Mehmet R | University of Leeds |
Keywords: Imitation Learning, Manipulation Planning, Deep Learning in Grasping and Manipulation
Abstract: Policy learning often encounters difficulties in long-horizon tasks. Subgoal-conditioned policies address long-horizon problems by decomposing them into manageable segments, but they usually struggle with identifying informative subgoals. To address this limitation, we propose PDS (planning with dual subgoals), an architecture that learns short-horizon and low-variance subgoals in embedding space, ensuring that the planned subgoals are both reachable and consistent. We begin by analyzing the impact of horizon and consistency on the performance of subgoal-conditioned policies. We evaluate the performance of commonly used subgoal definitions (time-based, visual-based, and language-based) in tasks of different lengths. Subsequently, we demonstrate that our approach, which predicts and conditions on dual subgoals, improves success rates and enhances stability across diverse tasks in simulation and the real world.
|
| |
| 09:00-10:30, Paper ThI1I.73 | Add to My Program |
| GraspGen: A Diffusion-Based Framework for 6-DOF Grasping with On-Generator Training |
|
| Murali, Adithyavairavan | NVIDIA |
| Sundaralingam, Balakumar | NVIDIA |
| Chao, Yu-Wei | NVIDIA |
| Yuan, Wentao | University of Washington, NVIDIA |
| Yamada, Jun | University of Oxford, NVIDIA |
| Carlson, Mark | NVIDIA |
| Ramos, Fabio | University of Sydney, NVIDIA |
| Birchfield, Stan | NVIDIA |
| Fox, Dieter | University of Washington, NVIDIA |
| Eppner, Clemens | N/A |
Keywords: Grasping, Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success in modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a Diffusion-Transformer architecture that enhances grasp generation, paired with an efficient discriminator to score and filter sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen to both objects and grippers, we release a new simulated dataset consisting of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulations with singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench benchmark for grasping in clutter, and performs well on a real robot with noisy visual observations.
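The "on-generator" recipe can be pictured as follows: the discriminator is trained on grasps drawn from the generator itself, labeled by an oracle (e.g., simulation), so it learns to down-rank the generator's own failure modes. Everything below, including the toy oracle rule, is an illustrative assumption rather than the paper's pipeline.

```python
# Sketch of "on-generator" discriminator training as described: score grasps
# sampled from the generator itself (rather than only dataset grasps), so the
# discriminator sees the generator's own error modes. Labeling via a simulator
# oracle and all module names here are assumptions for illustration.
import torch
import torch.nn as nn

def sample_from_generator(n: int) -> torch.Tensor:
    """Stand-in for drawing n 6-DOF grasps from the diffusion generator."""
    return torch.randn(n, 7)            # translation (3) + quaternion (4)

def oracle_success(grasps: torch.Tensor) -> torch.Tensor:
    """Stand-in success labels, e.g. from physics-sim grasp evaluation."""
    return (grasps[:, 2] > 0).float()   # toy rule: 'approach from above'

disc = nn.Sequential(nn.Linear(7, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    g = sample_from_generator(256)      # on-generator samples, not dataset ones
    loss = bce(disc(g).squeeze(-1), oracle_success(g))
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference: generate many grasps, keep the top-scoring ones.
candidates = sample_from_generator(1024)
best = candidates[disc(candidates).squeeze(-1).topk(10).indices]
print("kept", best.shape[0], "grasps")
```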
|
| |
| 09:00-10:30, Paper ThI1I.74 | Add to My Program |
| CareBot-H: Enhancing Patient Transfer with Biomimetic Design and Trajectory Deformation Algorithm |
|
| Zhu, Deliang | Fudan University |
| Li, Peizheng | Hebei University of Technology |
| Meng, Chunlei | Fudan University |
| Xie, Jiexin | Guilin University of Electronic Technology |
| Li, Yang | Hebei University of Technology |
| Guo, Shijie | Hebei University of Technology |
Keywords: Safety in HRI, Touch in HRI, Physical Human-Robot Interaction
Abstract: This paper introduces the CareBot-H Robot, a humanoid nursing robot designed to perform patient transfer tasks in confined environments. The robot is equipped with biomimetic arms that replicate human arm size and function, and distributed tactile sensors that enhance operational safety during physical contact. To achieve stable and anthropomorphic motion, a trajectory deformation algorithm is proposed. The method comprises an offline phase, where expert demonstrations are encoded into prior trajectories using a Variational Autoencoder (VAE), and an online phase, where a tactile-informed Zero-Moment Point (ZMP) model enables real-time trajectory adjustment. Experimental validation with human participants demonstrates that the proposed approach outperforms manual teleoperation, producing smoother and more efficient transfer trajectories while significantly reducing deviations between actual and ideal ZMP. These results indicate that the CareBot-H achieves reliable and safe patient transfer performance, offering practical potential for deployment in real-world nursing care scenarios.
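To make the online phase concrete, here is a toy sketch of ZMP-driven trajectory deformation: nudge the VAE-decoded prior posture along a direction that reduces the ZMP error. The Jacobian, gain, and dimensions are stand-ins; the paper's tactile-informed ZMP model is considerably more involved.

```python
# Toy sketch of online trajectory deformation driven by a ZMP error, in the
# spirit of the paper's online phase: a VAE-decoded prior trajectory is nudged
# so the (tactile-informed) ZMP stays near a reference. Gains, dimensions, and
# the ZMP model are illustrative assumptions.
import numpy as np

def deform_step(q_prior: np.ndarray, zmp_meas: np.ndarray,
                zmp_ref: np.ndarray, jac_zmp: np.ndarray,
                gain: float = 0.2) -> np.ndarray:
    """One real-time correction: move joints along the direction that reduces
    ZMP error, using a (numerically estimated) ZMP-to-joint Jacobian."""
    err = zmp_ref - zmp_meas                       # 2-D ZMP error on the floor
    dq = gain * np.linalg.pinv(jac_zmp) @ err      # least-squares correction
    return q_prior + dq

q = np.zeros(7)                                    # prior posture from the VAE
J = np.random.default_rng(1).normal(size=(2, 7))   # stand-in ZMP Jacobian
q = deform_step(q, zmp_meas=np.array([0.05, -0.02]), zmp_ref=np.zeros(2), jac_zmp=J)
print("corrected posture:", np.round(q, 3))
```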
|
| |
| 09:00-10:30, Paper ThI1I.75 | Add to My Program |
| A Centerline-Aligned Frenet Graph Framework for Surface-Based Path Planning in Pipeline Environments |
|
| Liu, Hao | Lab for High Technology |
| Liu, Gang | Tsinghua University |
| Qin, Chuan | Institute for AI Industry Research |
| Wang, Yu | Tsinghua University |
Keywords: Motion and Path Planning, Climbing Robots, Nonholonomic Motion Planning
Abstract: Pipeline inspection is essential for maintaining the safety of critical infrastructure, but manual inspection is dangerous and inefficient, and existing robotic solutions struggle to handle curved and constrained surfaces. Traditional planning methods are either computationally expensive or prone to redundancy and discretization artifacts. To address these challenges, this paper proposes a centerline-aligned Frenet graph framework for surface-based path planning in pipeline environments. By embedding the pipeline surface into a structured two-dimensional manifold passing through the pipeline's central axis, the framework enables efficient heuristic search while maintaining geometric consistency. By combining quadratic programming with kinematic limits, an initial geodesic-constrained path is generated and optimized, resulting in a smooth and executable trajectory. Extensive experiments on pipelines with sharp bends, intersections, and real-world pipeline environments demonstrate significant improvements in computational efficiency, path quality, and robustness compared to traditional methods.
|
| |
| 09:00-10:30, Paper ThI1I.76 | Add to My Program |
| Omnidirectional Solid-State mmWave Radar Perception for UAV Power Line Collision Avoidance |
|
| Malle, Nicolaj | University of Southern Denmark |
| Ebeid, Emad | University of Southern Denmark |
Keywords: Aerial Systems: Perception and Autonomy, Omnidirectional Vision, Collision Avoidance
Abstract: Detecting and estimating distances to power lines is a challenge for both human UAV pilots and autonomous systems, which increases the risk of unintended collisions. We present a mmWave radar–based perception system that provides spherical sensing coverage around a small UAV for robust power line detection and avoidance. The system integrates multiple compact solid-state mmWave radar modules to synthesize an omnidirectional field of view while remaining lightweight. We characterize the sensing behavior of this omnidirectional radar arrangement in power line environments and develop a robust detection-and-avoidance algorithm tailored to that behavior. Field experiments on real power lines demonstrate reliable detection at ranges up to 10 m, successful avoidance maneuvers at flight speeds upwards of 10 m/s, and detection of wires as thin as 1.2 mm in diameter. These results indicate the approach’s suitability as an additional safety layer for both autonomous and manual UAV flight.
|
| |
| 09:00-10:30, Paper ThI1I.77 | Add to My Program |
| PSKDNet: Position-Supervised Keypoints Diffusion Network for Online Vectorized HD Map Construction |
|
| Jiang, Mingkun | Hefei Institute of Physical Sciences, Chinese Academy of Sciences;University of Science and Technology of China |
| Dong, Jun | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
| He, JunMing | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
| Hou, Guangyu | University of Science and Technology of China |
| Ma, Fan | Hefei Institute of Physical Sciences, Chinese Academy of Sciences |
| Wu, Shuang | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
| Zhang, Yujing | Hefei Institute of Physical Sciences, Chinese Academy of Sciences |
|
|
| |
| 09:00-10:30, Paper ThI1I.78 | Add to My Program |
| Demonstration-Augmented Deep Reinforcement Learning with Mixed Reality Human-In-The-Loop Guidance |
|
| Matour, Mohammad-Ehsan | Hochschule Mittweida, University of Applied Sciences |
| Winkler, Alexander | Hochschule Mittweida, University of Applied Sciences |
Keywords: Agent-Based Systems, AI-Based Methods, Human Factors and Human-in-the-Loop
Abstract: The integration of human expertise into reinforcement learning has gained increasing attention as a means to improve sample efficiency and stability. Current approaches often depend on pre-collected expert demonstrations or virtual reality setups, which are costly to generate and difficult to adapt to dynamic training conditions. In this work, a framework is introduced that augments deep reinforcement learning with real-time demonstrations provided through mixed reality interaction. A structured robotic pick-and-place task serves as the benchmark, where a robot must execute sequential phases of grasping, transporting, and releasing an object. Expert guidance is delivered via mixed reality annotations, which are converted into reference trajectories and injected into the learning process whenever performance falls below a predefined threshold. A modified replay buffer accommodates both agent-generated and expert-generated transitions, allowing controlled sampling with a dynamically adjusted expert-to-agent ratio. Training in the real workspace through mixed reality reduces the simulation-to-reality gap considerably, as confirmed by experiments on a physical robot platform. Experimental evaluation demonstrates that the proposed framework accelerates policy convergence, ensures stability under noisy feedback, and achieves strong generalization to unseen task configurations. These findings highlight the potential of demonstration-augmented reinforcement learning through mixed reality as a data-efficient and robust approach to robot training in real-world scenarios.
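A minimal sketch of the modified replay buffer described above, assuming a simple threshold rule for the dynamically adjusted expert-to-agent sampling ratio (the actual adjustment policy and all constants here are assumptions, not the paper's values).

```python
# Sketch of the modified replay buffer idea: keep agent and expert transitions
# in separate pools and sample each minibatch with a dynamically adjusted
# expert-to-agent ratio (raised when recent performance drops below a
# threshold). The adjustment rule and all constants are assumptions.
import random
from collections import deque

class MixedReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.agent = deque(maxlen=capacity)
        self.expert = deque(maxlen=capacity)
        self.expert_ratio = 0.25            # fraction of each batch from experts

    def adapt(self, recent_return: float, threshold: float):
        """Lean harder on MR demonstrations when the agent under-performs."""
        self.expert_ratio = 0.5 if recent_return < threshold else 0.1

    def sample(self, batch_size: int):
        n_exp = min(int(batch_size * self.expert_ratio), len(self.expert))
        batch = random.sample(list(self.expert), n_exp) if n_exp else []
        batch += random.sample(list(self.agent), batch_size - n_exp)
        return batch

buf = MixedReplayBuffer()
buf.agent.extend(("agent", i) for i in range(1000))
buf.expert.extend(("expert", i) for i in range(100))   # MR-annotated trajectories
buf.adapt(recent_return=12.0, threshold=50.0)          # poor return -> more expert
print(sum(1 for tag, _ in buf.sample(64) if tag == "expert"), "expert samples / 64")
```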
|
| |
| 09:00-10:30, Paper ThI1I.79 | Add to My Program |
| Segment-To-Act: Label-Noise-Robust Action-Prompted Video Segmentation towards Embodied Intelligence |
|
| Li, Wenxin | Hunan University |
| Peng, Kunyu | Karlsruhe Institute of Technology |
| Wen, Di | Karlsruhe Institute of Technology |
| Liu, Ruiping | Karlsruhe Institute of Technology |
| Duan, Mengfei | Hunan University |
| Luo, Kai | Hunan University |
| Yang, Kailun | Hunan University |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Semantic Scene Understanding
Abstract: Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build ActiSeg-NL, the first benchmark for action-based video object segmentation under label noise, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. These results establish a clear sensitivity profile of action-based video object segmentation to imperfect annotations and set a benchmark for studying noise-robust learning in embodied perception.
|
| |
| 09:00-10:30, Paper ThI1I.80 | Add to My Program |
| A Lightweight Hip Exoskeleton with High Torque-To-Mass Ratio: Design, Gait-Synchronized Control, and Physiological Validation |
|
| Pai, Tzy-Qian | National Cheng Kung University |
| Lin, Shi Mou | National Cheng Kung University |
| Lan, Chao-Chieh | National Taiwan University |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Physical Human-Robot Interaction
Abstract: This paper presents the design, control, and experimental validation of a lightweight hip exoskeleton for walking assistance. By integrating quasi-direct drive actuators, single-piece stainless steel frames, and passive revolute joints, the device achieves a high torque-to-mass ratio while maintaining a compact and lightweight structure. A delayed output feedback control strategy synchronizes assistive torque with the gait cycle by actively leading the wearer's hip motion, with user studies identifying a consistent optimal phase difference across participants and walking speeds, eliminating repeated calibration. Surface electromyography validates the assistance, demonstrating substantial reductions in activation of the vastus medialis and vastus lateralis at the optimal time delay. Power analysis further confirms that this setting maximizes positive power transfer while minimizing resistive effects. The proposed exoskeleton delivers physiologically meaningful and energetically efficient hip assistance suitable for everyday mobility support.
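The delayed-output-feedback idea can be sketched in a few lines: command torque proportional to the hip angle measured a fixed delay earlier, which for periodic gait is equivalent to leading the current motion. The gain, delay, and rates below are illustrative assumptions, not the paper's identified values.

```python
# Minimal sketch of delayed output feedback for gait-synchronized assistance:
# command hip torque proportional to the hip angle sampled DELAY_S seconds ago,
# so for periodic gait the device effectively leads the current motion.
# Gain and delay values are illustrative assumptions.
import numpy as np

FS = 200                      # controller rate (Hz), assumed
DELAY_S = 0.8                 # time delay (s), assumed
K = 8.0                       # assistive gain (Nm/rad), assumed

buffer = np.zeros(int(DELAY_S * FS))   # ring buffer of past hip angles

def assist_torque(hip_angle: float, k: int) -> float:
    """Return torque from the angle sampled DELAY_S seconds earlier."""
    idx = k % len(buffer)
    delayed = buffer[idx]              # oldest sample at this slot
    buffer[idx] = hip_angle            # overwrite with the newest sample
    return K * delayed

t = np.arange(0, 2, 1 / FS)
for k, q in enumerate(0.4 * np.sin(2 * np.pi * 1.0 * t)):   # ~1 Hz gait
    tau = assist_torque(q, k)
print("last commanded torque (Nm):", round(tau, 3))
```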
|
| |
| 09:00-10:30, Paper ThI1I.81 | Add to My Program |
| Nighttime Autonomous Driving Scene Reconstruction with Physically-Based Gaussian Splatting |
|
| Kim, Tae-Kyeong | University of Toronto, Noah's Ark Lab |
| Chen, Xingxin | Huawei Technologies, Nanjing University, University of Waterloo |
| Wu, Guile | Huawei Noah's Ark Lab |
| Huang, Chengjie | University of Waterloo |
| Bai, Dongfeng | Noah's Ark Lab, Huawei Technologies |
| Liu, Bingbing | Huawei Technologies |
Keywords: Simulation and Animation, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: This paper focuses on scene reconstruction under nighttime conditions in autonomous driving simulation. Recent methods based on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved photorealistic modeling in autonomous driving scene reconstruction, but they primarily focus on normal-light conditions. Low-light driving scenes are more challenging to model due to their complex lighting and appearance conditions, which often cause the performance of existing methods to degrade. To address this problem, this work presents a novel approach that integrates physically based rendering into 3DGS to enhance nighttime scene reconstruction for autonomous driving. Specifically, our approach integrates physically based rendering into composite scene Gaussian representations and jointly optimizes Bidirectional Reflectance Distribution Function (BRDF) based material properties. We explicitly model diffuse components through a global illumination module and specular components by anisotropic spherical Gaussians. As a result, our approach improves reconstruction quality for outdoor nighttime driving scenes, while maintaining real-time rendering. Extensive experiments across diverse nighttime scenarios on two real-world autonomous driving datasets, including nuScenes and Waymo, demonstrate that our approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
|
| |
| 09:00-10:30, Paper ThI1I.82 | Add to My Program |
| Towards Distributed Robotic Casualty Assessment Using Multimodal, Non-Contact Perception and Probabilistic Inference |
|
| Bortoff, Zachary | University of Maryland |
| Poojari, Srijal Shekhar | University of Maryland |
| Baxevani, Kleio | University of Maryland |
| Gaus, Joshua | University of Maryland |
| Titus, Christopher | University of Maryland |
| Ashry, Ahmed | University of Maryland |
| Paley, Derek | University of Maryland |
Keywords: Field Robots, Search and Rescue Robots
Abstract: Mass-casualty incidents demand rapid and accurate triage, but the scale and acuity of injuries often overwhelm available medical personnel. To address this, we present a system that enables ground and aerial robots to localize and assess casualties using non-contact sensors, including color and thermal cameras, millimeter wave radar, and microphones. Injury and vital sign measurements from modality-specific classifiers are fused using a probabilistic model that captures correlations between injury states and supports distributed, asynchronous evidence accumulation. We validate the system through a series of timed mass-casualty field experiments using custom-built drones and Boston Dynamics Spot ground robots customized for robotic medical triage, demonstrating reliable estimation of casualty states and robustness to noisy conditions and sensor dropout.
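As a simplified picture of distributed, asynchronous evidence accumulation, the sketch below fuses modality-specific classifier outputs into a single log-odds belief for a binary injury state; the paper's probabilistic model additionally captures correlations between injury states. All likelihood values are made up for illustration.

```python
# Sketch of asynchronous evidence fusion for a casualty state: each
# modality-specific classifier contributes a log-likelihood-ratio update to a
# running belief, so robots can fuse detections whenever they arrive. The
# binary-state simplification and the numbers are assumptions.
import math

class BeliefFusion:
    def __init__(self, prior: float = 0.1):
        self.logit = math.log(prior / (1 - prior))   # log-odds of 'injured'

    def update(self, p_obs_given_injured: float, p_obs_given_healthy: float):
        """Asynchronous Bayesian update from one sensor's classifier output."""
        self.logit += math.log(p_obs_given_injured / p_obs_given_healthy)

    @property
    def belief(self) -> float:
        return 1 / (1 + math.exp(-self.logit))

b = BeliefFusion()
b.update(0.8, 0.2)    # thermal camera: elevated likelihood of injury
b.update(0.6, 0.5)    # radar vital signs: weakly informative
b.update(0.9, 0.1)    # audio distress classifier: strong evidence
print(f"P(injured | all evidence) = {b.belief:.3f}")
```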
|
| |
| 09:00-10:30, Paper ThI1I.83 | Add to My Program |
| The Case of Metadata Leakage in ROS 2: Fingerprintability, Security Implications, and Internet-Wide Vulnerability Measurements |
|
| Alshammari, Fayzah | University of California, Irvine |
| Der, Sam | University of California, Irvine |
| Chen, Qi Alfred | University of California, Irvine |
Keywords: Networked Robots, Software, Middleware and Programming Environments
Abstract: The Robot Operating System (ROS) is widely adopted in the robotics community, powering applications from self-driving vehicles to industrial automation. ROS 2 utilizes the Data Distribution Service (DDS) middleware for decentralized communication, making it inherently susceptible to reconnaissance and exploitation attacks. Previous research has examined the security implications of DDS implementations but has not systematically distinguished ROS 2 nodes from standalone DDS deployments, a critical distinction that significantly influences the execution and outcome of cyberattacks. This paper presents the first systematic fingerprinting framework designed specifically for ROS 2, demonstrating how DDS-based metadata leakage can facilitate precise identification and targeted exploitation of robotic systems. Through controlled experiments and an Internet-wide scan of DDS deployments, we identify extensive metadata exposure across actively supported ROS 2 implementations. Despite existing security solutions such as Secure ROS 2 (SROS2), deployments using default configurations remain vulnerable, highlighting the need for enhanced metadata obfuscation, stricter network access policies, and deployment of real-time anomaly detection mechanisms to strengthen the security posture of ROS 2 systems.
|
| |
| 09:00-10:30, Paper ThI1I.84 | Add to My Program |
| Event-LAB: Towards Standardized Evaluation of Neuromorphic Localization Methods |
|
| Hines, Adam D. | Queensland University of Technology |
| Fontan, Alejandro | Queensland University of Technology |
| Milford, Michael J | Queensland University of Technology |
| Fischer, Tobias | Queensland University of Technology |
Keywords: Localization, Software Tools for Benchmarking and Reproducibility, SLAM
Abstract: Event-based localization research and datasets are a rapidly growing area of interest, with a tenfold increase in the cumulative total number of published papers on this topic over the past 10 years. Whilst the rapid expansion in the field is exciting, it brings with it an associated challenge: a growth in the variety of required code and package dependencies as well as data formats, making comparisons difficult and cumbersome for researchers to implement reliably. To address this challenge, we present Event-LAB: a new and unified framework for running several event-based localization methodologies across multiple datasets. Event-LAB is implemented using the Pixi package and dependency manager, which enables a single command-line installation and invocation for combinations of localization methods and datasets. To demonstrate the capabilities of the framework, we implement two common event-based localization pipelines: Visual Place Recognition (VPR) and Simultaneous Localization and Mapping (SLAM). We demonstrate the ability of the framework to systematically visualize and analyze the results of multiple methods and datasets, revealing key insights such as the association of parameters that control event collection counts and window sizes for frame generation with large variations in performance. The results and analysis demonstrate the importance of fairly comparing methodologies with consistent event image generation parameters. Our Event-LAB framework provides this ability for the research community by contributing a streamlined workflow for easily setting up multiple conditions.
|
| |
| 09:00-10:30, Paper ThI1I.85 | Add to My Program |
| Monorail-Like Gripper System with Dynamic and Modular Reconfiguration for Diverse Finger Layouts |
|
| Ikeda, Haruki | The University of Osaka |
| Higashi, Kazuki | Osaka University |
| Fukuda, Osamu | Saga University |
| Higashimori, Mitsuru | The University of Osaka |
Keywords: Grippers and Other End-Effectors, Multifingered Hands
Abstract: Robotic grasping requires flexible reconfiguration to handle diverse objects and tasks. This paper proposes a monorail-like reconfiguration framework for robotic grippers, inspired by train–rail relationships, that generates diverse finger layouts. The proposed framework unifies two complementary forms: dynamic reconfiguration, in which finger units move along an arbitrary non-circular track defined by the palm shape (palm track) to change the finger layout, and modular reconfiguration, in which the palm track shape and the number of fingers are modified to alter the achievable finger layout space. We developed a prototype gripper system that embodies the proposed framework and experimentally validated its unified reconfiguration capability. Dynamic reconfiguration with the S-shaped palm achieved seven distinct finger layouts with successful object grasping, while on-the-fly modular reconfiguration expanded the achievable finger layout space, enabling rapid adaptation to different grasping tasks. This work establishes a new design principle for reconfigurable grippers toward highly adaptive and versatile robotic grasping.
|
| |
| 09:00-10:30, Paper ThI1I.86 | Add to My Program |
| Uni-Skill: Building Self-Evolving Skill Repository for Generalizable Robotic Manipulation |
|
| Xie, Senwei | Institute of Computing Technology, Chinese Academy of Sciences |
| Zhang, Yuntian | Institute of Computing Technology, Chinese Academy of Sciences |
| Wang, Ruiping | Institute of Computing Technology, Chinese Academy of Sciences |
| Chen, Xilin | Institute of Computing Technology, Chinese Academy |
Keywords: Learning from Demonstration, Integrated Planning and Learning, Manipulation Planning
Abstract: While skill-centric approaches leverage foundation models to enhance generalization in compositional tasks, they often rely on fixed skill libraries, limiting adaptability to new tasks without manual intervention. To address this, we propose Uni-Skill, a Unified Skill-centric framework that supports skill-aware planning and facilitates automatic skill evolution. Unlike prior methods that restrict planning to predefined skills, Uni-Skill requests new skill implementations when existing ones are insufficient, ensuring adaptable planning with a self-augmented skill library. To support automatic implementation of diverse skills requested by the planning module, we construct SkillFolder, a VerbNet-inspired repository derived from large-scale unstructured robotic videos. SkillFolder introduces a hierarchical skill taxonomy that captures diverse skill descriptions at multiple levels of abstraction. By populating this taxonomy with large-scale, automatically annotated demonstrations, Uni-Skill shifts the paradigm of skill acquisition from inefficient manual annotation to efficient offline structural retrieval. Retrieved examples provide semantic supervision over behavior patterns and fine-grained references for spatial trajectories, enabling few-shot skill inference without deployment-time demonstrations. Comprehensive experiments in both simulation and real-world settings verify the state-of-the-art performance of Uni-Skill over existing VLM-based skill-centric approaches, highlighting its advanced reasoning capabilities and strong zero-shot generalization across a wide range of novel tasks.
|
| |
| 09:00-10:30, Paper ThI1I.87 | Add to My Program |
| MASTD3R-SLAM: Monocular Adaptive Semantic Tracking and Dynamic Reconstruction SLAM |
|
| Yang, Fengwei | Duke University |
| Lin, Qingran | Georgia Institute of Technology |
| Zhu, Chaolun | Waseda University |
Keywords: SLAM, Mapping, Visual Tracking
Abstract: Dynamic scenes have long been one of the core challenges in the application and generalization of SLAM systems. Traditional visual SLAM systems often rely on depth sensors and prior camera parameters, making it difficult to handle dynamic content in arbitrary input images while simultaneously constructing dense maps. Recently, neural network-based methods for two-view point cloud prediction have gained attention, and SLAM systems such as DUST3R and MAST3R have emerged based on this approach. However, these systems struggle when applied to dynamic scenes and cannot directly adopt traditional correction methods such as semantic masking or optical flow segmentation. To address this issue, we propose MASTD3R-SLAM, a SLAM method specifically designed for dynamic scenes that supports arbitrary video inputs. The method combines fused mask-based processing with coarse-to-fine pointmap alignment and optimization to achieve point cloud–to–pose re-mapping correction, and further performs Gaussian rendering to remove rendering artifacts and suppress dynamic mapping interference. Compared to the original baseline, our approach reduces tracking ATE by more than 90% and successfully restores the correct 3D map.
|
| |
| 09:00-10:30, Paper ThI1I.88 | Add to My Program |
| Physically-Based Lighting Generation for Robotic Manipulation |
|
| Jin, Shutong | KTH Royal Institute of Technology |
| Wang, Lezhong | DTU |
| Temming, Ben | KTH Royal Institute of Technology |
| Pokorny, Florian T. | KTH Royal Institute of Technology |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning Methods
Abstract: We propose the first framework that leverages physically-based inverse rendering for novel lighting generation on existing real-world human demonstrations of robotic manipulation tasks. Specifically, inverse rendering decomposes the first frame in each demonstration into geometric (surface normal, depth) and material (albedo, roughness, metallic) properties, which are then used to render appearance changes under different lighting sources. To improve efficiency and maintain consistency across each generated sequence, we fine-tune Stable Video Diffusion on robot execution videos for temporal lighting propagation. We evaluate our framework by measuring the visual quality of the generated sequences, assessing its effectiveness in improving the imitation learning policy performance (38.75%) under six unseen real-world lighting conditions, and conducting ablation studies on individual modules of the proposed framework. We further showcase three downstream applications enabled by the proposed framework: background generation, object texture generation and distractor positioning.
|
| |
| 09:00-10:30, Paper ThI1I.89 | Add to My Program |
| Feasibility Study: Using Bypass Directly in Structured Warehouse for Multi-Agent Path Finding |
|
| Xu, Sen | Chongqing University |
| Zhao, Kai | Chongqing University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Motion and Path Planning
Abstract: In warehouse environments, corridor conflicts often lead to traffic congestion, which results in many Multi-Agent Path Finding (MAPF) algorithms failing to find solutions within a reasonable time limit. Previous works have studied using corridor reasoning techniques or incorporating guidance, such as highways, to address this problem. However, these approaches often either encounter timeouts or yield low-quality solutions. In this work, based on Conflict-Based Search (CBS), we propose a technique called Reversible Lanes, specifically designed to address corridor conflicts by imposing a new constraint that forces agents to use bypasses for conflict resolution. Our approach is motivated by three key observations from prior research: (1) in warehouse maps, the overhead associated with maintaining solution optimality via corridor reasoning techniques is often disproportionate to the benefits gained; (2) the fixed nature of manually designed highways exhibits a lack of adaptability, leading to poor solution quality on certain instances; and (3) the structural properties of warehouse layouts render direct bypass usage feasible and incur minimal additional costs. Theoretically, we demonstrate the feasibility of our algorithm by analyzing its relationship to both corridor reasoning techniques and highways. Experimentally, the results show that our algorithm provides a more effective approach for resolving corridor conflicts compared to these existing methods, achieving a superior trade-off between solution quality and computational efficiency by finding near-optimal solutions with reduced runtime.
|
| |
| 09:00-10:30, Paper ThI1I.90 | Add to My Program |
| Effective Trajectory Tracking with Convex-Optimization Based Obstacle-Avoidance Method for Continuum Robot |
|
| Deng, Ping | The University of Hong Kong |
| Peng, Rui | The University of Hong Kong |
| Tang, Duo | The University of Hong Kong |
| Cao, Xiao | University of Hong Kong |
| Lu, Peng | The University of Hong Kong |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications, Collision Avoidance
Abstract: A cable-driven continuum robot with high redundancy is capable of performing tip trajectory tracking tasks while simultaneously satisfying additional safety constraints, such as joint limits and avoidance of external obstacles in the environment. To address these challenges, efficient motion planning methods are required. This paper proposes a quadratic-programming-based method in conjunction with convex-polytope-based distance computation. Our methodology integrates safety constraints based on the robot's posture state, thus enabling obstacle evasion in dynamic situations. Simulation outcomes demonstrate effective trajectory tracking in the presence of various objects and provide a comprehensive performance evaluation based on the generated robot states. Finally, a real-world experiment was conducted on a prototype of a three-segment cable-driven continuum manipulator, which confirmed the efficacy of the proposed obstacle avoidance approach. The approach is versatile and can be adapted to similar multi-segment cable-driven continuum robotic systems by adjusting the robot parameters, enabling successful tip trajectory tracking under complex obstacle conditions.
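A minimal sketch of the velocity-level quadratic program this describes, with a linear distance constraint standing in for the convex-polytope distance computation. It assumes cvxpy and uses random stand-in Jacobians; the real system derives these quantities from the continuum robot's kinematics.

```python
# Sketch of the velocity-level QP: track a tip velocity while keeping the
# polytope-to-obstacle distance from shrinking too fast near a margin (a
# CBF-style linear constraint). Jacobians, distances, and limits are stand-ins.
import cvxpy as cp
import numpy as np

n = 9                                   # actuated DOFs of the continuum robot
J_tip = np.random.default_rng(2).normal(size=(3, n))   # tip Jacobian (assumed)
v_des = np.array([0.02, 0.0, 0.01])     # desired tip velocity (m/s)

d = 0.05                                # current distance to obstacle (m)
d_margin = 0.02                         # safety margin (m)
grad_d = np.random.default_rng(3).normal(size=n)       # d(distance)/dq (assumed)

dq = cp.Variable(n)
objective = cp.Minimize(cp.sum_squares(J_tip @ dq - v_des) + 1e-3 * cp.sum_squares(dq))
constraints = [
    grad_d @ dq >= -2.0 * (d - d_margin),   # distance may shrink only slowly
    cp.abs(dq) <= 0.5,                      # joint-rate limits
]
cp.Problem(objective, constraints).solve()
print("joint-rate command:", np.round(dq.value, 4))
```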
|
| |
| 09:00-10:30, Paper ThI1I.91 | Add to My Program |
| A Novel Reconfigurable Dexterous Hand Based on Triple-Symmetric Bricard Parallel Mechanism |
|
| Tian, Chunxu | Fudan University |
| Huang, Zhichao | Fudan University |
| Li, Hongzeng | Fudan University |
| Wang, Bo | Fudan University |
| Jia, Jinghao | Fudan University |
| Sun, Yirui | Fudan University |
| Zhang, Dan | The Hong Kong Polytechnic University |
Keywords: Parallel Robots, Mechanism Design, Kinematics
Abstract: This paper introduces a novel design for a robotic hand based on parallel mechanisms. The proposed hand uses a triple-symmetric Bricard linkage as its reconfigurable palm, enhancing adaptability to objects of varying shapes and sizes. Through topological and dimensional synthesis, the mechanism achieves a degree-of-freedom and link configuration well suited to reconfigurable palm motion, balancing dexterity, stability, and load capacity. Furthermore, kinematic analysis is performed using screw theory and closed-loop constraints, and performance is evaluated based on workspace, stiffness, and motion/force transmission efficiency. Finally, a prototype is developed and tested through a series of grasping experiments, demonstrating the ability to perform stable and efficient manipulation across a wide range of objects. The results validate the effectiveness of the design in improving grasping versatility and operational precision, offering a promising solution for advanced robotic manipulation tasks.
|
| |
| 09:00-10:30, Paper ThI1I.92 | Add to My Program |
| In-Situ Automated Robotic Crown Preparation with MPC-Based Adaptive Control |
|
| Liu, Heng | Beihang University |
| Fang, Huayu | The Fourth Military Medical University |
| Bai, Shizhu | The Fourth Military Medical University |
| Zhao, Yimin | The Fourth Military Medical University |
| Wang, Junchen | Beihang University |
Keywords: Surgical Robotics: Planning, Collision Avoidance, Robust/Adaptive Control
Abstract: Crown preparation aims to create an optimal foundation for durable and functional restoration by reshaping the tooth with a cutting tool. Robotic crown preparation has emerged as a promising approach to overcome the inherent limitations of manual procedures, yet challenges remain in achieving efficient cutting path generation, collision-free orientation adjustment and precise cutting path following, since the oral cavity is a confined space with the target tooth tightly surrounded by other teeth. This paper introduces a novel, in-situ automated robotic full crown preparation system comprising (1) Preoperative Path Planning: generating high-efficiency universal cutting paths based on tooth morphological features; (2) Intraoral Collision Avoidance: optimizing the cutting tool's orientation within the constrained oral cavity; (3) MPC-Based Adaptive Control: modulating the path-following feed rate using model predictive control (MPC) according to intraoperative force feedback. The proposed system was thoroughly validated on a human head phantom targeting a permanent tooth to simulate a real clinical scenario, yielding an average root-mean-square (RMS) error (tooth shape after preparation) of 0.17 mm and an overall mean execution time of 347.77 s, achieving a 74.2% improvement in cutting efficiency over state-of-the-art methods. A comparative evaluation against conventional dental guides further demonstrates its technical feasibility and significant potential for clinical translation.
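To illustrate force-feedback feed-rate modulation, here is a toy receding-horizon scheme: enumerate short feed-rate sequences under an assumed first-order force model and apply the first action of the fastest feasible plan. The model structure and all constants are assumptions, not the paper's MPC formulation.

```python
# Toy sketch of MPC-style feed-rate adaptation: with a simple first-order model
# linking feed rate to cutting force, pick the feed-rate sequence that is as
# fast as possible while keeping predicted force under a limit. The model,
# horizon, and constants are illustrative assumptions.
import numpy as np
from itertools import product

K_FORCE = 30.0      # N per (mm/s) of feed, assumed linear cutting model
ALPHA = 0.6         # force dynamics smoothing per step, assumed
F_MAX = 4.0         # allowable cutting force (N), assumed
H = 3               # prediction horizon (steps)
CANDIDATES = [0.02, 0.05, 0.10]   # candidate feed rates (mm/s)

def best_feed_sequence(f_now: float):
    """Enumerate short sequences (the tiny action set makes this exact)."""
    best, best_progress = None, -1.0
    for seq in product(CANDIDATES, repeat=H):
        f, ok = f_now, True
        for u in seq:
            f = ALPHA * f + (1 - ALPHA) * K_FORCE * u   # predicted force
            ok &= f <= F_MAX
        if ok and sum(seq) > best_progress:
            best, best_progress = seq, sum(seq)
    return best

f_measured = 3.2    # intraoperative force feedback (N)
plan = best_feed_sequence(f_measured)
print("apply first feed rate of plan:", plan[0], "mm/s")
```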
|
| |
| 09:00-10:30, Paper ThI1I.93 | Add to My Program |
| Give Me Scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery |
|
| Luo, Xuejin | Beihang University |
| Sun, Shiquan | Beihang University |
| Zhang, Runshi | Beihang University |
| Zhang, Ruizhi | Beihang University |
| Wang, Junchen | Beihang University |
Keywords: Medical Robots and Systems, Collision Avoidance, Dual Arm Manipulation
Abstract: During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot's grasping and delivery trajectories in a zero-shot manner based on surgeons' instructions. A real-time method for perceiving the minimum distance to obstacles is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot's autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at https://give-me-scissors.github.io/.
|
| |
| 09:00-10:30, Paper ThI1I.94 | Add to My Program |
| GRAPE: Generalizing Robot Policy Via Preference Alignment |
|
| Zhang, Zijian | University of Minnesota |
| Zheng, Kaiyuan | University of Washington |
| Chen, Zhaorun | Purdue University |
| Jang, Joel | NVIDIA |
| Li, Yi | University of Washington |
| Han, Siwei | University of North Carolina at Chapel Hill |
| Wang, Chaoqi | University of Chicago |
| Ding, Mingyu | University of North Carolina at Chapel Hill |
| Fox, Dieter | University of Washington |
| Yao, Huaxiu | UNC-Chapel Hill |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Learning from Demonstration
Abstract: Despite the recent advancements of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failure trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks to independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.
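Trajectory-level preference alignment can be sketched with a DPO-style objective over whole rollouts, comparing a preferred and a rejected trajectory against a frozen reference policy; GRAPE's actual objective and its constraint-based preference construction are described in the paper. The stand-in "policies" and the beta value below are assumptions.

```python
# Sketch of trajectory-level preference alignment (a DPO-style objective over
# whole rollouts): prefer the chosen trajectory's summed log-probability over
# the rejected one's, relative to a frozen reference policy. Policy internals
# are assumptions for illustration.
import torch
import torch.nn.functional as F

def traj_logprob(policy, traj) -> torch.Tensor:
    """Sum of per-step action log-probs along a rollout (stand-in)."""
    return policy(traj).sum()

def preference_loss(policy, ref_policy, traj_win, traj_lose, beta=0.1):
    pi_w = traj_logprob(policy, traj_win) - traj_logprob(ref_policy, traj_win)
    pi_l = traj_logprob(policy, traj_lose) - traj_logprob(ref_policy, traj_lose)
    return -F.logsigmoid(beta * (pi_w - pi_l))

# Stand-in 'policies' mapping trajectory features to per-step log-probs.
policy = torch.nn.Linear(16, 1)
ref_policy = torch.nn.Linear(16, 1)
for p in ref_policy.parameters():
    p.requires_grad_(False)          # reference stays frozen

traj_success = torch.randn(20, 16)   # features of a successful rollout
traj_failure = torch.randn(20, 16)   # features of a failed rollout
loss = preference_loss(policy, ref_policy, traj_success, traj_failure)
loss.backward()
print("preference loss:", loss.item())
```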
|
| |
| 09:00-10:30, Paper ThI1I.95 | Add to My Program |
| When Attention Betrays: Erasing Backdoor Attacks in Robotic Policies by Reconstructing Visual Tokens |
|
| Li, Xuetao | Wuhan University |
| Fu, Pinhan | Wuhan University |
| Huang, Wenke | Wuhan University |
| Pan, Nengyuan | Hubei University |
| Yang, Songhua | Wuhan University |
| Zhao, Kaiyan | Wuhan University |
| Wan, Guancheng | University of California, Los Angeles (UCLA) |
| Li, Mengde | The Institute of Technological Sciences, Wuhan University, Hubei, China |
| Xuan, Jifeng | Wuhan University |
| Li, Miao | Wuhan University |
|
|
| |
| 09:00-10:30, Paper ThI1I.96 | Add to My Program |
| DemoDiffusion: One-Shot Human Imitation Using Pre-Trained Diffusion Policy |
|
| Park, Sungjae | Carnegie Mellon University |
| Bharadhwaj, Homanga | Carnegie Mellon University |
| Tulsiani, Shubham | Carnegie Mellon University |
Keywords: Deep Learning in Grasping and Manipulation, Dexterous Manipulation, Learning from Demonstration
Abstract: We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot’s end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves 83.8% average success rate, compared to 13.8% for the pre-trained policy and 52.5% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/
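The core mechanism the abstract describes (denoising from a partially noised retargeted trajectory rather than from pure noise) can be sketched as follows; the denoiser stub, schedule, and starting step are assumptions standing in for the pre-trained generalist diffusion policy.

```python
# Sketch of the DemoDiffusion idea per the abstract: partially noise the
# kinematically retargeted human trajectory, then let a pre-trained diffusion
# policy denoise from that point, keeping the result near the demonstration
# while pulling it onto the robot's action distribution.
import torch

T = 100
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def denoiser(x, t):               # stand-in for the pre-trained policy
    return torch.zeros_like(x)    # predicts noise; toy behavior only

def demo_guided_sample(retargeted: torch.Tensor, t_start: int = 40):
    """Noise the retargeted trajectory to step t_start, then denoise it."""
    x = alphas_bar[t_start].sqrt() * retargeted + \
        (1 - alphas_bar[t_start]).sqrt() * torch.randn_like(retargeted)
    for t in range(t_start, -1, -1):          # standard DDPM reverse steps
        eps = denoiser(x, t)
        x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

retargeted_traj = torch.randn(16, 7)   # retargeted end-effector waypoints
robot_traj = demo_guided_sample(retargeted_traj)
print(robot_traj.shape)
```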
|
| |
| 09:00-10:30, Paper ThI1I.97 | Add to My Program |
| UCA-SLAM: Tightly Coupled Visual-LiDAR SLAM with DoF-Wise Uncertainty-Driven Constraint Analysis |
|
| Yu, Shizhuo | Nankai University |
| Zhu, Wenbin | Nankai University |
| Yuan, Jing | Nankai University |
| Gao, Yuanxi | Nankai University |
Keywords: SLAM, Localization
Abstract: Single-sensor (visual or LiDAR) simultaneous localization and mapping (SLAM) is fragile in complex environments, which has made visual-LiDAR fusion a mainstream direction in SLAM research. However, most existing fusion methods omit explicit modeling of feature uncertainties and do not quantify each feature's constraint strength on each degree of freedom (DoF) of the 6-DoF pose, thereby hindering the full exploitation of the complementary information across different sensors. In this paper, a tightly coupled visual-LiDAR SLAM method termed UCA-SLAM is proposed, which integrates closed-form uncertainty propagation and DoF-wise constraint analysis. Specifically, UCA-SLAM maintains uncertainties for visual map points and LiDAR voxel planes, and computes a DoF-wise constraint strength for each feature. In the front-end tracking, the DoF-wise constraints of features are comprehensively analyzed, which provides an adaptive fusion mechanism for pose estimation, and an explicit uncertainty propagation from feature measurements to the 6-DoF pose is derived. The resultant feature and pose uncertainties are then used to weight the cost function in the local bundle adjustment (BA) optimization of UCA-SLAM to improve the accuracy of the system. Extensive experiments conducted on public datasets and in real-world environments demonstrate that UCA-SLAM outperforms state-of-the-art visual-LiDAR fusion SLAM methods. UCA-SLAM is open-sourced to benefit the community.
|
| |
| 09:00-10:30, Paper ThI1I.98 | Add to My Program |
| GGD-SLAM: Monocular 3DGS SLAM Powered by Generalizable Motion Model for Dynamic Environments |
|
| Liu, Yi | Tsinghua University |
| Xu, Haoxuan | The Hong Kong University of Science and Technology (Guangzhou) |
| Duan, Hongbo | Tsinghua University |
| Fan, Keyu | Tsinghua University |
| Zhang, Zhengyang | Tsinghua University |
| Zhuang, Peiyu | Yat-Sen University |
| Luo, Pengting | Huawei |
| Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Keywords: Mapping, Computer Vision for Automation, SLAM
Abstract: Visual SLAM algorithms achieve significant improvements through the exploration of 3D Gaussian Splatting (3DGS) representations, particularly in generating high-fidelity dense maps. However, they depend on a static environment assumption and experience significant performance degradation in dynamic environments. This paper presents GGD-SLAM, a framework that employs a generalizable motion model to address the challenges of localization and dense mapping in dynamic environments—without predefined semantic annotations or depth input. Specifically, the proposed system employs a First-In-First-Out (FIFO) queue to manage incoming frames, facilitating dynamic semantic feature extraction through a sequential attention mechanism. This is integrated with a dynamic feature enhancer to separate static and dynamic components. Additionally, to minimize dynamic distractors' impact on the static components, we devise a method to fill occluded areas via static information sampling and design a distractor-adaptive Structure Similarity Index Measure (SSIM) loss tailored for dynamic environments, significantly enhancing the system's resilience. Experiments conducted on real-world dynamic datasets demonstrate that the proposed system achieves state-of-the-art performance in camera pose estimation and dense reconstruction in dynamic scenes.
|
| |
| 09:00-10:30, Paper ThI1I.99 | Add to My Program |
| Mash, Spread, Slice! Learning to Manipulate Object States Via Visual Spatial Progress |
|
| Mandikal, Priyanka | The University of Texas at Austin |
| Hu, Jiaheng | UT Austin |
| Dass, Shivin | UT Austin |
| Majumder, Sagnik | UT Austin |
| Martín-Martín, Roberto | University of Texas at Austin |
| Grauman, Kristen | UT Austin and Facebook AI Research |
Keywords: Perception for Grasping and Manipulation, Sensorimotor Learning
Abstract: Most robot manipulation focuses on changing the kinematic state of objects: picking, placing, opening, or rotating them. However, a wide range of real-world manipulation tasks involve a different class of object state change—such as mashing, spreading, or slicing—where the object’s physical and visual state evolve progressively without necessarily changing its position. We present SPARTA, the first unified framework for the family of object state change manipulation tasks. Our key insight is that these tasks share a common structural pattern: they involve spatially-progressing, object-centric changes that can be represented as regions transitioning from an actionable to a transformed state. Building on this insight, SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions for specific object state change tasks, to generate a) structured policy observations that strip away appearance variability, and b) dense rewards that capture incremental progress over time. These are leveraged in two SPARTA policy variants: reinforcement learning for fine-grained control without demonstrations or simulation; and greedy control for fast, lightweight deployment. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects, achieving significant improvements in training time and accuracy over sparse rewards and visual goal-conditioned baselines. Our results highlight progress-aware visual representations as a versatile foundation for the broader family of object state manipulation tasks. More information at https://vision.cs.utexas.edu/projects/sparta-robot
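A minimal sketch of a dense progress reward derived from actionable-versus-transformed segmentation maps: reward each step by the newly transformed fraction of the object region. Mask shapes and the monotone clipping are assumptions; SPARTA obtains these maps from its learned segmentation skill.

```python
# Sketch of a dense reward from a spatially-progressing segmentation: reward
# newly transformed object area at each step. Mask sources are assumptions.
import numpy as np

def progress(mask_transformed: np.ndarray, mask_object: np.ndarray) -> float:
    """Fraction of the object region already in the transformed state."""
    obj = mask_object.sum()
    return float((mask_transformed & mask_object).sum() / max(obj, 1))

def dense_reward(prev_mask, curr_mask, mask_object) -> float:
    """Positive reward only for newly transformed area (monotone progress)."""
    return max(progress(curr_mask, mask_object) - progress(prev_mask, mask_object), 0.0)

H, W = 64, 64
obj = np.zeros((H, W), dtype=bool)
obj[16:48, 16:48] = True                       # object region
before = np.zeros_like(obj)
before[16:32, 16:48] = True                    # mashed so far
after = before.copy()
after[32:40, 16:48] = True                     # a bit more mashed this step
print("step reward:", round(dense_reward(before, after, obj), 4))
```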
|
| |
| 09:00-10:30, Paper ThI1I.100 | Add to My Program |
| Learning End-To-End Dexterous Arm-Hand VLA Policies with Shared Autonomy: DexGrasp AI Copilot for Efficient Teleoperation |
|
| Cui, Yu | ByteDance Inc |
| Zhang, Yujian | ByteDance Inc |
| Tao, Lina | ByteDance Inc |
| Li, Yang | ByteDance Inc |
| Yi, Xinyu | ByteDance Inc |
| Li, Zhibin (Alex) | ByteDance Inc |
Keywords: Learning from Demonstration, Imitation Learning, Dexterous Manipulation
Abstract: Achieving human-like dexterous manipulation is essential for general-purpose robots but remains a challenge. Recent advances in Vision-Language-Action (VLA) models offer the potential to learn flexible skills from demonstration data. However, training effective VLAs requires a large amount of high-quality data, which is difficult to obtain: fully manual teleoperation cognitively overloads human operators, while automated planning produces unnatural motions and lacks data diversity. We present a Shared Autonomy framework: a human operator teleoperates the arm for global motion, while an autonomous DexGrasp-VLA policy, as an AI Copilot, generates force-adaptive actions for a five-finger hand with tactile feedback -- drastically reducing human effort and enabling efficient collection of high-quality demonstrations. Using these data, we train an end-to-end VLA policy with a novel Arm-Hand Feature Enhancement module -- shared representations are combined with separate arm and hand latent features, representing the distinct dynamics of macro and micro movements, leading to more robust and natural coordination of arm-hand motions. Our Corrective Teleoperation can further refine the policy with failure-recovery demonstrations via human intervention. Experiments show our approach efficiently generates high-quality data and learns policies with a high success rate and natural behaviors. The trained arm-hand VLA policy generalizes effectively to both seen and unseen objects, with a success rate of around 90% across more than 50 diverse objects.
|
| |
| 09:00-10:30, Paper ThI1I.101 | Add to My Program |
| Scene-Aware Robotic Light Pipe Control for Vitreoretinal Surgery |
|
| Lin, Wenjun | National University of Singapore |
| Zhang, Wending | National University of Singapore |
| Chng, Chin-Boon | National University of Singapore |
| Tan, Yong Jun | National University of Singapore |
| Chui, Chee Kong | National University of Singapore |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics, Human-Robot Collaboration
Abstract: Surgical robotics has revolutionized medical procedures by offering enhanced precision and reduced complications. However, vitreoretinal surgery still relies heavily on manual techniques, where surgeons manage both a surgical tool and a light pipe, complicating operations and potentially affecting outcomes. To improve efficiency and outcomes while reducing workloads on surgeons, a novel vision-based robot-assisted system with advanced surgical scene understanding ability is proposed. The system automatically positions a light pipe held by a specialized surgical robot through optimization-based visual collaborative control. By identifying target areas for automatic illumination, the system allows surgeons to focus on surgical tasks and supports more complex surgeries such as three-arm procedures. In addition, the system enhances surgical safety by detecting surgical activities and dangerous areas and issuing alerts accordingly. Postoperatively, the system records tool trajectories and detected activities, providing data for surgical reports, skill evaluation, and training. Experiments demonstrate the effectiveness of the control system, visual algorithm, and overall collaborative system.
|
| |
| 09:00-10:30, Paper ThI1I.102 | Add to My Program |
| 4D Radar Diffusion with Adaptive Visual-Aided Condition for Point Cloud Enhancement |
|
| Xiao, Renxiang | Harbin Institute of Technology, Shenzhen |
| Zhang, Yuanfan | Harbin Institute of Technology |
| Liu, Wei | Harbin Institute of Technology, Shenzhen |
| Dong, Guangzhong | Harbin Institute of Technology, Shenzhen |
| Lou, Yunjiang | Harbin Institute of Technology, Shenzhen |
| Hu, Liang | Harbin Institute of Technology, Shenzhen |
Keywords: Deep Learning Methods, Range Sensing
Abstract: Despite its resilience in adverse weather, millimeter-wave (mmWave) radar yields sparse and noisy point clouds that limit its perception and localization performance. Diffusion models have recently gained attention for enhancing millimeter-wave radar in perception tasks due to their strong denoising and generative capabilities. Yet the enhanced radar point clouds still fall short of expectations due to a lack of texture information and errors caused by the inherent sensor–model mismatch between LiDAR and radar. In this paper, we propose an adaptive vision-aided radar data enhancement method based on a conditional diffusion model for denoising and densifying radar point clouds. The pipeline decomposes mmWave radar into depth and BEV views, fuses the depth view with synchronized images, and uses the fused features together with BEV tokens to condition the diffusion model. LiDAR is used only for training supervision, not for inference. Extensive experiments demonstrate that our proposed method produces dense and geometrically consistent radar point clouds, validating the effectiveness of the introduced vision aid for radar enhancement. Notably, our method works well even in scenarios with visual occlusions. The accurate odometry and high-fidelity map reconstruction achieved using the enhanced radar point clouds highlight the great potential of our method for other downstream tasks in robotics and autonomous driving.
|
| |
| 09:00-10:30, Paper ThI1I.103 | Add to My Program |
| BFMPF-Net: Bidirectional Frequency-Domain Modulation Progressive Fusion Network for Road Crack Segmentation |
|
| Yang, Wen | China Three Gorges University |
| Zheng, Yingying | China Three Gorges University |
| Sun, Hang | China Three Gorges University |
| Liang, Chao | Wuhan University |
| Fang, Lei | CAAZ (Zhejiang) Information Technology Co., Ltd |
Keywords: Object Detection, Segmentation and Categorization
Abstract: Recently, deep learning–based methods for road crack segmentation have achieved promising performance, particularly in robotic vision applications such as automated inspection and maintenance. However, most frequency-domain methods employ a decoupled processing strategy, overlooking the dynamic modulation mechanism between high- and low-frequency components, which constrains the model's effectiveness in detecting cracks within complex environments. Moreover, existing methods suffer from low information fidelity during feature transmission, where critical encoder details are progressively lost in the decoder, making it difficult to reconstruct complete crack structures. To address these issues, we propose a Bidirectional Frequency-domain Modulation Progressive Fusion Network (BFMPF-Net). Specifically, we propose a Bidirectional Frequency-domain Modulation Enhancement (BFME) module that effectively exploits bidirectional modulation between high- and low-frequency components and learns the spatial weights of high-frequency features to attenuate noise and preserve crack edge details, thereby improving the performance of crack segmentation. Furthermore, the Progressive Guidance Fusion module serves as another core component of our framework. It leverages the spatial prior provided by the original low-resolution image to guide feature refinement via stepwise optimization from coarse contours to fine edges, thereby ensuring the integrity of crack segmentation. Evaluation on three publicly available datasets—CrackTree260, CrackLS315, and Crack760—affirms the superior segmentation accuracy of the proposed BFMPF-Net compared to current mainstream methods.
|
| |
| 09:00-10:30, Paper ThI1I.104 | Add to My Program |
| Implicit LiDAR SLAM with Confidence-Guided SDF and Normal-Driven Sampling |
|
| Liu, Hong | Southeast University |
| Huang, Feixuan | Southeast University |
| Gao, Wang | Southeast University |
| Xu, Jinle | Southeast University |
| Pan, Shuguo | Southeast University |
| Ling, Keck-Voon | Nanyang Technological University |
Keywords: Localization, SLAM, Autonomous Vehicle Navigation
Abstract: Implicit representations for LiDAR-based Simultaneous Localization and Mapping (SLAM) offer significant advantages in storage efficiency and expressive power over traditional explicit maps. However, a critical limitation of implicit SLAM methods is their deterministic nature, which prevents the quantification of prediction uncertainty in sparse or noisy conditions. Furthermore, the accuracy of the underlying Signed Distance Field (SDF) is often compromised by systematic errors arising from the angular dependency of LiDAR measurements, where oblique incident angles lead to biased distance estimations and degrade map quality. To address these challenges, this paper introduces a framework that enhances the robustness and accuracy of implicit LiDAR SLAM by integrating uncertainty estimation and an adaptive sampling strategy. We propose a neural network-based approach to learn and predict SDF uncertainty, which is then effectively incorporated into both localization and mapping processes. Concurrently, to mitigate incident angle-induced errors, we develop an adaptive sampling scheme that weights LiDAR rays based on surface normal information. Validation on public datasets and a custom experimental platform demonstrates that our approach outperforms baseline methods in terms of localization, mapping accuracy, and robustness.
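As a rough illustration of the normal-driven idea (not the authors' implementation), rays can be weighted by the cosine of their incidence angle so that oblique, bias-prone returns contribute less; the function name and clipping below are assumptions:

```python
import numpy as np

def incidence_weights(ray_dirs, normals, eps=1e-6):
    """Weight each LiDAR ray by |cos| of its incidence angle: oblique
    hits (cos near 0) are down-weighted since their ranges are more
    biased, while near-perpendicular hits keep weight close to 1.
    Both inputs are (N, 3) arrays of unit vectors."""
    cosines = np.abs(np.sum(ray_dirs * normals, axis=1))
    return np.clip(cosines, eps, 1.0)

# toy usage: one near-perpendicular ray and one grazing ray
rays = np.array([[0.0, 0.0, -1.0], [0.99, 0.0, -0.14]])
surf = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
print(incidence_weights(rays, surf))  # approx. [1.0, 0.14]
```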
|
| |
| 09:00-10:30, Paper ThI1I.105 | Add to My Program |
| MIMO: A Multimodal Imitation Learning Framework for Mobile Manipulation with Exoskeleton-VR Teleoperation |
|
| Mei, Jie | Beihang University |
| Wu, Xinkai | Beihang University |
| Zhang, Yue | Beihang University |
| Song, Tao | Beihang University |
| Xiong, Zhongxia | Beihang University |
Keywords: Mobile Manipulation, Imitation Learning, Wheeled Robots
Abstract: In whole-body mobile manipulation, existing teleoperation systems often suffer from high complexity and cost, while imitation learning approaches are frequently limited by insufficient modeling of long-horizon action sequences and inadequate fusion of multi-receptive-field visual features. These constraints significantly hinder the collection of high-quality demonstration data and the effective transfer of complex robotic skills. To address these challenges, this paper proposes an integrated exoskeleton-VR teleoperation system that enables single-operator whole-body control of mobile manipulators with basic force feedback, substantially reducing the cost of data collection while improving demonstration quality. Furthermore, we introduce MIMO, an encoder–decoder imitation learning framework, which incorporates an Efficient Context Modeling Network (ECM-Net) based on linear-complexity temporal modeling to mitigate error accumulation in long-horizon tasks, and a Multi-Receptive Field Fusion Network (MRF-Net) that employs dual-path attention to achieve precise alignment between multi-scale visual cues and motion phases. Real-world experiments on a mobile manipulator demonstrate that MIMO consistently outperforms state-of-the-art baselines across multiple whole-body mobile manipulation tasks, confirming its effectiveness in long-horizon, fine-grained robotic control.
|
| |
| 09:00-10:30, Paper ThI1I.106 | Add to My Program |
| Toward Robust Collaborative Perception under Adverse Weather Conditions Via Dual-Branch Network |
|
| Yang, Yuquan | University of Science and Technology of China |
| Zhang, Hui | University of Science and Technology of China |
| Zhang, ZiYin | University of Science and Technology of China |
| Lu, Wenyu | University of Science and Technology of China |
| Xu, Xiaohua | University of Science and Technology of China |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception
Abstract: Recent advances in collaborative perception systems have led to significant improvements in 3D object detection performance. While widely deployed LiDAR and camera systems often experience performance degradation under adverse weather conditions, weather-robust 4D radar offers a promising alternative to address this challenge. However, effectively fusing 4D radar measurements with degraded LiDAR data remains a critical challenge. In this work, we decompose the weather-induced degradation in LiDAR perception into feature attenuation requiring enhancement and feature contamination requiring suppression, based on the underlying physical interactions. Building upon this decomposition, we propose a dual-branch network to handle each degradation pattern in a specialized manner. One branch focuses on enhancement based on spatial and channel attention, guided by 4D radar cues. The other branch focuses on suppression based on intra-modal structural consistency and cross-modal consistency. To achieve adaptive branch integration, we propose a dynamic decision network to generate a decision weight map for each branch and capture the complex interaction between branches. To validate the effectiveness of our method, we conduct extensive experiments on V2X-R, the only publicly available collaborative LiDAR-4D radar dataset. The results demonstrate that our method achieves improvements of 3.65% and 10.80% in mAP@0.7 under fog and snow conditions, respectively, outperforming previous state-of-the-art approaches.
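For intuition only, the adaptive branch integration described above amounts to a pixel-wise convex combination driven by a learned weight map; the sketch below (PyTorch, with assumed shapes and names) shows the fusion step, not the paper's decision network:

```python
import torch

def fuse_branches(enhanced, suppressed, weight_logits):
    """Blend the enhancement and suppression branch outputs with a
    per-pixel decision weight in [0, 1], obtained here from a sigmoid
    over raw logits that a decision network would predict."""
    w = torch.sigmoid(weight_logits)          # (B, 1, H, W)
    return w * enhanced + (1 - w) * suppressed

fused = fuse_branches(torch.randn(1, 64, 32, 32),
                      torch.randn(1, 64, 32, 32),
                      torch.randn(1, 1, 32, 32))
```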
|
| |
| 09:00-10:30, Paper ThI1I.107 | Add to My Program |
| DemoBot: Efficient Learning of Bimanual Manipulation with Dexterous Hands from Third-Person Human Videos |
|
| Xu, Yucheng | University of Edinburgh |
| Mao, Xiaofeng | Edinburgh University |
| Miller, Elle | University of Edinburgh |
| Li, Yang | ByteDance Inc |
| Yi, Xinyu | ByteDance Inc |
| Li, Zhibin (Alex) | University College London |
| Fisher, Robert | University of Edinburgh |
Keywords: Learning from Demonstration, Reinforcement Learning, Dexterous Manipulation
Abstract: This work presents DemoBot, a learning framework that enables a dual-arm, multi-finger robotic system to acquire complex manipulation skills from a single unannotated RGB-D video demonstration. The method extracts structured motion trajectories of both hands and objects from raw video data. These trajectories serve as motion priors for a novel reinforcement learning (RL) pipeline that learns to refine them through contact-rich interactions, thereby eliminating the need to learn from scratch. To address the challenge of learning long-horizon manipulation skills, we introduce: (1) Temporal-segment-based RL to enforce temporal alignment of the current state with demonstrations; (2) a Success-Gated Reset strategy to balance the refinement of readily acquired skills and the exploration of subsequent task stages; and (3) an Event-Driven Reward curriculum with adaptive thresholding to guide the learning of high-precision manipulation. The novel video processing and RL framework successfully achieved long-horizon synchronous and asynchronous bimanual assembly tasks, offering a scalable approach for direct skill acquisition from human videos.
|
| |
| 09:00-10:30, Paper ThI1I.108 | Add to My Program |
| PaiP: An Operational Aware Interactive Planner for Unknown Cabinet Environments |
|
| Wang, Chengjin | Tongji University |
| Yan, Zheng | Tongji University |
| Zhou, Yanmin | Tongji University |
| Shen, Runjie | Tongji University |
| Wang, Zhipeng | Tongji University |
| Cheng, Bin | Tongji University |
| He, Bin | Tongji University |
Keywords: Motion and Path Planning, Perception-Action Coupling, Contact Modeling
Abstract: Box/cabinet scenarios with stacked objects pose significant challenges for robotic motion due to visual occlusions and constrained free space. Traditional collision-free trajectory planning methods often fail when no collision-free paths exist, and may even lead to catastrophic collisions caused by invisible objects. To overcome these challenges, we propose an operational aware interactive motion planner (PaiP), a real-time closed-loop planning framework utilizing multimodal tactile perception. This framework autonomously infers object interaction features by perceiving motion effects at interaction interfaces. These interaction features are incorporated into grid maps to generate operational cost maps. Building upon this representation, we extend sampling-based planning methods to interactive planning by optimizing both path cost and operational cost. Experimental results demonstrate that PaiP achieves robust motion in narrow spaces. Project page: https://travelers-lab.github.io/PaiP/
|
| |
| 09:00-10:30, Paper ThI1I.109 | Add to My Program |
| Reference-Free Sampling-Based Model Predictive Control |
|
| Schramm, Fabian | Inria Paris, ENS Paris |
| Fabre, Pierre | Inria Paris |
| Perrin-Gilbert, Nicolas | Université Pierre Et Marie Curie-Paris 6, CNRS UMR 7222 |
| Carpentier, Justin | Inria Paris, ENS Paris |
Keywords: Legged Robots, Multi-Contact Whole-Body Motion Planning and Control, Humanoid and Bipedal Locomotion
Abstract: We present a sampling-based model predictive control (MPC) framework that enables emergent locomotion without relying on handcrafted gait patterns or predefined contact sequences. Our method discovers diverse motion patterns, ranging from trotting to galloping, robust standing policies, jumping, and handstand balancing, purely through the optimization of high-level objectives. Building on model predictive path integral (MPPI), we propose a cubic Hermite spline parameterization that operates on position and velocity control points. Our approach enables contact-making and contact-breaking strategies that adapt automatically to task requirements, requiring only a limited number of sampled trajectories. This sample efficiency enables real-time control on standard CPU hardware, eliminating the GPU acceleration typically required by other state-of-the-art MPPI methods. We validate our approach on the Go2 quadrupedal robot, demonstrating a range of emergent gaits and basic jumping capabilities. In simulation, we further showcase more complex behaviors, such as backflips, dynamic handstand balancing, and locomotion on a humanoid, all without requiring reference tracking or offline pre-training.
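The two ingredients named in the abstract, a cubic Hermite spline over position/velocity control points and the standard MPPI weighting, are both compact; a minimal sketch follows (NumPy, variable names assumed, not the authors' code):

```python
import numpy as np

def hermite_eval(p0, v0, p1, v1, t):
    """Cubic Hermite segment: knot positions p0, p1 and velocities
    v0, v1 fully determine a smooth curve, so only the knots need
    to be sampled and perturbed. t is an array of times in [0, 1]."""
    t = np.asarray(t)[:, None]
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * p0 + h10 * v0 + h01 * p1 + h11 * v1

def mppi_weights(costs, lam=1.0):
    """Classic MPPI importance weights: exponentiate negative cost and
    normalize; the weighted average of sampled knots gives the update."""
    z = np.exp(-(costs - costs.min()) / lam)
    return z / z.sum()

traj = hermite_eval(np.zeros(3), np.ones(3), np.ones(3), np.zeros(3),
                    np.linspace(0, 1, 50))
w = mppi_weights(np.random.rand(8))   # weights for 8 sampled rollouts
```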
|
| |
| 09:00-10:30, Paper ThI1I.110 | Add to My Program |
| Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control |
|
| Peng, Quanquan | Shanghai Jiao Tong University |
| Lin, Yunfeng | Shanghai Jiao Tong University |
| Xue, Yufei | Shanghai Jiao Tong University |
| Pang, Jiangmiao | Shanghai AI Laboratory |
| Zhang, Weinan | Shanghai Jiao Tong University |
Keywords: Humanoid and Bipedal Locomotion, Legged Robots, Reinforcement Learning
Abstract: Humanoid Whole-Body Controllers trained with reinforcement learning (RL) have recently achieved remarkable performance, yet many target a single robot embodiment. Variations in dynamics, degrees of freedom (DoFs), and kinematic topology still hinder a single policy from commanding diverse humanoids. Moreover, obtaining a generalist policy that not only transfers across embodiments but also supports richer behaviors—beyond simple walking, to squatting and leaning—remains especially challenging. In this work, we tackle these obstacles by introducing EAGLE, an iterative generalist-specialist distillation framework that produces a single unified policy that controls multiple heterogeneous humanoids without per-robot reward tuning. During each cycle, embodiment-specific specialists are forked from the current generalist, refined on their respective robots, and new skills are distilled back into the generalist by training on the pooled embodiment set. Repeating this loop until performance convergence produces a robust Whole-Body Controller validated on robots such as Unitree H1, G1, and Fourier N1. We conducted experiments on five different robots in simulation environments and four in real-world settings. Through quantitative evaluations, EAGLE achieves high tracking accuracy and robustness compared to other methods, marking a step toward scalable, fleet-level humanoid control.
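The iterative generalist-specialist cycle can be summarized as a short loop; the sketch below is a structural outline under assumed callables (fork, refine, distill), not the EAGLE training code:

```python
def eagle_cycle(generalist, robots, fork, refine, distill, cycles=3):
    """One outline of iterative generalist-specialist distillation:
    fork a specialist per embodiment from the current generalist,
    refine each on its own robot, then distill all specialists back
    into a single policy trained on the pooled embodiment set."""
    for _ in range(cycles):
        specialists = {r: refine(fork(generalist), r) for r in robots}
        generalist = distill(generalist, specialists)
    return generalist
```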
|
| |
| 09:00-10:30, Paper ThI1I.111 | Add to My Program |
| EndoDDC: Learning Sparse to Dense Reconstruction for Endoscopic Robotic Navigation Via Diffusion Depth Completion |
|
| Lin, Yinheng | The Chinese University of Hong Kong |
| Huang, Yiming | The Chinese University of Hong Kong |
| Cui, Beilei | The Chinese University of Hong Kong |
| Bai, Long | Alibaba DAMO Academy |
| Gao, Huxin | The Chinese University of Hong Kong |
| Ren, Hongliang | The Chinese University of Hong Kong |
| Lai, Jiewen | The Chinese University of Hong Kong |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy, Vision-Based Navigation
Abstract: Accurate depth estimation plays a critical role in the navigation of endoscopic surgical robots, forming the foundation for 3D reconstruction and safe instrument guidance. Fine-tuning pretrained models heavily relies on endoscopic surgical datasets with precise depth annotations. While existing self-supervised depth estimation techniques eliminate the need for accurate depth annotations, their performance degrades in environments with weak textures and variable lighting, leading to sparse reconstruction with invalid depth estimation. Depth completion using sparse depth maps can mitigate these issues and improve accuracy. Despite the advances in depth completion techniques in general fields, their application in endoscopy remains limited. To overcome these limitations, we propose EndoDDC, an endoscopic depth completion method that integrates images and sparse depth information with depth gradient features and optimizes depth maps through a diffusion model, addressing the issues of weak texture and light reflection in endoscopic environments. Extensive experiments on two publicly available endoscopy datasets show that our approach outperforms state-of-the-art models in both depth accuracy and robustness. This demonstrates the potential of our method to reduce visual errors in complex endoscopic environments. Our code will be released at https://github.com/yinheng-lin/EndoDDC.
|
| |
| 09:00-10:30, Paper ThI1I.112 | Add to My Program |
| Lazy Anytime Planning for the Dubins Moving Target Traveling Salesman Problem with Obstacles |
|
| Bhat, Anoop | Carnegie Mellon University |
| Gutow, Geordan | Michigan Technological University |
| Singh, Surya | Robotics and AI Institute |
| Ren, Zhongqiang | Shanghai Jiao Tong University |
| Rathinam, Sivakumar | TAMU |
| Choset, Howie | Carnegie Mellon University |
Keywords: Motion and Path Planning, Nonholonomic Motion Planning, Logistics
Abstract: The Dubins Moving Target Traveling Salesman Problem with Obstacles (Dubins MT-TSP-O) seeks an obstacle-free trajectory for an agent with a fixed speed and minimum turning radius that intercepts several moving targets. To tackle this NP-hard problem, we introduce the Lazy Iterated Random Generalized TSP (Lazy IRG) algorithm. Each iteration of Lazy IRG samples a set of possible interception points in space-time along the trajectories of the targets. Lazy IRG then manages the high computational cost of motion planning by alternating between two steps: first, it optimistically selects a sequence of interception points by solving a Generalized TSP (GTSP) assuming an obstacle-free world; second, it searches for obstacle-free trajectories between consecutive points in the sequence using an obstacle-aware RRT-Connect planner. If a trajectory is not found, Lazy IRG solves the GTSP again; otherwise, Lazy IRG enters its next iteration and samples new interception points. By deferring expensive collision-checking, our method efficiently focuses computational effort on the most promising solutions. Numerical results show that Lazy IRG finds significantly lower-cost solutions within a 1-minute time budget compared to the existing IRG-PGLNS algorithm.
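The lazy alternation is easiest to see as a loop that defers collision checking until an optimistic tour exists; the skeleton below uses assumed callables (solve_gtsp, connect) and hashable interception points, and is only a reading aid, not the published algorithm:

```python
def lazy_irg_iteration(points, solve_gtsp, connect, max_resolves=50):
    """Optimistically order interception points with an obstacle-free
    GTSP solve, then validate consecutive legs with an obstacle-aware
    planner; the first failed leg is forbidden and the GTSP re-solved."""
    forbidden = set()
    for _ in range(max_resolves):
        tour = solve_gtsp(points, forbidden)   # ignores obstacles
        for a, b in zip(tour, tour[1:]):
            if connect(a, b) is None:          # e.g. RRT-Connect fails
                forbidden.add((a, b))
                break
        else:
            return tour                        # every leg verified
    return None                                # caller samples new points
```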
|
| |
| 09:00-10:30, Paper ThI1I.113 | Add to My Program |
| Active Scene Reconstruction with Topological Reasoning and Semantic-Augmented Reinforcement Learning |
|
| Yuan, Yiqing | Sun Yat-Sen University |
| Li, Zhi | Sun Yat-Sen University |
| Ren, Hao | Sun Yat-Sen University |
| Zheng, Kairao | Sun Yat-Sen University |
| Cheng, Hui | Sun Yat-Sen University |
Keywords: Mapping, RGB-D Perception, Deep Learning for Visual Perception
Abstract: Active scene reconstruction aims to autonomously recover the fine-grained appearance and structural details of complex unknown scenes. Existing approaches based on 2D topological or voxel-based abstractions often scale poorly to large environments and rely heavily on handcrafted features and heuristic rules, limiting scalability and robustness. To address these challenges, using an RGB-D camera on a mobile robot, we present a graph-based planning framework that integrates skeleton-derived topology, Bird’s-Eye-View (BEV)-augmented graph inference, and offline Reinforcement Learning (RL) for policy optimization. The 3D skeleton graph captures full spatial connectivity, overcoming the limitations of 2D representations. BEV-augmented graph inference enriches node embeddings with semantic context, avoiding handcrafted feature design. The offline RL approach replaces heuristic planning with data-driven decision-making, while an additional Maximum Mean Discrepancy (MMD) term mitigates distributional shift before and after feature injection, improving stability. Extensive simulation results validate the efficacy of the proposed method. Real-world experiments demonstrate the zero-shot transferability of the learned policy, highlighting its potential for scalable, fine-grained scene reconstruction.
|
| |
| 09:00-10:30, Paper ThI1I.114 | Add to My Program |
| Latent Representations for Visual Proprioception in Inexpensive Robots |
|
| Sheikholeslami, Sahara | University of Central Florida |
| Bölöni, Ladislau | University of Central Florida |
Keywords: Perception for Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: Robotic manipulation requires explicit or implicit knowledge of the robot's joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.
|
| |
| 09:00-10:30, Paper ThI1I.115 | Add to My Program |
| A Multimodal Perception-Based Method for Slip Detection in Grasping Unknown Objects |
|
| Huang, Yu | Sichuan University |
| Chen, Yu | Sichuan University |
| Zhou, Qinghua | Sichuan University |
| Li, Wendong | Sichuan University |
| |
| 09:00-10:30, Paper ThI1I.116 | Add to My Program |
| TOCALib: Optimal Control Library with Interpolation for Bimanual Manipulation and Obstacles Avoidance |
|
| Danik, Yulia | MIRAI |
| Makarov, Dmitry | MIRAI |
| Arkhipova, Aleksandra | DGIST |
| Davidenko, Sergei | CESS |
| Panov, Aleksandr | MIRAI |
Keywords: Dual Arm Manipulation, Optimization and Optimal Control, Collision Avoidance
Abstract: The paper presents a new approach for constructing a library of optimal trajectories for two robotic manipulators, the Two-Arm Optimal Control and Avoidance Library (TOCALib). The optimization takes into account kinodynamic and other constraints within the FROST framework. The novelty of the method lies in the consideration of collisions using the DCOL method, which allows obtaining symbolic expressions for assessing the presence of collisions and using them in gradient-based optimal control methods. The proposed approach is applicable to complex bimanual manipulations that require precision. In this paper, we test TOCALib on the Mobile Aloha robot as an example. The approach can be extended to other bimanual robots, as well as to gait control of bipedal robots. It can also be used to construct training data for machine learning tasks for manipulation.
|
| |
| 09:00-10:30, Paper ThI1I.117 | Add to My Program |
| Healthcare Robotics for Light-Based Cosmetic Treatments |
|
| Duan, Anqing | Mohamed Bin Zayed University of Artificial Intelligence |
| Liuchen, Wanli | The Hong Kong Polytechnic University |
| Gomez, Domingo | The Hong Kong Polytechnic University |
| Muddassir, Muhammmad | The Hong Kong Polytechnic University |
| Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Service Robotics, Sensor-based Control, Deep Learning for Visual Perception
Abstract: Healthcare robotics has been gaining traction as a key area of research focused on enhancing human wellness. This paper explores the use of intelligent robots in the beauty industry, specifically within the context of photorejuvenation-based cosmetic dermatology, aimed at improving facial skin aesthetics. The beauty industry, traditionally labor-intensive, is experiencing a critical shortage of skilled beauticians, highlighting the opportunity for robotic technologies to meet this demand. However, integrating robots into cosmetic procedures presents unique challenges, particularly in tasks requiring high precision, such as laser pulse delivery and thermal dose management. This study addresses these challenges by introducing a deep learning approach for trajectory generation in laser path planning and a model-based control strategy for thermal dose regulation. Our empirical results demonstrate that the presented healthcare robots can deliver effective photorejuvenation treatments, suggesting a promising future for increased automation in cosmetic services.
|
| |
| 09:00-10:30, Paper ThI1I.118 | Add to My Program |
| DeepSkate: Reinforcement Learning of a Robust Controller for Energy Efficient Quadruped Skating |
|
| Petri, James Florin | Maynooth University |
| Lacey, Gerard | Maynooth University |
Keywords: Reinforcement Learning, Legged Robots, Sensorimotor Learning
Abstract: Wheeled-legged hybrid robots have generated growing interest in the research community due to the need for more efficient and versatile locomotion. Most recent research has focused on active wheels, but passive wheeled systems have great potential in improving energy efficiency. However, skating remains highly complex due to the difficulties of balancing dynamic motion, managing wheel-ground interactions, achieving precise torque control for smooth rolling, and adapting to unpredictable terrain while maintaining stability. We present an end-to-end model-free reinforcement learning approach that enables quadrupedal robots to skate efficiently, achieving agile and robust locomotion on both flat and rough terrain. Our skating-specific policy and sim-to-real pipeline are validated on a physical quadruped across diverse terrains with varying roughness, slopes, and features, consistently demonstrating controlled and efficient traversal. The robot achieves velocities up to 1.5 m/s with a cost of transport 40.9% lower than the skating state of the art and 70.9% lower than standard legged locomotion. These results establish skating as a feasible and efficient alternative mode of urban locomotion for quadrupedal robots, setting a foundation for future wheeled-legged research.
|
| |
| 09:00-10:30, Paper ThI1I.119 | Add to My Program |
| Event-Based Motion & Appearance Fusion for 6D Object Pose Tracking |
|
| Li, Zhichao | Istituto Italiano Di Tecnologia |
| Bartolozzi, Chiara | Istituto Italiano Di Tecnologia |
| Natale, Lorenzo | Istituto Italiano Di Tecnologia |
| Glover, Arren | Istituto Italiano Di Tecnologia |
Keywords: Visual Tracking
Abstract: Object pose tracking is a fundamental and essential capability for robots performing tasks in home and industrial settings. The most commonly used sensors for this purpose are RGB-D cameras, which can hit limitations in highly dynamic environments due to motion blur and frame-rate constraints. Event cameras have remarkable features such as high temporal resolution and low latency, which make them potentially ideal vision sensors for object pose tracking at high speed. Even so, there are still only a few works on 6D pose tracking with event cameras. In this work, we take advantage of the high temporal resolution and propose a method that fuses a propagation step with a pose correction strategy. Specifically, we use the 6D object velocity obtained from event-based optical flow for pose propagation, after which a template-based local pose correction module is applied. Our learning-free method has performance comparable to state-of-the-art algorithms, and in some cases outperforms them for fast-moving objects. The results indicate the potential for using event cameras in highly dynamic scenarios where deep network approaches are limited by low update rates.
|
| |
| 09:00-10:30, Paper ThI1I.120 | Add to My Program |
| Physics-Constrained Imitation Learning for Autonomous Racing |
|
| Yang, Haohan | Nanyang Technological University |
| Liu, Haochen | Nanyang Technological University |
| Yanxin, Zhou | Nanyang Technological University |
| Wu, Shuge | Nanyang Technological University |
| Lv, Chen | Nanyang Technological University |
Keywords: Autonomous Agents, Agent-Based Systems, AI-Based Methods
Abstract: Autonomous racing has become increasingly popular in both academia and industry as a testbed for pushing general autonomous driving modules, such as perception, planning, and control, to their limits. Although traditional control approaches can generate optimal control sequences at the edge of the racing vehicles’ physical controllability, they are highly sensitive to the accuracy of modeling parameters, such as tire model coefficients. Meanwhile, end-to-end learning methods are susceptible to distributional shifts, leading to unpredictable and irreversible failures. To address these challenges, this work introduces a physics-constrained imitation learning (PCIL) framework that effectively leverages the advantages of deep learning techniques and knowledge-driven strategies. Specifically, a fallback strategy is automatically triggered when the vehicle states exceed predefined physical constraints. Meanwhile, the data from the knowledge-driven strategy are augmented into the original dataset, and repeated re-training on the aggregated dataset progressively improves PCIL. A series of simulations and real-world shadow tests are conducted at the Yas Marina circuit, and experimental results demonstrate superior performance compared to state-of-the-art methods, suggesting that PCIL provides a promising solution for real-world autonomous racing.
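The fallback trigger in the abstract reduces, conceptually, to a runtime guard around the learned policy; the sketch below invents the state fields and limits purely for illustration:

```python
def pcil_step(state, learned_policy, fallback_policy, limits):
    """Use the learned policy while the vehicle stays inside predefined
    physical constraints; otherwise hand control to the knowledge-driven
    fallback and flag the step so its data can be aggregated for
    re-training (field names here are hypothetical)."""
    violated = (abs(state["slip_angle"]) > limits["slip"]
                or state["speed"] > limits["speed"])
    if violated:
        return fallback_policy(state), True   # log for dataset aggregation
    return learned_policy(state), False
```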
|
| |
| 09:00-10:30, Paper ThI1I.121 | Add to My Program |
| Assistant Placement Aria: A Benchmark for Egocentric Placement Assistance |
|
| Belder, Amir | Technion |
| Dias Pais, Gonçalo | Instituto Sistemas E Robótica, Lisboa |
| Vivanyi, Refael | Meta |
| DeTone, Daniel | Meta |
| Carmi, Omri | Meta |
| Gattegno, Ido, Binyamin | Meta |
| Shrout, Oren | Technion |
| Tal, Ayellet | Technion |
Keywords: Semantic Scene Understanding, Data Sets for Robot Learning, Computer Vision for Automation
Abstract: Human assistance in robotics spans several tasks, such as navigation, object manipulation, and placement, where a key challenge is selecting target destinations that align with human intentions or preferences. We focus on this challenge in the context of Virtual Placement (VP), the task of identifying all plausible target locations given scene context and human-centric constraints. This differs from traditional placement tasks that typically focus on a single, predefined target location. The VP problem is complex, as it requires both global and local reasoning about the scene's geometry, semantics, and plausibility. To address this gap, we introduce Assistant Placement Aria, the first benchmark to explore diverse aspects of VP, including global, local, and human-centric constraints. It contains both synthetic and real indoor scenes annotated for three tasks: (i) 2D Panel Placement, (ii) Sitting Suggestion, and (iii) TV Placement. Each scene includes 2D images, a 3D point cloud, and a textual description of the objects within the scene. By contributing this benchmark, we aim to encourage further research in this underexplored and challenging field that is critically dependent on relevant data. We also evaluate several foundation models for object detection and segmentation on our benchmark.
|
| |
| 09:00-10:30, Paper ThI1I.122 | Add to My Program |
| One-Shot Cross-Geometry Skill Transfer through Part Decomposition |
|
| Thompson, Rory | Brown |
| Biza, Ondrej | Robotics and AI Institute |
| Konidaris, George | Brown University |
Keywords: Transfer Learning, Learning from Demonstration, Representation Learning
Abstract: Given a demonstration, a robot should be able to generalize a skill to any object it encounters—but existing approaches to skill transfer often fail to adapt to objects with unfamiliar shapes. Motivated by examples of improved transfer from compositional modeling, we propose a method for improving transfer by decomposing objects into their constituent semantic parts. We leverage data-efficient generative shape models to accurately transfer interaction points from the parts of a demonstration object to a novel object. We autonomously construct an objective to optimize the alignment of those points on skill-relevant object parts. Our method generalizes to a wider range of object geometries than existing work, and achieves successful one-shot transfer for a range of skills and objects from a single demonstration, in both simulated and real environments.
|
| |
| 09:00-10:30, Paper ThI1I.123 | Add to My Program |
| When Planners Meet Reality: How Learned, Reactive Traffic Agents Shift nuPlan Benchmarks |
|
| Hagedorn, Steffen | Universitaet Zu Lübeck, Robert Bosch GmbH |
| Donkov, Luka | DHBW Stuttgart, Robert Bosch GmbH |
| Distelzweig, Aron | Albert-Ludwigs-Universität Freiburg |
| Condurache, Alexandru Paul | University of Luebeck, Institute for Signal Processing |
Keywords: Performance Evaluation and Benchmarking, Intelligent Transportation Systems, Software Tools for Benchmarking and Reproducibility
Abstract: Planner evaluation in closed-loop simulation often uses rule-based traffic agents, whose simplistic and passive behavior can hide planner deficiencies and bias rankings. Widely used IDM agents simply follow a lead vehicle and cannot react to vehicles in adjacent lanes, hindering tests of complex interaction capabilities. We address this issue by integrating the state-of-the-art learned traffic agent model SMART into nuPlan. Thus, we are the first to evaluate planners under more realistic conditions and quantify how conclusions shift when narrowing the sim-to-real gap. Our analysis covers 14 recent planners and established baselines and shows that IDM-based simulation overestimates planning performance: nearly all scores deteriorate. In contrast, many planners interact better than previously assumed and even improve in multi-lane, interaction-heavy scenarios like lane changes or turns. Methods trained in closed-loop demonstrate the best and most stable driving performance. However, when reaching their limits in augmented edge-case scenarios, all learned planners degrade abruptly, whereas rule-based planners maintain reasonable basic behavior. Based on our results, we suggest SMART-reactive simulation as a new standard closed-loop benchmark in nuPlan and release the SMART agents as a drop-in alternative to IDM. Code, models, and scripts will be released upon acceptance.
|
| |
| 09:00-10:30, Paper ThI1I.124 | Add to My Program |
| Learning Cooperative Strategies for Drone Swarms Using Multi-Agent Reinforcement Learning |
|
| Llanes, Christian | The Georgia Institute of Technology |
| Williams, Kyle | Sandia National Laboratories |
| Jensen, Spencer | Sandia National Laboratories |
| Coogan, Samuel | Georgia Tech |
Keywords: Swarm Robotics, Cooperating Robots, Reinforcement Learning
Abstract: In this work, we investigate cooperative strategies for an evader drone team of various sizes using multi-agent reinforcement learning in a multi-agent pursuit-evasion scenario. The objective of the evader team is to reach a goal with minimal velocity while not colliding with the pursuer team. The objective of the pursuer team is to defend the goal by catching evaders before they reach it. In this environment, we allow the pursuer to have superior control authority compared to the evader such that reaching the goal is challenging for the evader in a one-on-one scenario. The proposed strategy for an evader is to team up with an ally to lead pursuers into a collision with each other instead of intercepting the evader. We design policies using multi-agent proximal policy optimization, an actor-critic reinforcement learning method, and investigate how the learned strategy changes when we vary the size of the pursuer and evader teams. Finally, we demonstrate the learned policy's sim-to-real capabilities through a hardware demonstration.
|
| |
| 09:00-10:30, Paper ThI1I.125 | Add to My Program |
| A Generalizable Physics-Guided Causal Model for Trajectory Prediction in Autonomous Driving |
|
| Zong, Zhenyu | College of William and Mary |
| Wang, Yuchen | William & Mary |
| Lin, Haohong | Carnegie Mellon University |
| Gan, Lu | Georgia Institute of Technology |
| Shao, Huajie | William & Mary |
Keywords: Motion and Path Planning, Representation Learning
Abstract: Trajectory prediction for traffic agents is critical for safe autonomous driving. However, achieving effective zero-shot generalization in previously unseen domains remains a significant challenge. Motivated by the consistent nature of kinematics across diverse domains, we aim to incorporate domain-invariant knowledge to enhance zero-shot trajectory prediction capabilities. The key challenges include: 1) effectively extracting domain-invariant scene representations, and 2) integrating invariant features with kinematic models to enable generalized predictions. To address these challenges, we propose a novel generalizable Physics-guided Causal Model (PCM), which comprises two core components: a Disentangled Scene Encoder, which adopts intervention-based disentanglement to extract domain-invariant features from scenes, and a CausalODE Decoder, which employs a causal attention mechanism to effectively integrate kinematic models with meaningful contextual information. Extensive experiments on real-world autonomous driving datasets demonstrate our method's superior zero-shot generalization performance in unseen cities, significantly outperforming competitive baselines. The source code is released at https://github.com/ZY-Zong/Physics-guided-Causal-Model.
|
| |
| 09:00-10:30, Paper ThI1I.126 | Add to My Program |
| Unified Generation-Refinement Planning: Bridging Guided Flow Matching and Sampling-Based MPC for Social Navigation |
|
| Mizuta, Kazuki | University of Washington |
| Leung, Karen | University of Washington |
Keywords: Human-Aware Motion Planning, Machine Learning for Robot Control, AI-Enabled Robotics
Abstract: Robust robot planning in dynamic, human-centric environments remains challenging due to multimodal uncertainty, the need for real-time adaptation, and safety requirements. Optimization-based planners enable explicit constraint handling but can be sensitive to initialization and struggle in dynamic settings. Learning-based planners capture multimodal solution spaces more naturally, but often lack reliable constraint satisfaction. In this paper, we introduce a unified generation-refinement framework that combines reward-guided conditional flow matching (CFM) with model predictive path integral (MPPI) control. Our key idea is a bidirectional information exchange between generation and optimization: reward-guided CFM produces diverse, informed trajectory priors for MPPI refinement, while the optimized MPPI trajectory warm-starts the next CFM generation step. Using autonomous social navigation as a motivating application, we demonstrate that the proposed approach improves the trade-off between safety, task performance, and computation time, while adapting to dynamic environments in real-time. The source code is publicly available at https://cfm-mppi.github.io.
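The bidirectional exchange can be condensed to one cycle of generation and refinement; the sketch assumes two callables (cfm_sample, mppi_refine) and shows only the information flow, not the published implementation:

```python
def generation_refinement_cycle(obs, prev_best, cfm_sample, mppi_refine):
    """One planning cycle: reward-guided flow matching proposes diverse
    trajectory priors (warm-started by the previous cycle's solution),
    and sampling-based MPC refines them under explicit constraints."""
    priors = cfm_sample(obs, warm_start=prev_best)
    best = mppi_refine(obs, priors)
    return best   # also warm-starts the next cycle's generation
```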
|
| |
| 09:00-10:30, Paper ThI1I.127 | Add to My Program |
| Compose by Focus: Scene Graph-Based Atomic Skills |
|
| Qi, Han | Harvard University |
| Chen, Changhe | University of Michigan |
| Yang, Heng | Harvard University |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Learning from Demonstration
Abstract: A key requirement for generalist robots is compositional generalization—the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine “focused” scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.
|
| |
| 09:00-10:30, Paper ThI1I.128 | Add to My Program |
| Connectivity-Aware Representations for Constrained Motion Planning Via Multi-Scale Contrastive Learning |
|
| Jeon, Suhyun | Seoul National University |
| Lim, Yumin | Seoul National University |
| Baek, Woo-Jeong | Seoul National University, Karlsruhe Institute of Technology |
| Kim, Hyeonseo | Seoul National University |
| Park, Suhan | Kwangwoon University |
| Park, Jaeheung | Seoul National University |
Keywords: Constrained Motion Planning, Bimanual Manipulation, Representation Learning
Abstract: The objective of constrained motion planning is to connect start and goal configurations while satisfying task-specific constraints. Motion planning becomes inefficient or infeasible when the configurations lie in disconnected regions, known as essentially mutually disconnected components (EMDs). Constraints further restrict the feasible space to a lower-dimensional submanifold, while redundancy introduces additional complexity because a single end-effector pose admits infinitely many inverse kinematic solutions that may form discrete self-motion manifolds. This paper addresses these challenges by learning a connectivity-aware representation for selecting start and goal configurations prior to planning. Joint configurations are embedded into a latent space through multi-scale manifold learning across neighborhood ranges from local to global, and clustering generates pseudo-labels that supervise a contrastive learning framework. The proposed framework provides a connectivity-aware measure that biases the selection of start and goal configurations toward connected regions, avoiding EMDs and yielding higher success rates with reduced planning time. Experiments on various manipulation tasks showed that our method achieves 1.9 times higher success rates and reduces planning time to 0.43 times that of the baselines.
|
| |
| 09:00-10:30, Paper ThI1I.129 | Add to My Program |
| Refinery: Active Fine-Tuning and Deployment-Time Optimization for Contact-Rich Policies |
|
| Tang, Bingjie | University of Southern California |
| Akinola, Iretiayo | NVIDIA |
| Xu, Jie | NVIDIA |
| Wen, Bowen | NVIDIA |
| Fox, Dieter | University of Washington |
| Sukhatme, Gaurav | University of Southern California |
| Ramos, Fabio | University of Sydney, NVIDIA |
| Gupta, Abhishek | University of Washington |
| Narang, Yashraj | NVIDIA |
Keywords: Assembly, Reinforcement Learning, Continual Learning
Abstract: Simulation-based learning has enabled policies for precise, contact-rich tasks (e.g., robotic assembly) to reach high success rates (~80%) under high levels of observation noise and control error. Although such performance may be sufficient for research applications, it falls short of industry standards and makes policy chaining exceptionally brittle. A key limitation is the high variance in individual policy performance across diverse initial conditions. We introduce Refinery, an effective framework that bridges this performance gap, robustifying policy performance across initial conditions. We propose Bayesian Optimization-guided fine-tuning to improve individual policies, and Gaussian Mixture Model-based sampling during deployment to select initializations that maximize execution success. Using Refinery, we improve mean success rates by 10.98% over state-of-the-art methods in simulation-based learning for robotic assembly, reaching 91.51% in simulation and comparable performance in the real world. Furthermore, we demonstrate that these fine-tuned policies can be chained to accomplish long-horizon, multi-part assembly—successfully assembling up to 8 parts without requiring explicit multi-step training.
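The deployment-time idea (fit a density model to initial conditions that historically succeeded, then pick candidates the model scores highly) can be tried in a few lines with scikit-learn; the data below are random stand-ins, not results from the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
successful_inits = rng.normal(size=(200, 3))   # logged successes (stand-in)
gmm = GaussianMixture(n_components=4, random_state=0).fit(successful_inits)

# score feasible candidate initializations and start from the likeliest one
candidates = rng.normal(size=(50, 3))
best = candidates[np.argmax(gmm.score_samples(candidates))]
```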
|
| |
| 09:00-10:30, Paper ThI1I.130 | Add to My Program |
| A New Repetitive Control Framework for Robot Manipulators: Optimal Controller Design and Stability Analysis |
|
| Song, Geun Il | Postech |
| Kwak, Dohyeok | Pohang University of Science and Technology (POSTECH) |
| Kim, Taewan | Pohang University of Science and Technology (POSTECH) |
| Kang, Oe Ryung | Pohang University of Science and Technology |
| Kim, Jung Hoon | Pohang University of Science and Technology |
| Kwon, Wookyong | ETRI |
Keywords: Optimization and Optimal Control, Robust/Adaptive Control, Motion Control
Abstract: This paper provides a new repetitive control framework for robot manipulators with periodic reference and disturbance signals. We first take the inverse dynamics (ID) approach to a robot manipulator to transform its nonlinear input/output behavior into an equivalent linear time-invariant (LTI) system, for which the conventional repetitive control strategy is employed. To facilitate an optimal controller synthesis and an associated stability analysis, we next derive the so-called delay-feedback system. We then provide two linear matrix inequality (LMI)-based optimal controller synthesis procedures for minimizing the H_infty and the generalized H_2 norms from the disturbance to the tracking error, respectively. We next establish operator-theoretic stability assertions in terms of the monodromy operator. In particular, a necessary and sufficient condition for the exponential stability of the delay-feedback system is derived for the case without external disturbances, and we show that the delay-feedback system is input-to-state stable if it is exponentially stable. Finally, experimental comparisons are given to demonstrate the overall developed arguments.
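For readers new to repetitive control: the core internal-model idea (independent of this paper's LMI-based H_infty synthesis) is to reuse the previous period's command plus a gain times the periodic tracking error, as in the textbook-style toy below:

```python
import numpy as np

def repetitive_update(u_prev, error, gain=0.5):
    """Plug-in repetitive law: next period's command equals the last
    period's command plus gain * tracking error over one period."""
    return u_prev + gain * error

# toy: cancel a periodic disturbance d over successive periods
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
d = 0.3 * np.sin(t)
u = np.zeros_like(t)
for _ in range(20):
    u = repetitive_update(u, d - u)   # error halves each period
print(float(np.abs(d - u).max()))     # shrinks toward 0 (~3e-7)
```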
|
| |
| 09:00-10:30, Paper ThI1I.131 | Add to My Program |
| Real-Time Modeling of Environmental Forces During Pushing in Granular Media Using S-RFT |
|
| Liu, Jiaxin | Ritsumeikan University |
| Tian, Yang | Shinshu University |
| Li, Longchuan | Beijing University of Chemical Technology |
| Ma, Shugen | Hong Kong University of Science and Technology (Guangzhou) |
| Wang, Zhongkui | Ritsumeikan University |
Keywords: Wheeled Robots, Field Robots, Dynamics
Abstract: Modeling environmental forces remains a critical challenge in the design and control of robots operating on granular terrain. In pushing locomotion, propulsion is generated by displacing a large number of particles; however, the resulting terrain deformation complicates accurate real-time force prediction. Most existing resistive force models do not explicitly account for these deformation effects. To address this limitation, we develop a force model that incorporates motion-induced terrain deformation for pushing motions in granular media. A wheel lug is adopted as a representative element. We first investigate translational motion using the discrete element method (DEM) to characterize terrain deformation under different velocity directions. The analysis identifies dominant deformation patterns, which are embedded in the force model. Building on this analysis, we examine the rotational motion of a single lug through experiments, DEM simulations, and model predictions. The results demonstrate that the proposed model accurately captures force responses across varying velocity directions, exhibiting closer agreement with DEM and experiments than conventional approaches. This work advances real-time force modeling for robot-granular terrain interactions and highlights the potential of deformation-integrated models in deformable environments.
|
| |
| 09:00-10:30, Paper ThI1I.132 | Add to My Program |
| M3CAD: Towards Generic Cooperative Autonomous Driving Benchmark |
|
| Zhu, Morui | University of North Texas |
| Zhu, Yongqi | University of North Texas |
| Zhu, Yihao | University of North Texas |
| Chen, Qi | Toyota Motor North America, InfoTech Labs |
| Qu, Deyuan | Toyota Motor North America, InfoTech Labs |
| Fu, Song | University of North Texas |
| Yang, Qing | University of North Texas |
Keywords: Intelligent Transportation Systems, Cooperating Robots, Data Sets for Robot Learning
Abstract: We introduce M3CAD, a comprehensive benchmark designed to advance research in generic cooperative autonomous driving. M3CAD comprises 204 sequences with 30,000 frames. Each sequence includes data from multiple vehicles and different types of sensors, e.g., LiDAR point clouds, RGB images, and GPS/IMU, supporting a variety of autonomous driving tasks, including object detection and tracking, mapping, motion forecasting, occupancy prediction, and path planning. This rich multimodal setup enables M3CAD to support both single-vehicle and multi-vehicle cooperative autonomous driving research. To the best of our knowledge, M3CAD is the most complete benchmark specifically designed for cooperative, multi-task autonomous driving research. To test its effectiveness, we use M3CAD to evaluate both state-of-the-art single-vehicle and cooperative driving solutions, setting baseline performance results. Since most existing cooperative perception methods focus on merging features but often ignore network bandwidth requirements, we propose a new multi-level fusion approach which adaptively balances communication efficiency and perception accuracy based on the current network conditions. We release M3CAD, along with the baseline models and evaluation results, to support the development of robust cooperative autonomous driving systems. All resources will be made publicly available on our project webpage.
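The bandwidth-adaptive fusion can be pictured as a simple level selector; the thresholds and level names below are invented for illustration and are not the paper's policy:

```python
def choose_fusion_level(bandwidth_mbps: float) -> str:
    """Pick what to transmit between vehicles given current bandwidth:
    raw points need the most capacity, object-level results the least."""
    if bandwidth_mbps > 100:
        return "early"         # raw LiDAR point clouds
    if bandwidth_mbps > 10:
        return "intermediate"  # compressed BEV feature maps
    return "late"              # detected objects only

print(choose_fusion_level(25.0))  # -> "intermediate"
```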
|
| |
| 09:00-10:30, Paper ThI1I.133 | Add to My Program |
| Transformer-Based Hierarchical Reinforcement Learning for Sequential Decision-Making in Swarm Confrontation |
|
| Sun, Ruozhai | Beijing Institute of Technology |
| Wu, Qizhen | Beihang University |
| Chen, Lei | Beijing Institute of Technology |
Keywords: Reinforcement Learning, Multi-Robot Systems, Task and Motion Planning
Abstract: Hierarchical Reinforcement Learning (HRL) is a potent paradigm for addressing long-horizon sequential decision-making in swarm confrontation. However, its strategic capabilities are often bottlenecked by high-level policies that struggle to reason over the dynamic, variable-sized observations of other agents. To address this, we introduce a novel decentralized HRL framework featuring a Transformer-based strategic policy. The Transformer's self-attention mechanism is uniquely suited to capture complex spatio-temporal relationships among a varying number of entities, enabling robust long-horizon task allocation. This high-level strategy is then translated by a low-level policy into collision-free navigation. In complex swarm confrontation scenarios, our method significantly outperforms established baselines, achieving win rates of up to 81%. Beyond this performance, the learned policies exhibit strong zero-shot generalization to larger swarms, offer decision-making interpretability via the attention mechanism, and foster the autonomous emergence of complex cooperative tactics. This work provides a blueprint for scalable, strategically sophisticated, and interpretable multi-agent systems.
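The variable-sized observation is handled the usual Transformer way: pad entity tokens to a fixed length and mask the padding. A minimal PyTorch sketch (dimensions assumed) is shown below:

```python
import torch
import torch.nn as nn

# Self-attention over a variable number of observed agents: pad every
# scene to the same token count and mask the padding so the strategic
# policy is unaffected by how many entities are actually present.
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
tokens = torch.randn(2, 6, 32)                  # 2 scenes, up to 6 agents
pad = torch.tensor([[False] * 4 + [True] * 2,   # scene 1: 4 real agents
                    [False] * 6])               # scene 2: 6 real agents
out, _ = attn(tokens, tokens, tokens, key_padding_mask=pad)
```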
|
| |
| 09:00-10:30, Paper ThI1I.134 | Add to My Program |
| NovaFlow: Zero-Shot Manipulation Via Actionable Flow from Generated Videos |
|
| Li, Hongyu | Brown University |
| Sun, Lingfeng | Robotics and AI Institute |
| Hu, Yafei | Robotics and AI Institute |
| Ta, Duy | Robotics and AI Institute |
| Barry, Jennifer | Robotics and AI Institute |
| Konidaris, George | Brown University |
| Fu, Jiahui | Robotics and AI Institute |
Keywords: Deep Learning Methods, Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training.
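Turning 3D object flow into a relative rigid pose is classically done with a least-squares (Kabsch) alignment between tracked points and their flowed positions; a standard sketch follows, which is not necessarily the exact solver used in the paper:

```python
import numpy as np

def rigid_from_flow(src, dst):
    """Least-squares rotation R and translation t with dst ~ R @ src + t,
    for corresponding (N, 3) point sets (Kabsch/Umeyama, no scaling)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```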
|
| |
| 09:00-10:30, Paper ThI1I.135 | Add to My Program |
| Inchworm-Inspired Adaptive Multimodal Neural Control for an Autonomous Inspection Robot |
|
| Ausrivong, Wasuthorn | VISTEC : Vidyasirimedhi Institute of Science and Technology |
| Srisuchinnawong, Arthicha | University of Southern Denmark and Vidyasirimedhi Institute of Science and Technology |
| Manoonpong, Poramate | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
|
|
| |
| 09:00-10:30, Paper ThI1I.136 | Add to My Program |
| MIMIC-D: Multi-Modal Imitation for MultI-Agent Coordination with Decentralized Diffusion Policies |
|
| Dong, Dayi, E | University of California Berkeley |
| Bhatt, Maulik | University of California, Berkeley |
| Choi, Seoyeon | University of California, Berkeley |
| Mehr, Negar | University of California Berkeley |
Keywords: Multi-Robot Systems, Imitation Learning, Cooperating Robots
Abstract: As robots become more integrated in society, their ability to coordinate with other robots and humans on multi-modal tasks (those with multiple valid solutions) is crucial. Such behaviors can be learned from expert demonstrations via imitation learning (IL), but when expert demonstrations are multi-modal, standard IL approaches usually average across modes or collapse to a single mode, preventing effective coordination. Inspired by diffusion models’ ability to capture complex multi-modal trajectory distributions in single-agent settings, we develop a diffusion-based framework for coordinated multi-modal behavior in multi-agent systems. However, existing multi-agent diffusion approaches typically require a centralized planner or explicit communication among agents. This assumption can fail in real-world scenarios where robots must operate independently or with agents like humans that they cannot directly communicate with. Therefore, we propose MIMIC-D, a Centralized Training, Decentralized Execution paradigm for multi-modal multi-agent IL via diffusion. We jointly train all agents' policies with full information, but execute using only local information to achieve implicit coordination. In simulation and hardware experiments, our method exhibits robust multi-modal coordination behavior in various tasks and environments, improving upon state-of-the-art baselines.
|
| |
| 09:00-10:30, Paper ThI1I.137 | Add to My Program |
| SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding |
|
| Ye, Sheng | Tsinghua University |
| Dong, Zhen-Hui | Tsinghua University |
| Fan, Ruoyu | Tsinghua University |
| Lv, Tian | Tsinghua University |
| Liu, Yong-Jin | Tsinghua University |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.
|
| |
| 09:00-10:30, Paper ThI1I.138 | Add to My Program |
| P3T: Prototypical Point-Level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models |
|
| Jung, Geunyoung | University of Seoul |
| Kim, Soohong | University of Seoul |
| Song, Kyungwoo | Yonsei University |
| Jung, Jiyoung | University of Seoul |
Keywords: Deep Learning for Visual Perception, Recognition, Visual Learning
Abstract: With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P3T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P3T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which injects learnable prompts into the input text in place of hand-crafted ones. Since both prompters operate directly on input data, P3T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.
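The prototypical loss mentioned above can be pictured as pulling each embedding toward the mean embedding (prototype) of its category, shrinking intra-category variance. A minimal PyTorch sketch with hypothetical tensor shapes; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def prototypical_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pull each embedding toward its class prototype (the class mean),
    reducing intra-category variance in the shared embedding space."""
    embeddings = F.normalize(embeddings, dim=-1)
    classes = labels.unique()
    loss = embeddings.new_zeros(())
    for c in classes:
        members = embeddings[labels == c]                # all samples of class c
        prototype = members.mean(dim=0, keepdim=True)    # class prototype
        loss = loss + (1.0 - F.cosine_similarity(members, prototype)).mean()
    return loss / classes.numel()
```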
|
| |
| 09:00-10:30, Paper ThI1I.139 | Add to My Program |
| Fast Monocular Depth Estimation for Underwater Robotics Leveraging Attenuation Differences As Supplementary Information |
|
| Wang, Hao | Huazhong University of Science and Technology |
| Lu, Liang | Huazhong University of Science and Technology |
| Dong, Yan | Huazhong University of Science and Technology |
| Han, Bin | Huazhong University of Science and Technology |
Keywords: Computer Vision for Automation, RGB-D Perception, Deep Learning Methods
Abstract: Underwater and in-air environments exhibit distinct imaging characteristics, which should be carefully considered and effectively exploited for accurate depth estimation. In this work, we analyze the effectiveness of wavelength-dependent attenuation for underwater depth estimation and show that it is helpful but insufficient to perform depth estimation independently. Therefore, we propose a fast underwater monocular depth estimation network that incorporates underwater light absorption difference (ULAD) as supplementary information. Compared with methods that rely solely on RGB input, the proposed approach provides more accurate depth predictions. In our network, RGB and ULAD features are extracted by MobileNetV4 and fused using FusionMamba, followed by decoding and refinement with a micro Vision Transformer. The network is trained on the USOD10K dataset and evaluated on both its test set and the FLSea dataset. Experimental results demonstrate that our method achieves more accurate depth estimation and higher efficiency compared with other lightweight networks. Furthermore, compared with existing state-of-the-art fast underwater depth estimation methods, our network further reduces the number of parameters by 10% and improves inference speed by 43%.
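The wavelength-dependent attenuation cue can be illustrated with a simple per-pixel computation: red light attenuates much faster underwater than green or blue, so channel log-ratios loosely correlate with distance. A hypothetical sketch of such a ULAD input; the paper's actual feature definition is not specified in the abstract:

```python
import numpy as np

def ulad_channels(rgb: np.ndarray) -> np.ndarray:
    """Underwater light absorption difference (ULAD) cue, sketched: per-pixel
    log-ratios between slowly and quickly attenuated channels supplement RGB."""
    rgb = rgb.astype(np.float32) / 255.0 + 1e-6      # avoid log(0)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([np.log(g) - np.log(r),          # green-red attenuation gap
                     np.log(b) - np.log(r)],         # blue-red attenuation gap
                    axis=-1)                          # (H, W, 2) supplementary input
```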
|
| |
| 09:00-10:30, Paper ThI1I.140 | Add to My Program |
| TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation |
|
| Liu, Jiaxing | Beijing University of Technology |
| Zhang, Zexi | Imperial College London |
| Li, Xiaoyan | Chinese Academy of Sciences |
| Wang, Boyue | Beijing University of Technology |
| Hu, Yongli | Beijing University of Technology |
| Yin, Baocai | Beijing University of Technology |
Keywords: Vision-Based Navigation, Autonomous Vehicle Navigation, Deep Learning for Visual Perception
Abstract: Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM’s self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code can be found on our project page: https://apex-bjut.github.io/Taga-VLM/.
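The abstract describes STAR-Att as injecting topological edge information directly into the VLM's self-attention. One plausible reading, sketched below with hypothetical shapes, adds a learned scalar bias per token pair, derived from graph edge features, to the attention logits before the softmax; a zero-initialized gate would leave pretrained behavior untouched at the start of training:

```python
import torch
import torch.nn.functional as F

def star_attention(q, k, v, edge_feat, w_edge, gate):
    """Self-attention with a residual topology bias (a sketch).
    q, k, v: (B, N, d) token projections; edge_feat: (B, N, N, E) graph edge
    features; w_edge: (E, 1) projection; gate: learnable scalar (init 0)."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5      # standard attention logits
    topo_bias = (edge_feat @ w_edge).squeeze(-1)     # (B, N, N) scalar per pair
    logits = logits + gate * topo_bias               # residual topology injection
    return F.softmax(logits, dim=-1) @ v
```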
|
| |
| 09:00-10:30, Paper ThI1I.141 | Add to My Program |
| OVerSeeC: Open-Vocabulary Costmap Generation from Satellite Images and Natural Language |
|
| Rana, Rwik | University of Texas at Austin |
| Quattrociocchi, Jesse | University of Texas at Austin, U.S. Army Research Laboratory |
| Lee, Dongmyeong | University of Texas at Austin |
| Ellis, Christian | University of Texas at Austin, U.S. Army Research Laboratory |
| Adkins, Amanda | University of Texas at Austin |
| Uccello, Adam | The University of Texas at Austin, U.S. Army Research Laboratory |
| Warnell, Garrett | U.S. Army Research Laboratory |
| Biswas, Joydeep | The University of Texas at Austin |
Keywords: Motion and Path Planning, Semantic Scene Understanding, Field Robots
Abstract: Aerial imagery provides essential global context for autonomous navigation, enabling route planning at scales inaccessible to onboard sensing. We address the problem of generating global costmaps for long-range planning directly from satellite imagery when entities and mission-specific traversal rules are expressed in natural language at test time. This setting is challenging since mission requirements vary, terrain entities may be unknown at deployment, and user prompts often encode compositional traversal logic. Existing approaches relying on fixed ontologies and static cost mappings cannot accommodate such flexibility. While foundation models excel at language interpretation and open-vocabulary perception, no single model can simultaneously parse nuanced mission directives, locate arbitrary entities in large-scale imagery, and synthesize them into an executable cost function for planners. We therefore propose OVerSeeC, a zero-shot modular framework that decomposes the problem into Interpret–Locate–Synthesize: (i) an LLM extracts entities and ranked preferences, (ii) an open-vocabulary segmentation pipeline identifies these entities from high-resolution imagery, and (iii) the LLM uses the user's natural language preferences and the masks to synthesize executable costmap code. Empirically, OVerSeeC handles novel entities, respects ranked and compositional preferences, and produces routes consistent with human-drawn trajectories across diverse regions, demonstrating robustness to distribution shifts. This shows that modular composition of foundation models enables open-vocabulary, preference-aligned costmap generation for scalable, mission-adaptive global planning.
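The Interpret–Locate–Synthesize decomposition can be sketched as three sequential calls. All interface names below (extract_entities_and_preferences, segment, synthesize_costmap_code, build_costmap) are hypothetical stand-ins for the paper's actual components:

```python
def oversee_costmap(satellite_image, mission_prompt, llm, segmenter):
    """Interpret-Locate-Synthesize, sketched with hypothetical interfaces.
    1. Interpret: the LLM extracts entities and ranked traversal preferences.
    2. Locate: open-vocabulary segmentation masks each entity in the image.
    3. Synthesize: the LLM emits executable code mapping masks to costs."""
    spec = llm.extract_entities_and_preferences(mission_prompt)      # hypothetical
    masks = {e: segmenter.segment(satellite_image, prompt=e)
             for e in spec.entities}                                 # hypothetical
    cost_fn_src = llm.synthesize_costmap_code(spec, list(masks))     # hypothetical
    namespace = {}
    exec(cost_fn_src, namespace)                  # run the generated cost function
    return namespace["build_costmap"](masks)      # hypothetical entry point
```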
|
| |
| 09:00-10:30, Paper ThI1I.142 | Add to My Program |
| CoDex: Learning Compositional Dexterous Functional Manipulation without Demonstrations |
|
| Jiang, Bowen | University of Texas at Austin |
| Reger, William | University of Texas at Austin |
| Martín-Martín, Roberto | University of Texas at Austin |
Keywords: Dexterous Manipulation, AI-Enabled Robotics, Reinforcement Learning
Abstract: In this work, we study Compositional Dexterous Functional Object Manipulation (CD-FOM): tasks such as aiming and actuating a spray bottle on a plant or a glue gun on wood, which require both actuating an object's internal mechanism and controlling its pose to apply the object's function to the environment. These tasks pose significant challenges for robots due to the demanding integration of semantic understanding—of the object's function, actuation mode, and application area—with intricate physical dexterity—to manage grasp stability, movement trajectory, and actuation. We introduce CoDex, a zero-demonstration framework that autonomously discovers CD-FOM manipulation strategies. CoDex uses vision–language models (VLMs) to infer semantic constraints from the task and scene. These constraints guide analytic constrained optimization to generate a short list of functional grasp candidates that can be efficiently refined with reinforcement learning to generate full grasp–move–actuate policies transferrable from simulation to the real world. We evaluate CoDex on a 7-DoF robot arm with a 16-DoF multi-fingered hand across six CD-FOM tasks involving previously unseen objects with internal mechanisms (spray bottles, hot glue guns, air dusters, flashlights, pepper grinders) and their application to unseen target objects, showcasing its ability to autonomously discover and execute complex, physically viable dexterous behaviors without human demonstrations. More information at https://robin-lab.cs.utexas.edu/CoDex/.
|
| |
| 09:00-10:30, Paper ThI1I.143 | Add to My Program |
| V2V-GoT: Vehicle-To-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models and Graph-Of-Thoughts |
|
| Chiu, Hsu-kuang | NVIDIA, Carnegie Mellon University |
| Hachiuma, Ryo | NVIDIA |
| Wang, Chien-Yi | NVIDIA |
| Wang, Yu-Chiang Frank | NVIDIA |
| Chen, Min-Hung | NVIDIA |
| Smith, Stephen F. | Carnegie Mellon University |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems, Deep Learning for Visual Perception
Abstract: Current state-of-the-art autonomous vehicles can face safety-critical situations when their local sensors are occluded by large objects on the road nearby. Vehicle-to-vehicle (V2V) cooperative autonomous driving is proposed to address this problem. More recent work further adopts a new approach that applies Multimodal Large Language Models (MLLMs) for cooperative autonomous driving due to their potential multimodal understanding and reasoning abilities. However, graph-of-thoughts reasoning frameworks have not been considered in prior research on V2V cooperative autonomous driving. In this paper, we propose a novel graph-of-thoughts framework specifically designed for MLLM-based cooperative autonomous driving. Our graph-of-thoughts includes our proposed novel ideas of occlusion-aware perception and planning-aware prediction. We curate the V2V-GoT-QA dataset and develop the V2V-GoT model for training and testing the cooperative driving graph-of-thoughts. Our experimental results show that our proposed method outperforms other baselines in cooperative perception, prediction, and planning tasks. Our code and dataset are released to facilitate open-source research at https://eddyhkchiu.github.io/v2vgot.github.io/.
|
| |
| 09:00-10:30, Paper ThI1I.144 | Add to My Program |
| TrajBooster: Boosting Humanoid Whole-Body Manipulation Via Trajectory-Centric Learning |
|
| Liu, Jiacheng | Zhejiang University |
| Ding, Pengxiang | Westlake University |
| Zhou, Qihang | Shanghai Jiao Tong University |
| Wu, Yuxuan | Shanghai Jiao Tong University |
| Huang, Da | Shanghai Jiao Tong University |
| Peng, Zimian | Zhejiang University |
| Xiao, Wei | Westlake University |
| Zhang, Weinan | Shanghai Jiao Tong University |
| Yang, Lixin | Shanghai Jiao Tong University |
| Lu, Cewu | Shanghai Jiao Tong University |
| Wang, Donglin | Westlake University |
Keywords: Deep Learning Methods, Whole-Body Motion Planning and Control, Dual Arm Manipulation
Abstract: Recent Vision-Language-Action (VLA) models show potential to generalize across embodiments but struggle to quickly align with a new robot’s action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to Unitree G1 with a whole-body controller trained via a heuristic-enhanced harmonized online DAgger to lift low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection on the target humanoid domain. Deployed on Unitree G1, our policy achieves beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action space understanding and zero-shot skill transfer capabilities.
|
| |
| 09:00-10:30, Paper ThI1I.145 | Add to My Program |
| MoE-Powered Fast VLMs Via Curriculum Learning-Based Knowledge Distillation: Taming Regular and Corner Cases in Autonomous Driving |
|
| Zhao, Xue | SJTU |
| Fang, Zhou | SHU |
Keywords: Intelligent Transportation Systems, AI-Based Methods, Deep Learning for Visual Perception
Abstract: Autonomous driving has advanced significantly with the integration of large Vision-Language Models (VLMs), which excel in understanding and analyzing driving data. However, existing VLMs face challenges, particularly in terms of latency, which is crucial for real-time driving tasks. While shrinking the model size can reduce latency, it also limits the model's ability to handle both regular and corner cases effectively. To address this challenge, we propose the Curriculum Learning-based Knowledge Distillation (CLKD) framework. CLKD enhances student model performance through three key innovations: (1) integration of a Mixture-of-Experts (MoE) architecture to preserve model expressiveness; (2) Hardness-explored at Two Granularities (H2G), which dynamically identifies easy and difficult samples at both instance and feature levels; and (3) Progressive Release Distillation strategy that gradually reduces reliance on the teacher model, thereby fostering the student’s autonomy and improving its generalization capability in complex driving scenarios. In real-world data experiments, CLKD has achieved a twofold increase in speed compared to existing approaches while maintaining comparable performance.
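The Progressive Release Distillation idea, gradually reducing reliance on the teacher, can be sketched as a distillation objective whose teacher weight decays over training. A minimal sketch assuming a standard KL-based distillation term; the paper's schedule and loss composition may differ:

```python
import torch.nn.functional as F

def clkd_loss(student_logits, teacher_logits, labels, step, total_steps, T=4.0):
    """Progressive-release distillation (a sketch): the KD weight decays over
    training, gradually releasing the student from the teacher's supervision."""
    alpha = max(0.0, 1.0 - step / total_steps)       # teacher reliance decays to 0
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T     # temperature-scaled KD term
    ce = F.cross_entropy(student_logits, labels)     # task supervision
    return alpha * kd + (1.0 - alpha) * ce
```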
|
| |
| 09:00-10:30, Paper ThI1I.146 | Add to My Program |
| A Novel Tilting Mechanism for Personal Mobility Robot Platform: Mathematical Modeling and HILS-Based Control Verification |
|
| Lee, Sunyeop | Yeungnam University |
| Yang, Wonchang | DGIST |
| Lim, Yongseob | DGIST |
| Nam, Kanghyun | DGIST |
Keywords: Mechanism Design, Motion Control, Robust/Adaptive Control
Abstract: This paper introduces a suspension-integrated tilting mechanism for narrow-track mobility robot platforms, offering a novel means of achieving commanded body tilt without departing from conventional suspension layouts. The architecture provides a two-degree-of-freedom roll path with inherent passive self-centering, thereby reconciling mechanical simplicity with enhanced motion capability. A linearized dynamic model forms the foundation for a dual-layer control scheme, consisting of a reference generator and a high-bandwidth tracking loop. The proposed approach is rigorously evaluated through hardware-in-the-loop simulations that combine a real-time driving simulator with a physical hardware bench. Relative to a non-tilting baseline, the platform demonstrates substantial reductions in perceived lateral acceleration and lateral load transfer, signifying improved ride comfort and rollover stability. These findings establish suspension-integrated tilting with model-based control as a compelling pathway toward safe and stable next-generation mobility robot platforms.
|
| |
| 09:00-10:30, Paper ThI1I.147 | Add to My Program |
| PeRoI: A Pedestrian-Robot Interaction Dataset for Learning Avoidance, Neutrality, and Attraction Behaviors in Social Navigation |
|
| Agrawal, Subham | University of Bonn |
| Ostermann-Myrau, Nico | University of Bonn |
| Dengler, Nils | University of Bonn |
| Bennewitz, Maren | University of Bonn |
Keywords: Datasets for Human Motion
Abstract: Robots are increasingly being deployed in public spaces such as shopping malls, sidewalks, and hospitals, where safe and socially aware navigation depends on anticipating how pedestrians respond to their presence. However, existing datasets rarely capture the full spectrum of robot-induced reactions, e.g., avoidance, neutrality, attraction, which limits progress in modeling these interactions. In this paper, we present the Pedestrian-Robot Interaction (PeRoI) dataset that captures pedestrian motions categorized into attraction, neutrality, and repulsion across two outdoor sites under three controlled conditions: no robot present, with stationary robot, and with moving robot. This design explicitly reveals how pedestrian behavior varies across robot contexts, and we provide qualitative and quantitative comparisons to established state-of-the-art datasets. Building on these data, we propose the Neural Robot Social Force Model (NeuRoSFM), an extension of the Social Force Model that integrates neural networks to augment inter-human dynamics with learned components and explicit robot-induced forces to better predict pedestrian motion in the vicinity of robots. We evaluate NeuRoSFM by generating trajectories on multiple real-world datasets. The results demonstrate improved modeling of pedestrian-robot interactions, leading to better prediction accuracy, and highlight the value of our dataset and method for advancing socially aware navigation strategies in human-centered environments.
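NeuRoSFM is described as augmenting the Social Force Model with learned inter-human dynamics and an explicit robot-induced force term. A minimal PyTorch sketch of that structure, with hypothetical feature dimensions; the paper's exact parameterization may differ:

```python
import torch
import torch.nn as nn

class NeuRoSFMSketch(nn.Module):
    """Social Force Model with a learned inter-human residual and an explicit
    robot-induced force (a structural sketch, not the authors' architecture)."""
    def __init__(self, feat_dim: int = 6, hidden: int = 64):
        super().__init__()
        self.social = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 2))
        self.robot = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2))

    def forward(self, goal_force, neighbor_feats, robot_rel):
        # goal_force: (B, 2) classical attraction toward the pedestrian's goal
        # neighbor_feats: (B, K, feat_dim) relative states of nearby pedestrians
        # robot_rel: (B, 4) relative position/velocity of the robot
        f_social = self.social(neighbor_feats).sum(dim=1)  # learned inter-human term
        f_robot = self.robot(robot_rel)                    # explicit robot influence
        return goal_force + f_social + f_robot             # total force -> acceleration
```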
|
| |
| 09:00-10:30, Paper ThI1I.148 | Add to My Program |
| Transfer Your Safety: Learning Transferable Model-Free Safety Filters from a Single Policy to Enhance Safety across Diverse Tasks |
|
| Xie, Junjun | Harbin Institute of Technology, Shenzhen |
| Li, Siru | Harbin Institute of Technology, Shenzhen |
| Zhao, Shuhao | School of Mechanical Engineering and Automation Harbin Institute of Technology at Shenzhen Shenzhen, China |
| Xie, Xiaochen | Harbin Institute of Technology, Shenzhen |
| Hu, Liang | Harbin Institute of Technology, Shenzhen |
|
|
| |
| 09:00-10:30, Paper ThI1I.149 | Add to My Program |
| Learning-Based Torque Estimation for Harmonic Drive Actuators |
|
| Huang, Chun-Hung | National Taiwan University |
| Chen, Chun Wei | National Cheng Kung University |
| Lan, Chao-Chieh | National Taiwan University |
Keywords: Compliant Joints and Mechanisms, Force Control, Mechanism Design
Abstract: Accurate torque estimation in robotic actuators with harmonic drives is challenging due to nonlinear hysteresis and efficiency losses, often necessitating external torque sensors. This paper presents a learning-based torque estimation method that leverages encoder-derived features and mechanical compliance to enhance estimation accuracy without additional sensors. An actuator design incorporating a compliant helical tube provides deformation features that are effectively modeled using a Long Short-Term Memory (LSTM) network. Unlike conventional calibration or parametric approaches, the proposed framework captures nonlinear, history-dependent behaviors across varying operating conditions. Experimental evaluations demonstrate that compliant tubes significantly improve estimation accuracy compared with designs using stiffer or even rigid tubes, enabling more robust generalization under different torques, impedance modes, and stiffness levels. These results highlight the importance of co-designing actuator compliance and deep learning models to achieve reliable and compact torque estimation for harmonic drive actuators.
|
| |
| 09:00-10:30, Paper ThI1I.150 | Add to My Program |
| SlipSense: Multimodal Sensing for Online Slip Detection in Legged Robots |
|
| Liu, Iris Szu-Yao | Nanyang Technological University |
| Cheah, Chien Chern | Nanyang Technological University |
| Chuah, Meng Yee (Michael) | Agency for Science, Technology and Research (A*STAR) |
Keywords: Legged Robots, Force and Tactile Sensing
Abstract: Legged robots rely on accurate ground interaction awareness to traverse variable terrains, such as slippery surfaces. Existing slip detection methods often rely on kinematics and proprioception, which lack the sensitivity to detect early-stage slips that occur prior to catastrophic instability. Thus, this paper presents SlipSense, a novel framework for online force-based slip detection using a custom lightweight sensorized foot for quadrupeds. The framework integrates a multimodal sensor design with an LSTM-based model to infer ground reaction forces and detect slip-indicative anomalies during locomotion. The proposed framework is deployed on a Unitree Go1 quadruped to demonstrate blind online slip detection over a slippery terrain. Our method detects early-stage slips down to an average displacement of 24.1 ± 6.4 mm with an overall accuracy of 85.9%. This represents a 3.3-fold finer detection resolution and a 24% relative accuracy improvement over a standard kinematic baseline that uses foot velocity inferred through state estimation. The work in this paper serves as a foundation for force-aware gait adaptation in legged robotic locomotion, allowing future controllers to estimate terrain friction and adjust constraints, thus improving the overall stability of the system.
|
| |
| 09:00-10:30, Paper ThI1I.151 | Add to My Program |
| Generalized Momenta-Based Koopman Formalism for Robust Control of Euler-Lagrangian Systems |
|
| Singh, Rajpal | Indian Institute of Science |
| Singh, Aditya | Indian Institute of Science |
| Kashyap, Chidre Shravista | Indian Institute of Science |
| Keshavan, Jishnu | Indian Institute of Science |
Keywords: Machine Learning for Robot Control, Dynamics, Robust/Adaptive Control
Abstract: This paper presents a novel Koopman operator formulation for Euler–Lagrangian dynamics that employs an implicit generalized momentum-based state space representation, which decouples a known linear actuation channel from state-dependent dynamics and makes the system more amenable to linear Koopman modeling. By leveraging this structural separation, the proposed formulation only requires learning the unactuated dynamics rather than the complete actuation-dependent system, thereby significantly reducing the number of learnable parameters, improving data efficiency, and lowering overall model complexity. In contrast, conventional explicit formulations inherently couple inputs with the state-dependent terms in a nonlinear manner, making them more suitable for bilinear Koopman models, which are more computationally expensive to train and deploy. Notably, the proposed scheme enables the formulation of linear models that achieve superior prediction performance compared to conventional bilinear models while remaining substantially more efficient. To realize this framework, we present two neural network architectures that construct Koopman embeddings from actuated or unactuated data, enabling flexible and efficient modeling across different tasks. Robustness is ensured through the integration of a linear Generalized Extended State Observer (GESO), which explicitly estimates disturbances and compensates for them in real-time. The combined momentum-based Koopman and GESO framework is validated through comprehensive trajectory tracking simulations and experiments on robotic manipulators, demonstrating superior accuracy, robustness, and learning efficiency relative to state-of-the-art alternatives.
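The structural separation the abstract relies on can be written out explicitly. With M(q) the inertia matrix and p = M(q)q̇ the generalized momenta, the dynamics take the form below: the input enters through a known linear channel, so only the unactuated term must be learned. This is a sketch consistent with the abstract, not the paper's exact notation:

```latex
\begin{aligned}
p &= M(q)\,\dot{q}, \qquad
\dot{q} = M(q)^{-1}\,p, \\
\dot{p} &= \underbrace{f(q,\,p)}_{\text{unactuated dynamics (learned)}}
        \;+\; \underbrace{B\,u}_{\text{known linear actuation channel}},
\end{aligned}
\qquad
\dot{z} \;\approx\; A\,z + B_z\,u, \quad z = \phi(q, p),
```

so a Koopman embedding z = φ(q, p) only needs to linearize f, and the lifted model stays linear in the input u rather than bilinear.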
|
| |
| 09:00-10:30, Paper ThI1I.152 | Add to My Program |
| Re-MAE: Rethinking Masked Autoencoders towards Geometry-Aware Self-Supervised LiDAR-Based 3D Object Detection |
|
| Cheon, Youngho | DGIST |
| Lee, Jae-Keun | DGIST |
| Kwon, Soon | DGIST |
| Lee, Jin-Hee | DGIST |
| Lim, Yongseob | DGIST |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Self-supervised pre-training with masked autoencoders has shown promise for 3D perception, yet most approaches treat LiDAR point clouds in a geometry-agnostic manner. In this paper, we introduce Re-MAE, a geometry-aware self-supervised learning framework for LiDAR-based 3D object detection that explicitly encodes core properties of LiDAR point clouds: occlusion, distance-driven sparsity, and occupied-empty voxel structure. Re-MAE rethinks the geometric characteristics of LiDAR point clouds from the perspectives of "what to learn" and "how to learn", and introduces three components: (i) Geometry-Aware Masking, which realistically simulates occlusions in LiDAR scans and enables learning complete object representations from partial observations; (ii) Reconstruction-Contextual BCE loss, which effectively guides a multi-scale occupancy prediction task to mitigate distance-dependent point sparsity and the strong occupied-empty voxel imbalance, improving detection of both large vehicles and small, distant pedestrians; and (iii) Realistic Object Augmentation, a label-free foreground augmentation strategy that promotes object-centric representation learning and yields consistent gains across categories. Experiments on ONCE and Waymo Open Dataset validate the effectiveness of Re-MAE, delivering gains of 2.83 mAP and 1.53 L2 mAP, respectively, over baselines. These results demonstrate that explicitly incorporating the geometric characteristics of LiDAR point clouds enhances the effectiveness of self-supervised learning. The code will be released.
|
| |
| 09:00-10:30, Paper ThI1I.153 | Add to My Program |
| A CAD-Free Vision-Guided Framework for Robotic Deburring of Flexible Shoe Soles |
|
| Tafuro, Alessandra | Politecnico Di Milano |
| Guarini, Marco | Politecnico Di Milano |
| Mineo, Angelo | Politecnico Di Milano |
| Zanchettin, Andrea Maria | Politecnico Di Milano |
| Rocco, Paolo | Politecnico Di Milano |
Keywords: Sensor-based Control, Computer Vision for Manufacturing, Engineering for Robotic Systems
Abstract: Robotic sole deburring is a key, yet underexplored, challenge in footwear automation, where the deformable nature of rubber, variability of burrs, and diversity of sole geometries make automation difficult. Existing deburring approaches typically rely on CAD models or large training datasets, and often lack the ability to adapt online during execution. This paper presents a CAD-free, vision-guided framework for robotic deburring of shoe soles that integrates: (i) defect detection using the Segment Anything Model 2 without sole-specific training; (ii) motion planning for burr removal; and (iii) motion execution combining Forward Dynamics Compliance Control with online vision-based path tracking. The framework was validated on a UR5e robot equipped with a custom vacuum gripper. Results demonstrate a 95% success rate across soles of varying sizes, colors, and shapes. By eliminating CAD dependence, ensuring robust online correction, and maintaining compatibility with existing industrial deburring machines, this work provides a scalable step toward robotic finishing solutions in footwear manufacturing.
|
| |
| 09:00-10:30, Paper ThI1I.154 | Add to My Program |
| MASAR: Motion–Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting |
|
| Bencheikh Lehocine, Mohammed Amine | Mercedes-Benz AG |
| Schmidt, Julian | Mercedes-Benz AG |
| Moosmann, Frank | Mercedes Benz AG |
| Gupta, Dikshant | Mercedes-Benz AG |
| Flohr, Fabian | Munich University of Applied Sciences |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of ''looking backward to look forward'', and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR's effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at link.
|
| |
| 09:00-10:30, Paper ThI1I.155 | Add to My Program |
| A Hybrid Magnetic Actuation System for Hybrid Microrobotic Targeted Delivery |
|
| Ma, Yuanbiao | Southeast University |
| Luo, Shengming | Southeast University |
| Wang, Bin | Southeast University |
| Zhang, Haoyu | Southeast University |
| Lang, Ji | Southeast University |
| Tang, Zhiqiang | Southeast University |
| Zhang, Li | The Chinese University of Hong Kong |
| Wang, Qianqian | Southeast University |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales, Medical Robots and Systems
Abstract: Magnetic microrobots hold great promise for biomedical applications. However, achieving flexible magnetic field adjustment with a magnetic actuation system (MAS) to actuate diverse microrobots remains a significant challenge. In this work, we propose an Electromagnetic-Permanent Magnet Actuation (EPMA) system that generates controllable magnetic field variations to enable microrobot actuation for diverse tasks, including microrobotic actuation, microswarm pattern transformation, and targeted delivery. Automatic ellipsoid calibration of the Hall sensors enables real-time magnetic field orientation measurement with an error under 3°. Experimental results demonstrate the microrobot's actuation performance in four distinct scenarios, with a rotation frequency of 0.5 Hz. Furthermore, by adjusting the dynamic magnetic field, we achieve microswarm pattern reconfiguration under static conditions as well as targeted delivery in fluidic environments at a flow speed of 52 mm/s and a rotation frequency of 4 Hz. This study presents a hybrid MAS for microrobotic actuation in diverse environments via controllable dynamic magnetic fields.
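The automatic ellipsoid calibration of the Hall sensors mentioned above follows, in its classical form, a least-squares quadric fit that recovers a hard-iron offset and a soft-iron correction. A generic sketch of that standard procedure, not the paper's specific implementation:

```python
import numpy as np

def ellipsoid_calibrate(raw: np.ndarray):
    """Classical least-squares ellipsoid fit for Hall/magnetometer calibration:
    fit the quadric v^T A v + b^T v = 1 through raw readings (N, 3), then
    recover the offset and the correction mapping readings onto a sphere."""
    x, y, z = raw[:, 0], raw[:, 1], raw[:, 2]
    D = np.column_stack([x * x, y * y, z * z,
                         2 * x * y, 2 * x * z, 2 * y * z, x, y, z])
    c, *_ = np.linalg.lstsq(D, np.ones(len(raw)), rcond=None)
    A = np.array([[c[0], c[3], c[4]],
                  [c[3], c[1], c[5]],
                  [c[4], c[5], c[2]]])
    b = c[6:9]
    offset = -0.5 * np.linalg.solve(A, b)        # hard-iron offset
    radius2 = offset @ A @ offset + 1.0          # squared radius in the A-metric
    w, V = np.linalg.eigh(A / radius2)
    soft_iron = V @ np.diag(np.sqrt(w)) @ V.T    # symmetric square root of A/r^2
    return offset, soft_iron                     # corrected = soft_iron @ (v - offset)
```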
|
| |
| 09:00-10:30, Paper ThI1I.156 | Add to My Program |
| GUIDE: A Diffusion-Based Autonomous Robot Exploration Framework Using Global Graph Inference |
|
| Che, Zijun | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhang, Yinghong | The Hong Kong University of Science and Technology (Guangzhou) |
| Liang, Shengyi | Hong Kong University of Science and Technology |
| Zhou, Boyu | Southern University of Science and Technology |
| Ma, Jun | The Hong Kong University of Science and Technology |
| Zhou, Jinni | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Search and Rescue Robots
Abstract: Autonomous exploration in structured and complex indoor environments remains a challenging task, as existing methods often struggle to appropriately model unobserved space and plan globally efficient paths. To address these limitations, we propose GUIDE, a novel exploration framework that synergistically combines global graph inference with diffusion-based decision-making. We introduce a region-evaluation global graph representation that integrates both observed environmental data and predictions of unexplored areas, enhanced by a region-level evaluation mechanism to prioritize reliable structural inferences while discounting uncertain predictions. Building upon this enriched representation, a diffusion policy network generates stable, foresighted action sequences with significantly reduced denoising steps. Extensive simulations and real-world deployments demonstrate that GUIDE consistently outperforms state-of-the-art methods, achieving up to 18.3% faster coverage completion and a 34.9% reduction in redundant movements.
|
| |
| 09:00-10:30, Paper ThI1I.157 | Add to My Program |
| Gold Points Sniper: Self-Guided Visual Reasoning in VLM for Fine-Grained Action Understanding |
|
| Liu, Haodi | Tsinghua University |
| Yang, Xinhang | Tsinghua University |
| Yan, Kunda | Tsinghua University |
| Cui, Sen | Tsinghua University |
| Zhang, Zeyu | Beijing Institute for General Artificial Intelligence |
| Zhang, Changshui | Tsinghua University |
Keywords: Domestic Robotics
Abstract: Robots operating in everyday environments must understand fine-grained human actions, intentions, and contextual cues from broad views where people occupy only small regions, a capability unmet by current systems. While open-vocabulary action recognition methods remain limited to assigning predefined labels, and vision-language models (VLMs) face an inherent trade-off between informational richness and factual fidelity in their outputs, neither approach achieves the deep semantic interpretation required for reliable human-robot interaction. We propose Gold Points Sniper (GPS), a novel framework that empowers lightweight VLMs with self-guided multimodal reasoning capabilities for fine-grained human action understanding. Our approach comprises three key modules: Gold Points Extractor trains VLMs to identify critical action-relevant details, Selective Socratic Questioner validates and refines these details through selective self-questioning, and Semantic Entailment Evaluator quantitatively assesses factual consistency using semantic entailment classification. Extensive experiments on our curated instruction-tuning dataset based on the CAP benchmark demonstrate that GPS-enhanced lightweight VLMs achieve substantial performance improvements, with some models reaching performance comparable to proprietary GPT-4o while maintaining superior factual accuracy. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, enabling robots to safely interpret human behavior through information-dense yet factually grounded descriptions. Source code, training configurations, annotation prompts, and dataset details are released at https://github.com/Haodi-Liu/GPS-Gold-Point-Sniper.
|
| |
| 09:00-10:30, Paper ThI1I.158 | Add to My Program |
| ViTac-Tracing: Visual-Tactile Imitation Learning of Deformable Object Tracing |
|
| Zhao, Yongqiang | King's College London |
| Luo, Haining | Imperial College London |
| Wang, Yupeng | King’s College London |
| Spyrakos-Papastavridis, Emmanouil | King's College London |
| Demiris, Yiannis | Imperial College London |
| Luo, Shan | King's College London |
Keywords: Force and Tactile Sensing, Dexterous Manipulation, Telerobotics and Teleoperation
Abstract: Deformable objects often appear in unstructured configurations. Tracing deformable objects helps bring them into extended states, facilitating downstream manipulation tasks. Due to the requirements for object-specific modeling or sim-to-real transfer, existing tracing methods either lack generalizability across different categories of deformable objects or struggle to complete tasks reliably in the real world. To address this, we propose a novel visual-tactile imitation learning method to achieve one-dimensional (1D) and two-dimensional (2D) deformable object tracing with a unified model. Our method is designed from both local and global perspectives based on visual and tactile sensing. Locally, we introduce a weighted loss that emphasizes actions maintaining contact near the center of the tactile image, improving fine-grained adjustment. Globally, we propose a tracing task loss that helps the policy to regulate task progression. On the hardware side, to compensate for the limited features extracted from visual information, we integrate tactile sensing into a low-cost teleoperation system considering both the teleoperator and the robot. Extensive ablation and comparative experiments on diverse 1D and 2D deformable objects demonstrate the effectiveness of our approach, achieving an average success rate of 80% on seen objects and 65% on unseen objects. Demos, code and datasets are available at https://sites.google.com/view/vitac-tracing.
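The weighted loss described above, emphasizing actions that keep contact near the center of the tactile image, can be sketched as a Gaussian weighting of the per-sample imitation error. Shapes and the weighting function are hypothetical; the paper's exact formulation may differ:

```python
import torch

def center_weighted_action_loss(pred, target, contact_xy, sigma: float = 0.2):
    """Weighted imitation loss (a sketch): samples whose tactile contact lies
    near the center of the tactile image receive higher weight, emphasizing
    fine-grained adjustments that maintain centered contact.
    contact_xy: (B, 2) contact location in [-1, 1]^2 tactile-image coordinates."""
    dist2 = (contact_xy ** 2).sum(dim=-1)             # squared distance to center
    w = torch.exp(-dist2 / (2 * sigma ** 2))          # Gaussian center weighting
    per_sample = ((pred - target) ** 2).mean(dim=-1)  # MSE per sample
    return (w * per_sample).sum() / w.sum().clamp_min(1e-8)
```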
|
| |
| 09:00-10:30, Paper ThI1I.159 | Add to My Program |
| A Soft Oscillator with On-The-Fly Tunable Dynamics for Adaptive Robotics |
|
| Wang, Shaoxiang | University of Bristol |
| Yue, Tianqi | Southern University of Science and Technology |
| Ge, Hanwen | University of Bristol |
| Philamore, Hemma | University of Bristol |
| Conn, Andrew | University of Bristol |
Keywords: Soft Robot Materials and Design
Abstract: Autonomous and mobile soft robots require internal oscillators, similar to a biological heart, to generate rhythmic motions. However, existing soft oscillators typically have fixed operational parameters and suffer from an inherent coupling between control input and power output, limiting their versatility and adaptability. This paper addresses this challenge by introducing a new design paradigm: a soft, multi-port, bistable oscillator whose core nonlinear energy landscape can be continuously and actively tuned on-the-fly. Our approach, based on mechanically reconfiguring the physical constraints of a bistable elastomeric structure, achieves a decoupling of kinematics (frequency) from dynamics (output pressure). We demonstrate this principle in two modes: first, active programming, where we continuously modulate the oscillator’s coupled frequency-amplitude relationship in real-time under a constant power input. Secondly, we demonstrate passive adaptation, where an autonomous walker powered by our oscillator exhibits physical intelligence. By physically interacting with a confined environment, the walker autonomously and instantaneously adapts its gait from a low-frequency, large-amplitude mode to a high-frequency, small-amplitude mode. This work provides a new pathway for creating adaptive, intelligent soft robots that can autonomously respond to their physical world without any electronic computation.
|
| |
| 09:00-10:30, Paper ThI1I.160 | Add to My Program |
| Kinodynamic Task and Motion Planning Using VLM-Guided and Interleaved Sampling |
|
| Kwon, Minseo | Ewha Womans University |
| Kim, Young J. | Ewha Womans University |
Keywords: Task and Motion Planning, Task Planning
Abstract: Task and Motion Planning (TAMP) integrates high-level task planning with low-level motion feasibility, but existing methods are costly in long-horizon problems due to excessive motion sampling. While LLMs provide commonsense priors, they lack 3D spatial reasoning and cannot ensure geometric or dynamic feasibility. We propose a kinodynamic TAMP planner based on a hybrid state tree that uniformly represents symbolic and numeric states during planning, enabling task and motion decisions to be made jointly. Kinodynamic constraints embedded in the TAMP problem are verified by an off-the-shelf motion planner and physics simulator, and a VLM guides exploration of a TAMP solution and backtracks the search based on visual rendering of the states. Experiments in simulated domains and the real world show 32.14%–1166.67% higher average success rates compared to traditional and LLM-based TAMP planners and reduced planning time on complex problems, with ablations further highlighting the benefits of VLM backtracking. More details are available at https://graphics.ewha.ac.kr/kinodynamicTAMP/.
|
| |
| 09:00-10:30, Paper ThI1I.161 | Add to My Program |
| Supervisory Measurement-Guided Noise Covariance Estimation |
|
| Li, Haoying | Chinese University of Hong Kong (Shenzhen) |
| Peng, Yifan | Chinese University of Hong Kong (Shenzhen) |
| Li, Xinghan | DeepMirror Inc |
| Wu, Junfeng | Chinese University of Hong Kong (Shenzhen) |
Keywords: Calibration and Identification, Probability and Statistical Methods, Sensor Fusion
Abstract: Reliable state estimation hinges on accurate specification of sensor noise covariances, which weigh heterogeneous measurements. In practice, these covariances are difficult to identify due to environmental variability, front-end preprocessing, and other reasons. We address this by formulating noise covariance estimation as a bilevel optimization that, from a Bayesian perspective, factorizes the joint likelihood of so-called odometry and supervisory measurements, thereby balancing information utilization with computational efficiency. The factorization converts the nested Bayesian dependency into a chain structure, enabling efficient parallel computation: at the lower level, an invariant extended Kalman filter with state augmentation estimates trajectories, while a derivative filter computes analytical gradients in parallel for upper-level gradient updates. The upper level refines the covariance to guide the lower-level estimation. Experiments on synthetic and real-world datasets show that our method achieves higher efficiency than existing baselines.
|
| |
| 09:00-10:30, Paper ThI1I.162 | Add to My Program |
| Reformulating AI-Based Multi-Object Relative State Estimation for Aleatoric Uncertainty-Based Outlier Rejection of Partial Measurements |
|
| Jantos, Thomas | University of Klagenfurt |
| Delama, Giulio | University of Klagenfurt |
| Weiss, Stephan | University of Klagenfurt |
| Steinbrener, Jan | Universität Klagenfurt |
Keywords: AI-Based Methods, Localization, Deep Learning for Visual Perception
Abstract: Precise localization with respect to a set of objects of interest enables mobile robots to perform various tasks. With the rise of edge devices capable of deploying deep neural networks (DNNs) for real-time inference, it stands to reason to use artificial intelligence (AI) for the extraction of object-specific, semantic information from raw image data, such as the object class and the relative six degrees of freedom (6-DoF) pose. However, fusing such AI-based measurements in an Extended Kalman Filter (EKF) requires quantifying the DNNs' uncertainty and outlier rejection capabilities. This paper presents the benefits of reformulating the measurement equation in AI-based, object-relative state estimation. By deriving an EKF using the direct object-relative pose measurement, we can decouple the position and rotation measurements, thus limiting the influence of erroneous rotation measurements and allowing partial measurement rejection. Furthermore, we investigate the performance and consistency improvements for state estimators provided by replacing the fixed measurement covariance matrix of the 6-DoF object-relative pose measurements with the predicted aleatoric uncertainty of the DNN.
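Decoupling the position and rotation measurements makes partial rejection straightforward: each block is gated independently with a Mahalanobis (chi-square) test before its own sequential update, so a bad rotation measurement can be dropped while the position measurement is kept. A generic EKF sketch of that mechanism, not the authors' code:

```python
import numpy as np

def gated_block_update(x, P, blocks):
    """Sequential EKF update with per-block chi-square gating (a sketch).
    blocks: iterable of (z, h, H, R, gate) tuples, one per decoupled
    measurement block (e.g., position block, rotation block)."""
    for z, h, H, R, gate in blocks:
        y = z - h(x)                                  # innovation
        S = H @ P @ H.T + R                           # innovation covariance
        if y @ np.linalg.solve(S, y) > gate:          # Mahalanobis gate
            continue                                  # reject only this block
        K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
        x = x + K @ y
        P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```

With a coupled 6-DoF pose measurement, by contrast, a single corrupted rotation would force rejection (or contamination) of the whole measurement.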
|
| |
| 09:00-10:30, Paper ThI1I.163 | Add to My Program |
| RICE: Reactive Interaction Controller for Cluttered Canopy Environment |
|
| Parayil, Nidhi | QUT Centre for Robotics |
| Peynot, Thierry | Queensland University of Technology (QUT) |
| Lehnert, Christopher | Queensland University of Technology |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Force and Tactile Sensing
Abstract: Robotic navigation in dense, cluttered environments such as agricultural canopies presents significant challenges due to physical and visual occlusion caused by leaves and branches. Traditional vision-based or model-dependent approaches often fail in these settings, where physical interaction without damaging foliage and branches is necessary to reach a target. We present a novel reactive controller that enables safe navigation for a robotic arm in a contact-rich, cluttered, deformable environment using end-effector position and real-time tactile feedback. Our proposed framework's interaction strategy is based on a trade-off between minimizing disturbance by maneuvering around obstacles and pushing through them to move towards the target. We show that over 35 trials in 3 experimental plant setups with an occluded target, the proposed controller successfully reached the target in all trials without breaking any branch, and outperformed the established control strategy for dense foliage in reliability and adaptability. This work lays the foundation for safe, adaptive interaction in cluttered, contact-rich deformable environments, enabling future agricultural tasks such as pruning and harvesting in plant canopies.
|
| |
| 09:00-10:30, Paper ThI1I.164 | Add to My Program |
| Balancing Marker and Markerless Modes in Vision-Based Tactile Sensors with a Translucent Skin |
|
| Tijani, Oluwatimilehin | King's College London |
| Chen, Zhuo | King's College London |
| Deng, Jiankang | Imperial College London |
| Luo, Shan | King's College London |
Keywords: Force and Tactile Sensing
Abstract: Vision-based tactile sensors (VBTS) face an inherent trade-off in tactile skin design. Opaque ink markers enable accurate force and tangential displacement estimation but occlude geometric features essential for object and texture classification. Conversely, markerless skins preserve surface details yet provide limited capability for tangential motion estimation. Existing approaches, including UV illumination and learning-based virtual marker transfer, increase hardware complexity or computational cost. We present a novel tactile skin with translucent, tinted markers, balancing marker and markerless modes for VBTS. This design enables concurrent tangential displacement tracking, force estimation, and preservation of surface geometry. It integrates directly with the GelSight sensor family, requiring no additional hardware and minimal software modification. Experimental evaluation demonstrates that the translucent skin improves overall sensing performance relative to both opaque-marker and markerless configurations. It achieves 99.17% accuracy in object classification, 93.51% in texture classification, 97% point retention in tangential displacement tracking, and a 66% reduction in total force error. These results indicate that the translucent skin substantially mitigates the traditional trade-off between marker-based and markerless tactile sensing, thereby expanding the applicability of multi-modal VBTS in tactile robotics.
|
| |
| 09:00-10:30, Paper ThI1I.165 | Add to My Program |
| Model Predictive Control with Reference Learning for Soft Robotic Intracranial Pressure Waveform Modulation |
|
| Flürenbrock, Fabian | ETH Zurich |
| Büchel, Yanick | ETH Zurich |
| Köhler, Johannes | Imperial College London |
| Schmid Daners, Marianne | ETH Zurich |
| Zeilinger, Melanie N. | ETH Zurich |
Keywords: Medical Robots and Systems, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: This paper introduces a learning-based control framework for a soft robotic actuator system designed to modulate intracranial pressure (ICP) waveforms, which is essential for studying cerebrospinal fluid dynamics and pathological processes underlying neurological disorders. A two-layer framework is proposed to safely achieve a desired ICP waveform modulation. First, a model predictive controller (MPC) with a disturbance observer is used for offset-free tracking of the system’s motor position reference trajectory under safety constraints. Second, to address the unknown nonlinear dependence of ICP on the motor position, we employ a Bayesian optimization (BO) algorithm used for online learning of a motor position reference trajectory that yields the desired ICP modulation. The framework is experimentally validated using a test bench with a brain phantom that replicates realistic ICP dynamics in vitro. Compared to a previously employed proportional-integral-derivative controller, the MPC reduces mean and maximum motor position reference tracking errors by 83 % and 73 %, respectively. In less than 20 iterations, the BO algorithm learns a motor position reference trajectory that yields an ICP waveform with the desired mean and amplitude.
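The upper BO layer can be sketched as a standard expected-improvement loop over the reference-trajectory parameters, with each evaluation running one MPC-tracked trial on the bench. A generic sketch using scikit-learn's Gaussian process; run_trial and the parameterization are hypothetical stand-ins:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bo_learn_reference(run_trial, bounds, n_init=5, n_iter=15, seed=0):
    """Upper-layer BO loop (a sketch): theta parameterizes the motor-position
    reference; run_trial executes one MPC-tracked trial and returns the ICP
    waveform error to minimize. bounds: (d, 2) array of parameter ranges."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))      # initial design
    y = np.array([run_trial(theta) for theta in X])
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(lo, hi, size=(1024, len(lo))) # random candidate pool
        mu, sd = gp.predict(cand, return_std=True)
        imp = y.min() - mu                               # improvement over best
        z = imp / (sd + 1e-9)
        ei = imp * norm.cdf(z) + sd * norm.pdf(z)        # expected improvement
        theta_next = cand[np.argmax(ei)]
        X = np.vstack([X, theta_next])
        y = np.append(y, run_trial(theta_next))          # one bench trial
    return X[np.argmin(y)]                               # best reference found
```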
|
| |
| 09:00-10:30, Paper ThI1I.166 | Add to My Program |
| Multi-Robot Obstacle-Aware Shepherding of Non-Cohesive Target Agents |
|
| Tomaselli, Cinzia | Scuola Superiore Meridionale |
| Covone, Stefano | Scuola Superiore Meridionale |
| Reina, Andreagiovanni | Universität Konstanz & Max Planck Institute of Animal Behavior |
| Di Bernardo, Mario | University of Naples Federico II |
Keywords: Multi-Robot Systems, Cooperating Robots, Agent-Based Systems
Abstract: This paper presents a novel control strategy for multi-agent shepherding of non-cohesive targets in obstacle-rich environments. Unlike previous approaches that assume cohesive flocking behavior, our method handles targets that interact only with nearby herders through repulsive forces and exhibit no inter-target coordination. Each herder employs a hybrid control policy that combines direct goal-oriented steering with obstacle-tangent maneuvering, enabling targets to circumnavigate obstacles while being guided toward a goal region. The herder dynamics integrate three key behaviors: return-to-goal motion when idle, target steering with adaptive directional control, and obstacle avoidance using both normal and tangential force components. Numerical simulations demonstrate superior performance compared to existing shepherding methods, achieving higher target confinement rates in cluttered environments. Experimental validation using TurtleBot4 herders and Osoyoo target robots in an indoor arena confirms the practical effectiveness of the proposed approach.
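The hybrid herder policy, goal-oriented steering blended with obstacle-tangent maneuvering, can be sketched as a simple velocity law. The gains, radii, and stand-behind offset below are illustrative, not the paper's values:

```python
import numpy as np

def herder_velocity(herder, target, goal, obstacles, r_avoid=0.5, k=1.0):
    """Hybrid steering sketch: position the herder behind the target relative
    to the goal, and blend in normal plus tangential obstacle forces so the
    target can be driven around obstructions rather than into them."""
    drive_dir = (goal - target) / np.linalg.norm(goal - target)
    steer_pos = target - 0.3 * drive_dir        # stand behind the target
    v = k * (steer_pos - herder)                # goal-oriented steering
    for obs in obstacles:
        d = herder - obs
        dist = np.linalg.norm(d)
        if dist < r_avoid:                      # inside the avoidance radius
            normal = d / dist
            tangent = np.array([-normal[1], normal[0]])
            sign = np.sign(tangent @ v) or 1.0  # circumnavigate consistently
            v += (r_avoid - dist) * (normal + sign * tangent)
    return v
```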
|
| |
| 09:00-10:30, Paper ThI1I.167 | Add to My Program |
| A Novel Human-Machine Dual-Task Gaming Framework for Visual-Attention Training |
|
| Mu, Fengjun | University of Electronic Science and Technology of China |
| Zhang, Jingting | University of Electronic Science and Technology of China |
| Huang, Zonghai | University of Electronic Science and Technology of China |
| Chen, Chen | University of Electronic Science and Technology of China |
| Zou, Chaobin | University of Electronic Science and Technology of China |
| Song, Guangkui | University of Electronic Science and Technology of China |
| Cheng, Hong | University of Electronic Science and Technology |
Keywords: Human Performance Augmentation, Brain-Machine Interfaces, Human Factors and Human-in-the-Loop
Abstract: Efficient brain functional training with rehabilitation robots has been an important and challenging topic in the human-machine interaction (HMI) field. Adjusting the interaction and gaming behaviors between human and machine to effectively activate the brain's functional behavior is still a substantial challenge. In this paper, we take visual-attention training as an example and propose a novel human-machine co-gaming interaction framework by integrating a dual-task gaming paradigm and a human–machine gaming strategy. It effectively utilizes the gaming characteristics of HMI behaviors and tasks to precisely activate the human's active attention and passive attention for training. Specifically, we design a gaze-driven dual-task gaming paradigm to co-activate the active and passive attention-network competition for systematically engaging human visual-attention allocation and training. We further develop a reinforcement-learning-based human–machine gaming strategy to adjust the task parameters for improving the attention training efficiency. Finally, we conduct an experimental study with 8 healthy participants, jointly analyzing participants' EEG and eye-tracking data through the training process. Results show that our method can achieve improvement of brain engagement by an average of 15.6% over the widely-employed staircase strategy.
|
| |
| 09:00-10:30, Paper ThI1I.168 | Add to My Program |
| A Precise Real-Time Force-Aware Grasping System for Robust Aerial Manipulation |
|
| Hoi, Kenghou | Zhejiang University |
| Wu, Yuze | Zhejiang University |
| Ding, Annan | Zhejiang University |
| Wang, Junjie | Zhejiang University |
| Zhao, Anke | Zhejiang University |
| Hou, Jialiang | Zhejiang University |
| Zhang, Chengqian | Zhejiang University |
| Gao, Fei | Zhejiang University |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy
Abstract: Aerial manipulation requires force-aware capabilities to enable safe and effective grasping and physical interaction. Previous works often rely on heavy, expensive force sensors unsuitable for typical quadrotor platforms, or perform grasping without force feedback, risking damage to fragile objects. To address these limitations, we propose a novel force-aware grasping framework incorporating six low-cost, sensitive skin-like tactile sensors. We introduce a magnetic-based tactile sensing module that provides high-precision three-dimensional force measurements. We eliminate geomagnetic interference through a reference Hall sensor and simplify the calibration process compared to previous work. The proposed framework enables precise force-aware grasping control, allowing safe manipulation of fragile objects and real-time weight measurement of grasped items. The system is validated through comprehensive real-world experiments, including balloon grasping, dynamic load variation tests, and ablation studies, demonstrating its effectiveness in various aerial manipulation scenarios. Our approach achieves fully onboard operation without external motion capture systems, significantly enhancing the practicality of force-sensitive aerial manipulation. The supplementary video is available at: https://www.youtube.com/watch?v=mbcZkrJEf1I.
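Eliminating geomagnetic interference through a reference Hall sensor suggests a differential reading: the reference sensor sees only the ambient field, which is subtracted from each tactile sensor before mapping flux to force. A hypothetical sketch of that idea, assuming a linear flux-to-force calibration per sensor:

```python
import numpy as np

def contact_forces(tactile_fields, reference_field, calib_matrices):
    """Differential magnetic tactile reading (a sketch): subtract the ambient
    field measured by a reference Hall sensor, then map each skin-like
    sensor's residual flux to a 3D contact force via its calibration matrix."""
    forces = []
    for field, C in zip(tactile_fields, calib_matrices):  # one entry per sensor
        residual = field - reference_field                # remove geomagnetic bias
        forces.append(C @ residual)                       # linear flux-to-force map
    return np.array(forces)                               # (N, 3) per-sensor forces
```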
|
| |
| 09:00-10:30, Paper ThI1I.169 | Add to My Program |
| STAGE: Structure-Adaptive Graph-Encoded Multi-Agent Policy Gradient for Moving Target Search in Uncertain Topological Networks |
|
| Peng, Qihang | The Hong Kong Polytechnic University |
| Zhu, Lizhou | University of Electronic Science and Technology of China |
| Chen, Lekai | University of Electronic Science and Technology of China (UESTC) |
| Guo, Hongliang | Sichuan University |
| Wen, Chih-yung | The Hong Kong Polytechnic University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Reinforcement Learning
Abstract: This paper investigates the multi-robot efficient search (MuRES) problem in uncertain topological networks. One unique characteristic of the studied problem is that the topology of the underlying network is uncertain, posing great challenges to canonical MuRES solutions, which presume a fixed network topology. To address the challenge, this paper proposes the STructure-Adaptive Graph-Encoded policy gradient (STAGE) algorithm for moving target search. STAGE comprises two main components: (1) the bi-scale graph attention network (GAT) encoder, which fuses a k-hop local GAT with a distance-augmented long-range GAT to enable the encoder to capture both local and long-range network structural changes; and (2) the entropy-regularized counterfactual policy gradient module, which employs a structure-aware centralized critic to estimate both the team returns and the network structure information, and train the decentralized actors via counterfactual marginalization with entropy regularization. Extensive simulation results and physical experiments demonstrate the feasibility and superiority of STAGE for solving MuRES in uncertain topological environments.
|
| |
| 09:00-10:30, Paper ThI1I.170 | Add to My Program |
| Automated Retinal Photocoagulation Using Instrument-Integrated OCT and Laser Pattern Mapping |
|
| Briel, Marius | Carl Zeiss AG |
| Haide, Ludwig | Carl Zeiss AG |
| Wu, Dongyue | Carl Zeiss AG |
| Hornstein, Justus | Karlsruhe Institute of Technology |
| Matten, Philipp | Carl Zeiss AG |
| Piccinelli, Nicola | University of Verona |
| Kronreif, Gernot | ACMIT Gmbh |
| Tagliabue, Eleonora | Carl Zeiss AG |
| Mathis-Ullrich, Franziska | Friedrich-Alexander-University Erlangen-Nurnberg (FAU) |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Computer Vision for Automation
Abstract: Retinal endolaser photocoagulation (REPC) is a repetitive intraocular surgical procedure that could greatly benefit from automation and distance-based control, improving both efficiency and safety. This work presents a robotic system designed for automated REPC, utilizing instrument-integrated optical coherence tomography (iiOCT) to facilitate real-time distance measurements. The system employs intraoperative spherical and ellipsoidal retinal models to convert 2D laser patterns into 3D arrangements, which are further refined through a control loop that incorporates online feedback. Ex vivo experiments in porcine eyes demonstrated clinical-level accuracy, with lateral and axial errors of 44 micrometers and 29 micrometers, respectively. Additionally, the proposed mapping technique produced patterns with greater equidistance than baseline methods. This system showcases the potential to automate repetitive surgical tasks while maintaining the surgeon's control over critical decision-making processes in ophthalmic surgery.
|
| |
| 09:00-10:30, Paper ThI1I.171 | Add to My Program |
| Contributing Factors in Human-Robot Handshake: Compliance, Hand Grip, and Synchrony |
|
| Saood, Adnan | ENSTA - Institute Polytechnique De Paris |
| Tapus, Adriana | ENSTA Paris, Institut Polytechnique De Paris |
Keywords: Touch in HRI, Social HRI
Abstract: Physical touch, such as handshakes, plays a critical role in human-robot interaction (HRI), influencing perceived naturalness and social presence. This study investigates how arm compliance, hand grip strength, and motion synchrony jointly affect the subjective quality of a human-robot handshake. We implemented a fully actuated, tactile-sensorized humanoid hand mounted on a manipulator arm and designed a compliant, oscillatory handshake controller with adaptive synchronization. Sixteen participants experienced handshakes across a 2×2×2 factorial design, varying arm compliance, grip strength, and synchrony. Objective kinematic analysis revealed significant main and interaction effects across all factors. At the same time, subjective ratings showed clear preferences for a weaker grip and greater arm compliance, with synchrony exerting minimal influence on perceived naturalness. These results highlight a perceptual hierarchy in HRI: foundational haptic properties exert the strongest influence on user experience, while advanced kinematic adjustments have limited impact when basic comfort is lacking. This insight provides concrete guidance for designing robotic handshakes that feel more human-like and pleasant.
|
| |
| 09:00-10:30, Paper ThI1I.172 | Add to My Program |
| Whole-Body Model-Predictive Control of Legged Robots with MuJoCo |
|
| Zhang, John | Massachusetts Institute of Technology |
| Howell, Taylor | Google DeepMind |
| Yi, Zeji | Carnegie Mellon University |
| Pan, Chaoyi | Carnegie Mellon University |
| Shi, Guanya | Carnegie Mellon University |
| Qu, Guannan | Carnegie Mellon University |
| Erez, Tom | Google DeepMind |
| Tassa, Yuval | Google DeepMind |
| Manchester, Zachary | Massachusetts Institute of Technology |
Keywords: Legged Robots, Whole-Body Motion Planning and Control, Optimization and Optimal Control
Abstract: We demonstrate the surprising real-world effectiveness of a very simple approach to whole-body model-predictive control (MPC) of quadruped and humanoid robots: the iterative linear-quadratic regulator (iLQR) algorithm with MuJoCo dynamics and finite-difference approximated derivatives. Building upon the previous success of model-based behavior synthesis and control of locomotion and manipulation tasks with MuJoCo in simulation, we show that these policies can easily generalize to the real world with few sim-to-real considerations. Our baseline method achieves real-time MPC while leveraging whole-body dynamics and collision detection in a variety of hardware experiments, including dynamic quadruped locomotion, quadruped walking on two legs, and full-sized humanoid bipedal locomotion. Additionally, our GUI system enables users to interactively update robot behavior in real time on the robot hardware, making task-specific objective parameter tuning easy and intuitive. Our code is available at: https://johnzhang3.github.io/mujoco_ilqr
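A minimal sketch of the finite-difference linearization step underlying this kind of MuJoCo-based iLQR, assuming the current MuJoCo Python bindings (which expose `mujoco.mjd_transitionFD`); the backward Riccati pass and line search of a full iLQR solver are not shown:

```python
import mujoco
import numpy as np

def linearize_dynamics(model: mujoco.MjModel, data: mujoco.MjData,
                       eps: float = 1e-6, centered: bool = True):
    """Finite-difference linearization x' ~ A x + B u around the state and
    control currently stored in `data`, using MuJoCo's built-in helper."""
    nx = 2 * model.nv + model.na          # state: velocities-space qpos, qvel, act
    A = np.zeros((nx, nx))
    B = np.zeros((nx, model.nu))
    mujoco.mjd_transitionFD(model, data, eps, int(centered), A, B, None, None)
    return A, B
```

Within an iLQR loop, this linearization would be recomputed along the current trajectory at every iteration.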
|
| |
| 09:00-10:30, Paper ThI1I.173 | Add to My Program |
| Trailer-Aware End-To-End Autonomous Driving for Tractor-Trailers with Deep Reinforcement Learning |
|
| Li, Congfei | City University of Hong Kong |
| Li, Yang | City University of Hong Kong |
| Liu, Peigen | Westwell Limited |
| Gu, Rongqi | Westwell Limited |
| Sun, Zuolei | Westwell Limited |
| Sun, Yuxiang | City University of Hong Kong |
Keywords: Autonomous Vehicle Navigation, Intelligent Transportation Systems
Abstract: End-to-end autonomous driving has been greatly advanced in recent years. However, most existing work focuses on small vehicles (e.g., cars). Driving articulated trucks, such as tractor-trailers, remains underexplored. The underactuated nature and extended wheelbase of tractor-trailers pose considerable driving challenges, especially when navigating narrow roads. For example, when a left-hand-drive tractor-trailer makes a right turn on a two-way two-lane narrow road, the tractor usually needs to encroach on the opposing lane. Otherwise, the trailer may have insufficient space to turn right and may strike curbside objects. To address this problem, we employ deep reinforcement learning to train an end-to-end autonomous driving policy with a trailer-aware reward function. Through planar rigid-body kinematics analysis, we locate the reference points on the tractor and the trailer. We also build a tractor-trailer model for CARLA. Experimental results demonstrate the effectiveness and superiority of our method in CARLA.
|
| |
| 09:00-10:30, Paper ThI1I.174 | Add to My Program |
| Streaming Loop-Closure Selection under Memory Constraints in Graph-SLAM |
|
| Vafaee, Reza | Boston College |
| Khan, Usman | Boston College |
Keywords: SLAM, Optimization and Optimal Control, Planning under Uncertainty
Abstract: Graph-based SLAM models robot poses as vertices and relative-pose measurements (odometry and loop-closures) as edges. Odometry edges are always kept to preserve connectivity, while loop-closure edges reduce drift but cannot all be stored due to memory or computation limits. Our challenge is to decide online which closures to keep under a strict budget, when the full set of measurements cannot be stored or centralized. Prior work instead addresses an offline problem that assumes access to the complete pose graph and optimizes a log-determinant (D-optimality) surrogate. In the online regime, an additional difficulty arises because the odometry backbone grows over time and the utility of each loop-closure changes as the graph evolves. We formulate this problem as streaming submodular maximization with a time-varying log-determinant objective. We propose a one-pass preemptive greedy policy that operates with exactly k memory slots for loop-closures. We show that, under arbitrary arrival order, it achieves a uniform constant-factor guarantee on the log-determinant improvement beyond an odometry-only baseline, relative to the hindsight-optimal size-k solution. On benchmark data, the proposed method closely matches offline greedy despite the conservative bound, showing that principled streaming selection can recover most of the benefit of loop-closures while respecting resource limits.
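The following is a minimal sketch (not the paper's exact policy) of a one-pass preemptive greedy selector with k memory slots and a log-determinant objective; for brevity it assumes a fixed odometry information matrix, whereas in the streaming setting the backbone grows over time:

```python
import numpy as np

def logdet(M: np.ndarray) -> float:
    sign, val = np.linalg.slogdet(M)
    return val if sign > 0 else -np.inf

class PreemptiveGreedySelector:
    """Keeps at most k loop closures, greedily maximizing
    logdet(odometry information + kept closure information increments)."""

    def __init__(self, odom_info: np.ndarray, k: int):
        self.base = odom_info        # information matrix of the odometry backbone
        self.k = k
        self.kept = []               # list of (edge_id, info_increment)

    def _total(self, edges) -> float:
        M = self.base.copy()
        for _, inc in edges:
            M += inc
        return logdet(M)

    def offer(self, edge_id, inc: np.ndarray) -> bool:
        """Process one arriving loop closure; return True if it is kept."""
        cur = self._total(self.kept)
        gain = self._total(self.kept + [(edge_id, inc)]) - cur
        if len(self.kept) < self.k:
            if gain > 0:
                self.kept.append((edge_id, inc))
                return True
            return False
        # Preemption: evict the kept edge whose removal costs the least,
        # but only if the newcomer's gain beats that loss.
        losses = [cur - self._total(self.kept[:i] + self.kept[i + 1:])
                  for i in range(self.k)]
        i_min = int(np.argmin(losses))
        if gain > losses[i_min]:
            self.kept[i_min] = (edge_id, inc)
            return True
        return False
```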
|
| |
| 09:00-10:30, Paper ThI1I.175 | Add to My Program |
| RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment |
|
| Zhuang, Qiyuan | Southeast University |
| Xu, He-Yang | Southeast University |
| Wang, Yijun | Southeast University |
| Zhao, Xin-Yang | Nanjing University of Science and Technology |
| Li, Yang-Yang | Nanjing University of Science and Technology |
| Wei, Xiu-Shen | Southeast University |
Keywords: Deep Learning for Visual Perception, Visual Learning, RGB-D Perception
Abstract: Understanding object affordances is essential for enabling robots to perform purposeful and fine-grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large-scale models, which frequently mislocalize contact points and mispredict post-contact actions when applied to unseen categories, thereby hindering robust generalization. We introduce Retrieval-Augmented Affordance Prediction (RAAP), a framework that unifies affordance retrieval with alignment-based learning. By decoupling static contact localization and dynamic action direction, RAAP transfers contact points via dense correspondence and predicts action directions through a retrieval-augmented alignment model that consolidates multiple references with dual-weighted attention. Trained on compact subsets of DROID and HOI4D with as few as tens of samples per task, RAAP achieves consistent performance across unseen objects and categories, and enables zero-shot robotic manipulation in both simulation and the real world. Project website: https://github.com/SEU-VIPGroup/RAAP.
|
| |
| 09:00-10:30, Paper ThI1I.176 | Add to My Program |
| DOT-Sim: Differentiable Optical Tactile Simulation with Precise Real-To-Sim Physical Calibration |
|
| You, Yang | Stanford University |
| Do, Won Kyung | Stanford University |
| Swann, Aiden | Stanford |
| Antonova, Rika | Stanford University |
| Kennedy, Monroe | Stanford University |
| Guibas, Leonidas | Stanford University |
Keywords: Force and Tactile Sensing, Simulation and Animation, Deep Learning Methods
Abstract: Simulating optical tactile sensors presents significant challenges due to their high deformability and intricate optical properties. To address these issues and enable a physically accurate simulation, we propose DOT-Sim: Differentiable Optical Tactile Simulation. Unlike prior simulators that rely on simplified models of deformable sensors, DOT-Sim accurately captures the physical behavior of soft sensors by modeling them as elastic materials using the Material Point Method (MPM). DOT-Sim enables rapid calibration of optical tactile sensor simulation using a small number of demonstrations within minutes, which is substantially faster than existing methods. Compared to current baselines, our approach supports much larger and non-linear deformations. To handle the optical aspect, we propose a novel approach to simulating optical responses by learning a residual image relative to the real-world idle state. We validate the physical and visual realism of our method through a series of zero-shot sim-to-real tasks. Our experiments show that DOT-Sim (1) accurately replicates the physical dynamics of a DenseTact optical tactile sensor in reality, (2) generates realistic optical outputs in contact-rich scenarios, (3) enables direct deployment of simulation-trained classifiers in the real world, achieving 85% classification accuracy on challenging objects and 90% accuracy in embedded tumor-type detection, and (4) allows precise trajectory following, with a policy trained from demonstrations in simulation achieving an average error of less than 0.9 mm.
|
| |
| 09:00-10:30, Paper ThI1I.177 | Add to My Program |
| Image-Based Closed-Loop Control of a Robotically Steerable Endoscopic Cannula for Minimally Invasive Neurosurgery |
|
| Malhotra, Nidhi | Georgia Institute of Technology |
| Konda, Revanth | Georgia Institute of Technology |
| Desai, Jaydev P. | Georgia Institute of Technology |
Keywords: Medical Robots and Systems, Sensor-based Control, Surgical Robotics: Steerable Catheters/Needles
Abstract: Robot-assisted minimally invasive neurosurgeries have shown great promise in enabling lower invasiveness and faster patient recovery times. However, performing such surgeries remains challenging, mainly due to the use of rigid surgical tools and limited accessibility to deep-seated brain structures. Employing robotically steerable tools could address these challenges, as these devices, being relatively more dexterous, can gain access to different regions in the brain. The autonomous control of these tools could further enable manipulation with higher precision and lower procedural time, facilitating less fatigue for surgeons. In this paper, we present a control strategy for the precise manipulation of a robotically steerable endoscopic cannula (RSEC). The proposed control architecture uses a combination of inverse kinematics, endoscopic imaging, and electromagnetic tracking feedback to perform task-space control of the RSEC in real-time. A joint angle estimation algorithm is proposed to estimate the bending angles of the RSEC using an endoscopic camera. The tip-position RMSE value of the RSEC when bending the proximal and distal joints, obtained using the proposed control strategy, was 0.7 mm. The results indicate that the proposed method can be used to achieve position control of the RSEC with sub-mm accuracy.
|
| |
| 09:00-10:30, Paper ThI1I.178 | Add to My Program |
| From Passive Monitoring to Active Defense: Resilient Control of Manipulators under Cyberattacks |
|
| Gualandi, Gabriele | Mälardalen University |
| Papadopoulos, Alessandro Vittorio | Mälardalen University |
Keywords: Probability and Statistical Methods, Optimization and Optimal Control, Failure Detection and Recovery
Abstract: Cyber-physical robotic systems are vulnerable to false data injection attacks (FDIAs), in which an adversary corrupts sensor signals while evading residual-based passive anomaly detectors such as the chi-squared test. Such stealthy attacks can induce substantial end-effector deviations without triggering alarms. This paper studies the resilience of redundant manipulators to stealthy FDIAs and advances the architecture from passive monitoring to active defence. We formulate a closed-loop model comprising a feedback-linearized manipulator, a steady-state Kalman filter, and a chi-squared-based anomaly detector. Building on this passive monitoring layer, we propose an active control-level defence that attenuates the control input through a monotone function of an anomaly score generated by a novel actuation-projected, measurement-free state predictor. The proposed design provides probabilistic guarantees on nominal actuation loss and preserves closed-loop stability. From the attacker perspective, we derive a convex QCQP for computing one-step optimal stealthy attacks. Simulations on a 6-DOF planar manipulator show that the proposed defence significantly reduces attack-induced end-effector deviation while preserving nominal task performance in the absence of attacks.
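As a hedged illustration of the two layers described above, the snippet below shows a standard chi-squared residual test plus a monotone control attenuation; the anomaly `score` in the paper comes from the proposed actuation-projected, measurement-free predictor, which is only a placeholder here:

```python
import numpy as np
from scipy.stats import chi2

def chi_squared_alarm(residual: np.ndarray, S: np.ndarray, p_false: float = 0.01) -> bool:
    """Classic passive detector: flag if the normalized innovation
    exceeds the chi-squared threshold at false-alarm rate p_false."""
    g = residual @ np.linalg.solve(S, residual)
    return g > chi2.ppf(1.0 - p_false, df=residual.size)

def attenuate_control(u: np.ndarray, score: float, alpha: float = 1.0) -> np.ndarray:
    """Active layer (illustrative): scale the control by a monotone
    decreasing function of an anomaly score, so suspicious conditions
    reduce actuation authority instead of merely raising an alarm."""
    return u / (1.0 + alpha * max(score, 0.0))
```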
|
| |
| 09:00-10:30, Paper ThI1I.179 | Add to My Program |
| Track Any Motions under Any Disturbances |
|
| Zhang, Zhikai | Tsinghua University |
| Guo, Jun | Tsinghua University |
| Chen, Chao | Peking University |
| Wang, Jilong | Galaxy General Robot Co., Ltd |
| Lin, Chenghuai | Delft University of Technology |
| Lian, Yunrui | Tsinghua University |
| Xue, Han | Tsinghua University |
| Wang, Zhenrong | Galbot |
| Liu, Maoqi | Shandong University |
| Lyu, Jiangran | Peking University |
| Liu, Huaping | Tsinghua University |
| Wang, He | Peking University |
| Yi, Li | Tsinghua University |
Keywords: Humanoid Robot Systems, Legged Robots, Whole-Body Motion Planning and Control
Abstract: A foundational humanoid motion tracker is expected to be able to track diverse, highly dynamic, and contact-rich motions. More importantly, it needs to operate stably in real-world scenarios against various dynamics disturbances, including terrains, external forces, and physical property changes, for general practical usage. To achieve this goal, we propose Any2Track (Track Any motions under Any disturbances), a two-stage RL framework to track various motions under multiple disturbances in the real world. Any2Track reformulates dynamics adaptivity as an additional capability on top of basic action execution and consists of two key components: AnyTracker and AnyAdapter. AnyTracker is a general motion tracker with a series of careful designs to track various motions within a single policy. AnyAdapter is a history-informed adaptation module that endows the tracker with online dynamics adaptivity to overcome the sim-to-real gap and multiple real-world disturbances. We deploy Any2Track on Unitree G1 hardware and achieve successful sim-to-real transfer in a zero-shot manner. Any2Track performs remarkably well in tracking various motions under multiple real-world disturbances.
|
| |
| 09:00-10:30, Paper ThI1I.180 | Add to My Program |
| MS-CRL: Multi-Scale Global Path Planning with Progressive Curriculum Reinforcement Learning |
|
| Zhou, Nan | University of Electronic Science and Technology of China |
| Hu, Xuqing | University of Electronic Science and Technology of China |
| Zhou, Yixin | University of Electronic Science and Technology of China |
| Zhu, Rui | University of Electronic Science and Technology of China |
| Zhou, Fan | University of Electronic Science and Technology of China |
| Li, Ye | University of Electronic Science and Technology of China |
| Yin, Guangqiang | University of Electronic Science and Technology of China |
Keywords: Motion and Path Planning, Reinforcement Learning, AI-Based Methods
Abstract: Global path planning provides high-level guidance for autonomous navigation, supplying reference paths for downstream navigation and control modules. Deep Reinforcement Learning (DRL) has shown strong potential in this domain, but existing methods struggle with multi-scale map inputs. This limitation arises from inconsistent representations across different map sizes and from trajectory length variations, which hinder feature extraction and destabilize policy learning. To address these challenges, we propose the Progressive Multi-Scale Curriculum Reinforcement Learning (MS-CRL) framework. MS-CRL incorporates a progressive curriculum reinforcement learning algorithm (ProgCRL) that mitigates instability from trajectory length discrepancies, a unified multi-scale representation (UniMS) that normalizes spatial scales and resolves representation inconsistencies, and a Global-Local Fusion Network (GLFNet) that fully extracts both global and local features from the new representation for robust cross-scale policy learning. Extensive experiments on multi-scale map datasets demonstrate that MS-CRL enables effective global path planning, stabilizes policy learning, and achieves superior performance in path success rate, path quality, and planning efficiency, while significantly improving training efficiency and cross-scale adaptability compared with state-of-the-art baselines.
|
| |
| 09:00-10:30, Paper ThI1I.181 | Add to My Program |
| VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving |
|
| Xu, Zhefan | Carnegie Mellon University |
| Jerfel, Ghassen | Waymo LLC |
| Haliem, Marina | Waymo LLC |
| Zhao, Qi | Waymo LLC |
| Kang, Jeonhyung | Waymo LLC |
| Refaat, Khaled | Waymo LLC |
Keywords: Intelligent Transportation Systems, Autonomous Agents, Deep Learning Methods
Abstract: The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model’s rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM’s trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
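The DPO objective used for finetuning is standard; a minimal sketch on trajectory log-likelihoods (variable names are illustrative) looks like:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization on rollout log-likelihoods.
    Inputs are summed log-probs of the VLM-preferred / dispreferred rollouts
    under the policy being finetuned and under the frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```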
|
| |
| 09:00-10:30, Paper ThI1I.182 | Add to My Program |
| Radio-Based Multi-Robot Odometry and Relative Localization |
|
| Martínez-Silva, Andrés | Universidad Pablo De Olavide |
| Alejo, David | University Pablo De Olavide |
| Merino, Luis | Universidad Pablo De Olavide |
| Caballero, Fernando | Universidad Pablo De Olavide |
Keywords: Range Sensing, Localization, Multi-Robot Systems
Abstract: Radio-based methods such as Ultra-Wideband (UWB) and RAdio Detection And Ranging (radar), which have traditionally seen limited adoption in robotics, are experiencing a boost in popularity thanks to their robustness to harsh environmental conditions and cluttered environments. This work proposes a multi-robot UGV-UAV localization system that leverages the two technologies with inexpensive and readily-available sensors, such as Inertial Measurement Units (IMUs) and wheel encoders, to estimate the relative position of an aerial robot with respect to a ground robot. The first stage of the system pipeline includes a nonlinear optimization framework to trilaterate the location of the aerial platform based on UWB range data, and a radar pre-processing module with loosely coupled ego-motion estimation which has been adapted for a multi-robot scenario. Then, the pre-processed radar data as well as the relative transformation are fed to a pose-graph optimization framework with odometry and inter-robot constraints. The system, implemented for the Robot Operating System (ROS 2) with the Ceres optimizer, has been validated in Software-in-the-Loop (SITL) simulations and on a real-world dataset. The proposed relative localization module outperforms state-of-the-art closed-form methods, which are less robust to noise. Our SITL environment includes a custom Gazebo plugin for generating realistic UWB measurements modeled after real data. Conveniently, the proposed factor graph formulation makes the system readily extensible to full Simultaneous Localization And Mapping (SLAM). Finally, all the code and experimental data are publicly available to support reproducibility and to serve as a common open dataset for benchmarking.
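A minimal sketch of the UWB trilateration stage, posed as nonlinear least squares on range residuals (illustrative only; the paper's optimization framework includes additional terms and robustification):

```python
import numpy as np
from scipy.optimize import least_squares

def trilaterate_uwb(anchors: np.ndarray, ranges: np.ndarray, x0=None) -> np.ndarray:
    """Estimate the UAV position from UWB ranges to anchors on the UGV
    by nonlinear least squares on the range residuals.

    anchors: (M, 3) anchor positions in the UGV frame.
    ranges:  (M,)  measured distances to each anchor.
    """
    if x0 is None:
        # Crude initial guess: above the anchor centroid.
        x0 = anchors.mean(axis=0) + np.array([0.0, 0.0, 1.0])
    residual = lambda p: np.linalg.norm(anchors - p, axis=1) - ranges
    return least_squares(residual, x0, loss="huber").x
```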
|
| |
| 09:00-10:30, Paper ThI1I.183 | Add to My Program |
| Multi-Modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments |
|
| Encinar Gonzalez, Laura Alejandra | KTH |
| Folkesson, John | KTH |
| Triebel, Rudolph | German Aerospace Center (DLR) |
| Giubilato, Riccardo | German Aerospace Center (DLR) |
Keywords: Space Robotics and Automation, Localization, Field Robots
Abstract: Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences compatible with SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF
|
| |
| 09:00-10:30, Paper ThI1I.184 | Add to My Program |
| Gravity-Assisted Shape-Locking Articulated Discrete Serial Robot for Ceiling-Mounted Manipulation in Patient Care |
|
| Lee, Seonhun | University of Massachusetts Amherst |
| Sup IV, Frank | University of Massachusetts Amherst |
Keywords: Actuation and Joint Mechanisms, Mechanism Design, Medical Robots and Systems
Abstract: Continuum robots are promising for assistive manipulation, but often lack the stiffness and payload capacity required for real-world tasks. This paper investigates the feasibility of a novel dual-mode, gravity-assisted ceiling-mounted articulated discrete serial robot that transitions between passive and active states using a friction-based shape-locking mechanism. In passive mode, joints are unlocked, allowing for chain-like flexibility similar to that of ceiling hoists. In active mode, joints are locked, allowing for rigid and accurate manipulation. To evaluate feasibility, we implemented a reduced-scale prototype with two passive joints and one active joint. We tracked its accuracy across 300 iterations of point-to-point motion in a 2D plane. Results show high repeatability and robustness, highlighting the potential of this architecture for ceiling-mounted manipulation. Beyond healthcare tasks such as patient handling, this approach contributes a scalable actuation and shape-locking strategy for articulated discrete serial robots in constrained environments.
|
| |
| 09:00-10:30, Paper ThI1I.185 | Add to My Program |
| GAF: Gaussian Action Field As a 4D Representation for Dynamic World Modeling in Robotic Manipulation |
|
| Chai, Ying | Tsinghua University |
| Deng, Litao | Beijing Normal University, Shadow AI |
| Shao, Ruizhi | Tsinghua University |
| Zhang, Jiajun | Tsinghua University |
| Lv, Kangchen | Tsinghua University |
| Xing, Liangjun | Tsinghua University |
| Li, Xiang | Tsinghua University |
| Zhang, Hongwen | Beijing Normal University |
| Liu, Yebin | Tsinghua University |
Keywords: Visual Learning, Perception for Grasping and Manipulation, Learning from Demonstration
Abstract: Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a Vision-to-4D-to-Action (V-4D-A) framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to obtain more precise actions. Extensive experiments demonstrate significant improvements, with GAF achieving gains of +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS in reconstruction quality, while improving the average success rate in robotic manipulation tasks by 7.3% over state-of-the-art methods.
|
| |
| 09:00-10:30, Paper ThI1I.186 | Add to My Program |
| Model-Based Engineering Framework for Soft Continuum Robots |
|
| Witucki, Linus | Karlsruhe Institute of Technology (KIT) |
| Rösler, Jan Eike | Karlsruhe Institute of Technology (KIT) |
| Barth, Mike | Karlsruhe Institute of Technology (KIT) |
Keywords: Cellular and Modular Robots, Hardware-Software Integration in Robotics, Flexible Robotics
Abstract: Soft continuum robots are gaining attention for their potential to enable inherently safe and adaptive human-robot collaboration, especially in dynamic industrial environments. However, the development of these robots varies drastically and no standardization exists. This is particularly problematic for soft continuum robots because of the variety of actuation methods and control strategies. This paper addresses the challenge of engineering soft continuum robots by introducing a generalized framework that enables hardware abstraction and controller reuse. The approach combines a modular robot design with an extension of the Unified Robot Description Format (URDF) information model to support soft continuum robotics, enabling the decoupling of hardware and software development. For concept validation, a modular tendon-driven continuum robot was developed and integrated into the framework. The extended Unified (Continuum) Robot Description Format (U(C)RDF) enables visualization and controller parameterization through standardized interfaces, allowing for reusable software components across different actuation principles. This approach achieves a flexible and scalable engineering process for soft continuum robots, bridging the gap between research prototypes and industrial deployment. It lays the foundation for future developments in model-based design, automated control, and interoperability of soft continuum robotic systems.
|
| |
| 09:00-10:30, Paper ThI1I.187 | Add to My Program |
| Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos |
|
| Ramazzina, Andrea | Mercedes-Benz AG - Technical University of Munich |
| Giammarino, Vittorio | Purdue University |
| El Hariry, Matteo | University of Luxembourg |
| Bijelic, Mario | Princeton University |
Keywords: Imitation Learning, Reinforcement Learning, Deep Learning for Visual Perception
Abstract: Imitation from videos often fails when expert demonstrations and learner environments exhibit domain shifts, such as discrepancies in lighting, color, or texture. While visual randomization partially addresses this problem by augmenting training data, it remains computationally intensive and inherently reactive, struggling with unseen scenarios. We propose a different approach: instead of randomizing appearances, we eliminate their influence entirely by rethinking the sensory representation itself. Inspired by biological vision systems that prioritize temporal transients (e.g., retinal ganglion cells) and by recent sensor advancements, we introduce event-inspired perception for visually robust imitation. Our method converts standard RGB videos into a sparse, event-based representation that encodes temporal intensity gradients, discarding static appearance features. This biologically grounded approach disentangles motion dynamics from visual style, enabling robust visual imitation from observations even in the presence of visual mismatches between expert and agent environments. By training policies on event streams, we achieve invariance to appearance-based distractors without requiring computationally expensive and environment-specific data augmentation techniques. Experiments across the DeepMind Control Suite and the Adroit platform for dynamic dexterous manipulation show the efficacy of our method.
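A minimal sketch of the event-inspired conversion, assuming grayscale frames in [0, 1] and a single contrast threshold (the paper's representation may differ in detail):

```python
import numpy as np

def frames_to_events(frames: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Convert a grayscale video (T, H, W) in [0, 1] into a sparse
    event-like tensor: +1 / -1 where the log-intensity change between
    consecutive frames exceeds the contrast threshold, 0 elsewhere.
    Static appearance (color, texture, lighting level) cancels out."""
    log_intensity = np.log(frames + 1e-6)
    diff = log_intensity[1:] - log_intensity[:-1]
    events = np.zeros_like(diff, dtype=np.int8)
    events[diff > threshold] = 1
    events[diff < -threshold] = -1
    return events
```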
|
| |
| 09:00-10:30, Paper ThI1I.188 | Add to My Program |
| Moth: A Low-Cost IR-Based Approach towards Autonomous Precision Drone Landing |
|
| Liu, Yanchen | Columbia University |
| Zhao, Minghui | Columbia University |
| Hou, Kaiyuan | Columbia University |
| Xia, Junxi | Northwestern University |
| Carver, Charles | Columbia University |
| Xia, Stephen | Northwestern University |
| Zhou, Xia | Columbia University |
| Jiang, Xiaofan | Columbia University |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Mechanics and Control, Sensor-based Control
Abstract: As micro-drones become increasingly deployed in indoor environments for applications ranging from warehouse inspection to emergency response, the challenge of precise automated landing emerges as a crucial barrier to their practical operation and ubiquitous adoption. Existing landing approaches often require complex hardware and substantial computation or perform unreliably indoors, making them impractical for palm-sized microdrones. We propose Moth, a low-cost infrared light-based solution that targets precise and efficient landing of low-resource microdrones. Moth consists of an infrared light source at the landing station along with an energy-efficient photodiode (PD) sensing platform attached to the bottom of the drone. At a cost under 83 USD, Moth achieves comparable performance to vision-based methods but at a fraction of the energy consumption and computation. Moth requires only three PDs without any complex pattern recognition models to land the drone accurately, under 10 cm of error, from up to 11.1 meters away.
|
| |
| 09:00-10:30, Paper ThI1I.189 | Add to My Program |
| RoboMorph: Evolving Robot Morphology Using Large Language Models |
|
| Qiu, Kevin | University of Warsaw, IDEAS NCBR |
| Pałucki, Władysław | University of Warsaw |
| Ciebiera, Krzysztof | University of Warsaw |
| Fijałkowski, Paweł | University of Warsaw |
| Cygan, Marek | University of Warsaw, Nomagic |
| Kuciński, Łukasz | University of Warsaw, IDEAS NCBR, Polish Academy of Sciences |
Keywords: Methods and Tools for Robot System Design, Evolutionary Robotics, Cellular and Modular Robots
Abstract: We introduce RoboMorph, an automated approach for generating and optimizing modular robot designs using large language models (LLMs) and evolutionary algorithms. Each robot design is represented by a structured grammar, and we use LLMs to efficiently explore this design space. Traditionally, such exploration is time-consuming and computationally intensive. Using a best-shot prompting strategy combined with reinforcement learning (RL)-based control evaluation, RoboMorph iteratively refines robot designs within an evolutionary feedback loop. Across four terrain types, RoboMorph discovers diverse, terrain-specialized morphologies, including wheeled quadrupeds and hexapods, that match or outperform designs produced by Robogrammar's graph-search method. These results demonstrate that LLMs, when coupled with evolutionary selection, can serve as effective generative operators for automated robot design. Our project page and code are available at https://robomorph.github.io.
|
| |
| 09:00-10:30, Paper ThI1I.190 | Add to My Program |
| FLIP: Flowability-Informed Powder Weighing |
|
| Radulov, Nikola | University of Liverpool |
| Wright, Alex | University of Liverpool |
| Little, Thomas | University of Liverpool |
| Cooper, Andrew Ian | University of Liverpool |
| Pizzuto, Gabriella | University of Liverpool |
Keywords: Robotics and Automation in Life Sciences
Abstract: Autonomous manipulation of powders remains a significant challenge for robotic automation in scientific laboratories. The inherent variability and complex physical interactions of powders in flow, coupled with variability in laboratory conditions, necessitate adaptive automation. This work introduces FLIP, a flowability-informed powder weighing framework designed to enhance robotic policy learning for granular material handling. Our key contribution lies in using material flowability, quantified by the angle of repose, to optimise physics-based simulations through Bayesian inference. This yields material-specific simulation environments capable of generating accurate training data, which reflects diverse powder behaviours, for training ‘robot chemists’. Building on this, FLIP integrates quantified flowability into a curriculum learning strategy, fostering efficient acquisition of robust robotic policies by gradually introducing more challenging, less flowable powders. We validate the efficacy of our method on a robotic powder weighing task under real-world laboratory conditions. Experimental results show that FLIP with a curriculum strategy achieves a low dispensing error of 2.12 ± 1.53 mg, outperforming methods that do not leverage flowability data, such as domain randomisation (6.11 ± 3.92 mg). These results demonstrate FLIP’s improved ability to generalise to previously unseen, more cohesive powders and to new target masses.
|
| |
| 09:00-10:30, Paper ThI1I.191 | Add to My Program |
| Mapping-Guided Task Discovery and Allocation for Robotic Inspection of Underwater Structures |
|
| Ruediger, Marina | University of Washington |
| Banerjee, Ashis | University of Washington |
Keywords: Marine Robotics, Field Robots, Task Planning
Abstract: This paper introduces Mapping-based Tasks for Inspection: Discovery and Allocation (Map-TIDAL), a method for generating environmentally informed tasks and distributing them in a heterogeneous multi-robot system for visual inspection of underwater structures. Map-TIDAL leverages the individual robot maps generated during SLAM (without prior knowledge of the environment) and tasks from all the robots through a communication-aware auction process to determine additional inspection locations as the structures are further explored by the robots. This allows the method to adaptively focus on geometrically interesting areas that need detailed inspection while still maintaining good overall coverage with a reasonably small number of inspection tasks. Experiments on both saline and fresh water tanks show that Map-TIDAL yields better coverage while inspecting areas with interesting geometric features more thoroughly, using equal or fewer inspection locations compared to prevalent coverage methods using Voronoi distributions and boustrophedon patterns.
|
| |
| 09:00-10:30, Paper ThI1I.192 | Add to My Program |
| MOVE: A Simple Motion-Based Data Collection Paradigm for Spatial Generalization in Robotic Manipulation |
|
| Wang, Huanqian | Tsinghua University |
| Chen, Chi Bene | Tsinghua University |
| Yue, Yang | Tsinghua Univeristy |
| Tao, Danhua | Southeast University |
| Guo, Tong | Tsinghua University |
| Xie, Shaoxuan | Beijing Academy of Artificial Intelligence |
| Huang, Denghang | Chongqing University |
| Song, Shiji | Tsinghua University |
| Yao, Guocai | Beijing Academy of Artificial Intelligence |
| Huang, Gao | Tsinghua University |
Keywords: AI-Based Methods, Data Sets for Robot Learning, Imitation Learning
Abstract: Imitation learning has shown immense promise for robotic manipulation, yet its practical deployment is fundamentally constrained by data scarcity. Despite prior work on collecting large-scale datasets, there still remains a significant gap to robust spatial generalization. We identify a key limitation: individual trajectories, regardless of their length, are typically collected from a single, static spatial configuration of the environment. This includes fixed object and target spatial positions as well as unchanging camera viewpoints, which significantly restricts the diversity of spatial information available for learning. To address this critical bottleneck in data efficiency, we propose MOtion-Based Variability Enhancement (MOVE), a simple yet effective data collection paradigm that enables the acquisition of richer spatial information from dynamic demonstrations. Our core contribution is an augmentation strategy that injects motion into any movable objects within the environment for each demonstration. This process implicitly generates a dense and diverse set of spatial configurations within a single trajectory. We conduct extensive experiments in both simulation and real-world environments to validate our approach. For example, in simulation tasks requiring strong spatial generalization, MOVE achieves an average success rate of 39.1%, a 76.1% relative improvement over the static data collection paradigm (22.2%), and yields up to 2-5x gains in data efficiency on certain tasks.
|
| |
| 09:00-10:30, Paper ThI1I.193 | Add to My Program |
| Integrating Artificial Vision and Wearable Robotics: Adaptive Assistance Enabled by Manipulation Context Awareness |
|
| Ferrari, Sandro | Technical University of Munich (TUM) |
| Aimi, Emanuele | Technical University of Munich (TUM) |
| Missiroli, Francesco | Technical University of Munich (TUM) |
| Masiero, Federico | Technical University of Munich (TUM) |
| Casadio, Maura | University of Genoa |
| Masia, Lorenzo | Technical University of Munich (TUM) |
Keywords: Wearable Robotics, Prosthetics and Exoskeletons, Physical Human-Robot Interaction
Abstract: Occupational exoskeletons are emerging as a promising solution for industrial applications, providing support to reduce fatigue and the risk of musculoskeletal disorders. One of the main challenges limiting their widespread adoption is that most existing devices cannot deliver real-time, adaptable, and context-aware assistance. This paper presents the first fully vision-driven control strategy for a bimanual upper-limb soft exoskeleton, enabling adaptive assistance during industrial tool manipulation. The approach integrates three modules: tool recognition and segmentation, hand tracking with gesture recognition, and a fusion layer that ensures reliable understanding of the manipulation context. This allows modulation of lifting assistance in real time according to the weight of the grasped object. Experiments with human participants demonstrated that the proposed approach reduces biceps activation by more than 50% compared to the no-support condition, while operating in real time on embedded hardware. The method is robust to hand–object occlusions, camera repositioning, and dynamic environments, demonstrating its practicality for industrial deployment. Overall, this work establishes vision-based control as a scalable solution for ergonomic, adaptive exoskeletons that enhance safety and productivity in demanding workplaces.
|
| |
| 09:00-10:30, Paper ThI1I.194 | Add to My Program |
| Probabilistic Topological Map Inference with Belief Propagation |
|
| Wang, Houzhe | Imperial College London |
| Jiang, Jingqi | Imperial College London |
| Xu, Shida | Imperial College London |
| Yeatman, Eric | Imperial College London |
| Wang, Sen | Imperial College London |
Keywords: Mapping, SLAM, Localization
Abstract: Metric Simultaneous Localization and Mapping (SLAM) prioritizes geometric accuracy of estimated robot poses and maps. However, in many real-world robot applications, such as inspection robots operating inside pipelines or other confined network environments, metric accuracy is less critical than correctly capturing the underlying topological connectivity. In this paper, we investigate back-end optimization for topological mapping/SLAM, and propose a probabilistic topological map inference algorithm. Given noisy front-end measurements, our approach explicitly models the topological map inference problem within a factor graph framework. It performs inference using belief propagation, which yields a posterior distribution over multiple plausible topological maps rather than a single estimate. We evaluate our method on topologies derived from an open-source pipeline network dataset, spanning various topology sizes and degrees of perceptual aliasing. Extensive experiments demonstrate that our algorithm infers high-quality topological maps across varying conditions.
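For intuition, the sketch below runs exact sum-product belief propagation on a chain of discrete variables; topological maps generally yield loopy factor graphs, where the same message updates are iterated rather than applied in a single sweep:

```python
import numpy as np

def sum_product_chain(unaries, pairwise):
    """Exact sum-product BP on a chain of discrete variables.

    unaries:  list of (K,) nonnegative potentials, one per variable.
    pairwise: list of (K, K) potentials between consecutive variables.
    Returns normalized marginals, one (K,) array per variable.
    """
    n = len(unaries)
    fwd = [None] * n   # messages passed left -> right
    bwd = [None] * n   # messages passed right -> left
    fwd[0] = np.ones_like(unaries[0])
    bwd[-1] = np.ones_like(unaries[-1])
    for i in range(1, n):
        m = (unaries[i - 1] * fwd[i - 1]) @ pairwise[i - 1]
        fwd[i] = m / m.sum()
    for i in range(n - 2, -1, -1):
        m = pairwise[i] @ (unaries[i + 1] * bwd[i + 1])
        bwd[i] = m / m.sum()
    marginals = [unaries[i] * fwd[i] * bwd[i] for i in range(n)]
    return [m / m.sum() for m in marginals]
```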
|
| |
| 09:00-10:30, Paper ThI1I.195 | Add to My Program |
| Robotic Grasping and Placement Controlled by EEG-Based Hybrid Visual and Motor Imagery |
|
| Liu, Yichang | Fudan University |
| Wang, Tianyu | Fudan University |
| Ye, Ziyi | Fudan University |
| Li, Yawei | ETH Zurich |
| Jiang, Yu-Gang | Fudan University |
| Wang, Shouyan | Fudan University |
| Fu, Yanwei | Fudan University |
Keywords: Perception for Grasping and Manipulation, Human-Robot Collaboration, Brain-Machine Interfaces
Abstract: We present a framework that integrates EEG-based visual and motor imagery (VI/MI) with robotic control to enable real-time, intention-driven grasping and placement. Motivated by the promise of BCI-driven robotics to enhance human-robot interaction, this system bridges neural signals with physical control by deploying offline-pretrained decoders in a zero-shot manner within an online streaming pipeline. This establishes a dual-channel intent interface that translates visual intent into robotic actions, with VI identifying objects for grasping and MI determining placement poses, enabling intuitive control over both what to grasp and where to place. The system operates solely on EEG via a cue-free imagery protocol, achieving integration and online validation. Implemented on a Base robotic platform and evaluated across diverse scenarios, including occluded targets or varying participant postures, the system achieves online decoding accuracies of 40.23% (VI) and 62.59% (MI), with an end-to-end task success rate of 20.88%. These results demonstrate that high-level visual cognition can be decoded in real time and translated into executable robot commands, bridging the gap between neural signals and physical interaction, and validating the flexibility of a purely imagery-based BCI paradigm for practical human–robot collaboration.
|
| |
| 09:00-10:30, Paper ThI1I.196 | Add to My Program |
| Multimodal Belief-Space Covariance Steering with Active Probing and Influence for Interactive Driving |
|
| Chakravarty, Devodita | Indian Institute of Technology Kharagpur |
| Dolan, John M. | Carnegie Mellon University |
| Lyu, Yiwei | Texas A&M University |
Keywords: Intelligent Transportation Systems
Abstract: Autonomous driving in complex traffic requires reasoning under uncertainty. Common approaches rely on prediction-based planning or risk-aware control, but these are typically treated in isolation, limiting their ability to capture the coupled nature of action and inference in interactive settings. This gap becomes especially critical in uncertain scenarios, where simply reacting to predictions can lead to unsafe maneuvers or overly conservative behavior. Our central insight is that safe interaction requires not only estimating human behavior but also shaping it when ambiguity poses risks. To this end, we introduce a hierarchical belief model that structures human behavior across coarse discrete intents and fine motion modes, updated via Bayesian inference for interpretable multi-resolution reasoning. On top of this, we develop an active probing strategy that identifies when multimodal ambiguity in human predictions may compromise safety and plans disambiguating actions that both reveal intent and gently steer human decisions toward safer outcomes. Finally, a runtime risk-evaluation layer based on Conditional Value-at-Risk (CVaR) ensures that all probing actions remain within human risk tolerance during influence. Our simulations in lane-merging and unsignaled intersection scenarios demonstrate that our approach achieves higher success rates and shorter completion times compared to existing methods. These results highlight the benefit of coupling belief inference, probing, and risk monitoring, yielding a principled and interpretable framework for planning under uncertainty.
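A minimal sketch of the empirical CVaR check used as the runtime risk-evaluation layer (the cost sampling and the risk budget are assumptions here, not the paper's exact formulation):

```python
import numpy as np

def cvar(costs: np.ndarray, alpha: float = 0.95) -> float:
    """Empirical Conditional Value-at-Risk: the mean of the worst
    (1 - alpha) fraction of sampled costs."""
    var = np.quantile(costs, alpha)      # Value-at-Risk at level alpha
    tail = costs[costs >= var]           # the worst-case tail
    return float(tail.mean())

# A probing action would be admissible only if its tail risk stays
# within the human risk tolerance, e.g.:
# admissible = cvar(sampled_costs, 0.95) <= risk_budget
```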
|
| |
| 09:00-10:30, Paper ThI1I.197 | Add to My Program |
| Order Matters: On Parameter-Efficient Image-To-Video Probing for Recognizing Nearly Symmetric Actions |
|
| Thiyakesan Ponbagavathi, Thinesh | University of Hildesheim |
| Roitberg, Alina | University of Hildesheim |
Keywords: Gesture, Posture and Facial Expressions, Human-Robot Collaboration, Recognition
Abstract: Fine-grained understanding of human actions is essential for safe and intuitive human–robot interaction. We study the challenge of recognizing nearly symmetric actions, such as picking up vs. placing down a tool or opening vs. closing a drawer. These actions are common in close human-robot collaboration, yet they are rare and largely overlooked in mainstream vision frameworks. Pretrained vision foundation models (VFMs) are often adapted using probing, valued in robotics for its efficiency and low data needs, or parameter-efficient fine-tuning (PEFT), which adds temporal modeling through adapters or prompts. However, our analysis shows that probing is permutation-invariant and blind to frame order, while PEFT is prone to overfitting on smaller HRI datasets and less practical in real-world robotics due to compute constraints. To address this, we introduce STEP (Self-attentive Temporal Embedding Probing), a lightweight extension to probing that models temporal order via frame-wise positional encodings, a global CLS token, and a simplified attention block. Compared to conventional probing, STEP improves accuracy by 4–10% on nearly symmetric actions and 6–15% overall across action recognition benchmarks in human-robot interaction, industrial assembly, and driver assistance. Beyond probing, STEP surpasses heavier PEFT methods and even outperforms fully fine-tuned models on all three benchmarks, establishing a new state of the art. Code and models will be made publicly available.
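A simplified probing head in the spirit of STEP (illustrative, not the authors' exact architecture): frozen per-frame features plus learnable positional encodings, a CLS token, and one self-attention block, which makes frame order visible to the probe:

```python
import torch
import torch.nn as nn

class STEPHead(nn.Module):
    """Order-aware probing head over frozen per-frame VFM features."""

    def __init__(self, dim: int, num_frames: int, num_classes: int, heads: int = 4):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_frames, dim))   # frame-wise positions
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))            # global CLS token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim) from a frozen vision foundation model.
        x = frame_feats + self.pos
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        x = self.norm(x + self.attn(x, x, x, need_weights=False)[0])
        return self.fc(x[:, 0])  # classify from the CLS token
```

Because only this head is trained, the approach keeps probing's low parameter count while breaking its permutation invariance.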
|
| |
| 09:00-10:30, Paper ThI1I.198 | Add to My Program |
| OHMM-PA: A Learning from Demonstration Approach Using Online Hidden Markov Models with Path Planning |
|
| Irsperger, Jan | Technical University of Munich / School of Computation, Information and Technology |
| Fernandez Prado, Diego | Technical University of Munich / School of Computation, Information and Technology |
| Steinbach, Eckehard | Technical University of Munich |
|
|
| |
| 09:00-10:30, Paper ThI1I.199 | Add to My Program |
| Bio-Inspired Rolling-Disk Continuum Robot: Logarithmic Spiral and Constant Curvature Design with Contraction Capabilities |
|
| Firdaus, Md Modassir | Indian Institute of Technology Gandhinagar |
| Mallru, Vikranth | IIT Bhubaneswar |
| Malodia, Harsh | School of Engineering and Applied Science, Ahmedabad University |
| Vadali, Madhu | Indian Institute of Technology Gandhinagar |
Keywords: Soft Robot Materials and Design, Soft Robot Applications, Tendon/Wire Mechanism
Abstract: Soft robotics draws inspiration from biological appendages like seahorse tails, octopus arms, and elephant trunks, which demonstrate remarkable flexibility and diverse functionalities. While advances in soft robotics have enabled delicate manipulation, safe human interaction, and medical applications, existing systems lag behind mimicking nature with a change in length. This paper presents a novel rolling-disk-based continuum robot design that replicates natural logarithmic spiral geometry while overcoming limitations of previous bio-inspired systems, including fixed length, complex elastic interconnects, and absent central lumens. The proposed 3D-printable architecture enables cost-effective rapid prototyping with adjustable parameters, achieving both logarithmic spiral and constant curvature bending, consistent with established kinematic models. A central hollow passage supports tool integration for minimally invasive procedures, while contraction capability enables dynamic length adjustment. Comprehensive mathematical analysis, CAD development, and SOFA simulations validate the design conceptualisation. Experimental demonstrations confirm bending, contraction, and grasping capabilities across diverse object geometries, establishing a foundation for scalable, adaptable tendon-driven continuum robots that bridge biological inspiration with practical engineering implementation.
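A small sketch of the logarithmic spiral parameterization r = a·e^(bθ) that the rolling-disk backbone approximates when curled (parameter names are illustrative):

```python
import numpy as np

def log_spiral_disk_centers(a: float, b: float, n_disks: int, theta_max: float):
    """Place disk centers along the logarithmic spiral r = a * exp(b * theta).

    a: initial radius; b: growth rate; theta_max: total wrap angle (rad).
    Returns an (n_disks, 2) array of planar positions."""
    theta = np.linspace(0.0, theta_max, n_disks)
    r = a * np.exp(b * theta)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
```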
|
| |
| 09:00-10:30, Paper ThI1I.200 | Add to My Program |
| Where I Am & Where to Go: Egocentric Indoor Scene Perception with Agent Interaction for Remote Embodied Visual Grounding |
|
| Zhang, Hongtao | Donghua University |
| Tang, Yili | Donghua University |
| Gao, Yuan | The Chinese University of Hong Kong |
| Zhang, Jue | Donghua University |
| Zhang, Jidong | E-Surfing Digital Life Technology Co., Ltd |
| Zhao, Mingbo | Donghua University |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Visual Learning
Abstract: Embodied Referring Expression Grounding (REVERIE) is a Vision-and-Language Navigation (VLN) task that better reflects real-world human instructions. Unlike conventional VLN, REVERIE is more challenging as agents must navigate in unseen environments and ground remote objects described by short, high-level commands. This requires agents not only to plan a route without detailed step-by-step guidance but also to accurately localize the target object at the destination. Existing VLN agents mainly emphasize navigation performance while overlooking object grounding success, leading to a significant performance gap. We introduce a model-agnostic interaction framework with two auxiliary agents, Where-I-Am (WIA) and Where-to-Go (W2G). Specifically, WIA predicts the current room type from environmental observations, while W2G infers the target room type from high-level instructions. Our framework is plug-and-play and can be integrated with various VLN models. On the REVERIE benchmark, it improves navigation success rate (SR) by 7.78% and remote grounding success (RGS) by 5.48% over the baselines, demonstrating the effectiveness and generality of our design. Furthermore, in challenging unseen test environments, our framework achieves competitive results on the REVERIE dataset, outperforming the previous state-of-the-art VLN agent (without additional training data) with a 2.27% gain in RGS.
|
| |
| 09:00-10:30, Paper ThI1I.201 | Add to My Program |
| Teaching to Individual Needs: Bidirectional Teacher-Student Learning for Wheeled-Legged Locomotion |
|
| Li, Guangsheng | Dalian University of Technology |
| Wu, Charles | MagicLab Robotics Technology Co., Ltd |
| Zheng, XinHua | Dalian University of Technology |
| Zhu, Shiyu | Dalian University of Technology |
| Liu, Shenglan | Dalian University of Technology |
Keywords: Legged Robots, Reinforcement Learning, Imitation Learning
Abstract: Reinforcement Learning (RL) enables robust and adaptive locomotion in legged and wheeled-legged robots. A common approach is the Teacher-Student (TS) paradigm, in which a teacher policy with privileged information supervises a proprioceptive student. While the TS paradigm has proven effective on legged robots, we encounter two critical issues when applying it to wheeled-legged robots. One issue is multimodal confusion, where teacher actions become multimodal under the student proprioceptive observations, resulting in the student generating averaged action modes. The other is low imitability of teacher actions, as the teacher overlooks their reproducibility by the student. To address these issues, we propose Teaching to Individual Needs (TIN), a bidirectional TS framework. To mitigate multimodal confusion within the student policy, we design a Highest-Weight Component Mixture Density Network (HWC-MDN). By utilizing HWC-MDN, TIN student can explicitly model multimodal action distributions and outputs the highest-weight component. To improve imitability, we propose an Imitation-Aware Reward (IAR) that encourages the teacher to generate more reproducible actions by the student. Simulation experiments show that TIN significantly improves both training efficiency and traversability. Real-world tests illustrate that TIN enables the wheeled-legged robot MagicDog-W to traverse 45 cm obstacles and ascend 45° slopes.
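A hedged sketch of a mixture density head that returns the mean of the highest-weight component at inference, the mechanism HWC-MDN uses to avoid mode averaging (the authors' network details differ):

```python
import torch
import torch.nn as nn

class HWCMDNHead(nn.Module):
    """Mixture density head; `act` outputs the highest-weight component's
    mean rather than the mixture average."""

    def __init__(self, feat_dim: int, act_dim: int, n_comp: int = 5):
        super().__init__()
        self.n_comp, self.act_dim = n_comp, act_dim
        self.logits = nn.Linear(feat_dim, n_comp)
        self.means = nn.Linear(feat_dim, n_comp * act_dim)
        self.log_std = nn.Linear(feat_dim, n_comp * act_dim)

    def forward(self, h: torch.Tensor):
        w = torch.softmax(self.logits(h), dim=-1)                # (B, C)
        mu = self.means(h).view(-1, self.n_comp, self.act_dim)   # (B, C, A)
        std = self.log_std(h).view(-1, self.n_comp, self.act_dim).exp()
        return w, mu, std

    @torch.no_grad()
    def act(self, h: torch.Tensor) -> torch.Tensor:
        w, mu, _ = self.forward(h)
        best = w.argmax(dim=-1)                                  # (B,)
        return mu[torch.arange(mu.size(0)), best]                # (B, A)
```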
|
| |
| 09:00-10:30, Paper ThI1I.202 | Add to My Program |
| CAPS: Context-Aware Priority Sampling for Enhanced Imitation Learning in Autonomous Driving |
|
| Mirkhani, Hamidreza | Huawei Technologies Canada |
| Khamidehi, Behzad | Huawei Technologies Canada |
| Ahmadi, Ehsan | University of Alberta |
| Elmahgiubi, Mohammed | Huawei Technologies Inc |
| Zhang, Weize | Huawei |
| Arasteh, Fazel | Noah's Ark Lab, Huawei |
| Rajguru, Umar | Huawei Technologies Canada |
| Rezaee, Kasra | Huawei Technologies |
| Bai, Dongfeng | Noah's Ark Lab, Huawei Technologies |
Keywords: Imitation Learning, Integrated Planning and Learning, Learning from Demonstration
Abstract: In this paper, we introduce Context-Aware Priority Sampling (CAPS), a novel method designed to enhance data efficiency in learning-based autonomous driving systems. CAPS addresses the challenge of imbalanced datasets in imitation learning by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs). This yields structured and interpretable data representations that help reveal meaningful patterns in the data. These patterns are used to group the data into clusters, with each sample being assigned a cluster ID. The cluster IDs are then used to re-balance the dataset, ensuring that rare yet valuable samples receive higher priority during training. We evaluate our method through closed-loop experiments in the CARLA simulator. The results on Bench2Drive scenarios demonstrate the effectiveness of CAPS in enhancing model generalization, with substantial improvements in both driving score and success rate.
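As a minimal sketch of cluster-based priority sampling in this spirit, assuming cluster IDs have already been obtained from a VQ-VAE codebook, one can weight each sample by the inverse frequency of its cluster; the names and the exact weighting rule below are illustrative assumptions, not the paper's.

```python
# Sketch: re-balance a dataset by sampling rare clusters more often.
import numpy as np

def priority_weights(cluster_ids: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Give rare clusters higher sampling probability (alpha=0 -> uniform)."""
    _, inverse, counts = np.unique(cluster_ids, return_inverse=True,
                                   return_counts=True)
    w = 1.0 / counts[inverse] ** alpha    # inverse cluster frequency per sample
    return w / w.sum()

rng = np.random.default_rng(0)
cluster_ids = rng.integers(0, 8, size=10_000)        # stand-in for VQ-VAE codes
p = priority_weights(cluster_ids)
batch = rng.choice(len(cluster_ids), size=256, p=p)  # re-balanced minibatch
```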
|
| |
| 09:00-10:30, Paper ThI1I.203 | Add to My Program |
| Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning Via Normalizing Flows |
|
| Garg, Shaswat | ArenaX Labs |
| Moezzi, Matin | ArenaX Labs |
| Da Silva, Brandon | ArenaX Labs |
Keywords: Autonomous Agents, Learning from Demonstration, Reinforcement Learning
Abstract: Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. In this work, we introduce normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. We derive new theoretical guarantees, including explicit KL-divergence bounds for real-valued non-volume preserving (RealNVP) policies and PAC-style sample-efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
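A minimal example of the building block such flow policies rely on is the RealNVP affine coupling layer, which provides an exact and cheap log-determinant for likelihood computation; the network sizes below are illustrative assumptions, not the paper's configuration.

```python
# Sketch: one RealNVP-style affine coupling layer with tractable log|det J|.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, 128), nn.ReLU(),
            nn.Linear(128, 2 * (dim - self.d)),
        )

    def forward(self, x):
        # Split, then transform the second half conditioned on the first.
        x1, x2 = x[:, : self.d], x[:, self.d :]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                       # keep scales well-conditioned
        y2 = x2 * s.exp() + t
        log_det = s.sum(dim=-1)                 # exact, cheap log-determinant
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, : self.d], y[:, self.d :]
        s, t = self.net(y1).chunk(2, dim=-1)
        s = torch.tanh(s)
        return torch.cat([y1, (y2 - t) * (-s).exp()], dim=-1)

layer = AffineCoupling(dim=6)
y, log_det = layer(torch.randn(4, 6))   # invertible, with exact likelihood term
```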
|
| |
| 09:00-10:30, Paper ThI1I.204 | Add to My Program |
| Bioinspired Kirigami Capsule Robot for Minimally Invasive Gastrointestinal Biopsy |
|
| Zhao, Ruizhou | The Chinese University of Hong Kong |
| Chu, Yichen | Northeastern University |
| Zhao, Shuwei | The Chinese University of Hong Kong |
| Yue, Wenchao | The Chinese University of Hong Kong |
| Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
| Tang, Raymond Shing-Yan | The Chinese University of Hong Kong, Department of Medicine and Therapeutics |
|
|
| |
| 09:00-10:30, Paper ThI1I.205 | Add to My Program |
| Blinking into Emotion: How Context and LED Frequency Shape Non-Humanoid Robots’ Emotional Transparency |
|
| Scarpato, Patrizio | University of Naples Federico II |
| Raggioli, Luca | University of Naples Federico II |
| Esposito, Raffaella | University of Naples Federico II |
| Rossi, Silvia | Universita' Di Napoli Federico II |
Keywords: Emotional Robotics, Social HRI, Design and Human Factors
Abstract: In social contexts, correctly interpreting communicative signals is essential for understanding an interlocutor’s reactions. Emotional cues provide important context, enhancing mutual understanding and enabling more natural, adaptive interactions. For non-humanoid robots, which typically have more limited interaction capabilities, it could be necessary to combine multiple non-verbal signals and to consider the context in which they are used. In this work, we present the design of non-verbal behaviors that enable a non-humanoid robot to communicate its intended emotional state effectively. Our strategy focuses on the use of LEDs as non-verbal cues along with facial expressions for conveying emotions. We conducted a user study to evaluate the effect of these channels with and without context. Our results show that the robot conveys emotions more transparently when context is included, and that blinking LEDs can be an effective channel for communicating emotion. Results suggest that blinking alone is a minimal but functional cue, with richer models performing better without context. When short contextual sentences and more spaced blink frequencies are added, the blinking-only condition performs on par with, or even better than, the full multimodal model for some emotions and, in particular, with respect to participants’ perceived arousal. This indicates that carefully designed, simple visual cues can be an effective affect channel for non-humanoid robots.
|
| |
| 09:00-10:30, Paper ThI1I.206 | Add to My Program |
| Singularity Analysis of ABB's GoFa 5 Robot Arm |
|
| Refalo, Axel | École De Technologie Supérieure |
| Bonev, Ilian | École De Technologie Supérieure |
| Gosselin, Clement | Université Laval |
Keywords: Kinematics, Mechanism Design, Performance Evaluation and Benchmarking
Abstract: In the past decade, manipulator arms with non-traditional architectures — once found mainly in space and painting applications — have become popular as collaborative robots. Examples include ABB’s YuMi and GoFa, Kinova’s Link 6, and Fanuc’s CRX. These cobots lack closed-form inverse kinematics solutions, making it impossible to unambiguously select one configuration among the 16 (or infinitely many) that correspond to a given end-effector pose, which may create safety risks. Moreover, they exhibit far more singularities than typical manipulators, and most of these singularities are far more complex to describe. Nevertheless, many authors argue that these manipulators provide improved dexterity and a larger workspace. In this paper, we analyze the singularities of ABB’s GoFa using Grassmann line geometry and provide straightforward, sufficient (though conservative) conditions for avoiding them. Finally, given that GoFa can exhibit over a dozen distinct singularities, compared with only three (wrist, shoulder, and elbow) for traditional robot arms, we attempt to quantify which architecture actually possesses more singular and near-singular configurations.
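Alongside such analytical conditions, proximity to a singularity is commonly checked numerically via the smallest singular value of the manipulator Jacobian; the sketch below illustrates this generic test with a stand-in Jacobian and threshold, both assumptions rather than values from the paper.

```python
# Sketch: generic near-singularity test via the Jacobian's least singular value.
import numpy as np

def near_singular(J: np.ndarray, tol: float = 1e-2) -> bool:
    """True if the smallest singular value of the 6xN Jacobian is below tol."""
    sigma_min = np.linalg.svd(J, compute_uv=False)[-1]
    return bool(sigma_min < tol)

J = np.random.default_rng(0).normal(size=(6, 6))   # stand-in Jacobian
J_bad = np.diag([1, 1, 1, 1, 1, 1e-4]) @ J          # nearly rank-deficient
print(near_singular(J), near_singular(J_bad))       # False True
```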
|
| |
| 09:00-10:30, Paper ThI1I.207 | Add to My Program |
| VisuaLLMPlanner - a Maneuver Planner for Automated Vehicles Using Large Language Models |
|
| Neurath, Daniel | Technical University Berlin |
| Schäufele, Bernd | Fraunhofer Fokus |
| Radusch, Ilja | Fraunhofer FOKUS |
Keywords: Autonomous Vehicle Navigation, Task and Motion Planning, Collision Avoidance
Abstract: Achieving safe and reliable automated driving in real-world conditions requires the ability to handle rare and unpredictable situations, commonly known as long-tail scenarios. These cases are often underrepresented in training data and remain a major challenge for conventional motion planning systems. In this work, we present VisuaLLMPlanner, a maneuver planning framework that integrates a multimodal large language model (MLLM) into the high-level decision-making loop of an automated driving pipeline. The system is triggered when the ego vehicle encounters a situation with an obstacle that cannot be resolved by a standard lane-following planner. At this point, a structured input comprising a bird’s-eye view image and a textual scene description is generated and passed to the MLLM. Rather than generating plans directly, the model selects from a discrete set of pre-generated and validated maneuver options, allowing for interpretable and structured decision-making. We evaluate our approach on the interPlan benchmark, which focuses explicitly on long-tail scenarios, and demonstrate that VisuaLLMPlanner achieves strong performance in comparison to prior LLM-based planners. The results highlight both the potential and current limitations of foundation models for high-level reasoning in automated vehicle planning.
|
| |
| 09:00-10:30, Paper ThI1I.208 | Add to My Program |
| M2R2: MultiModal Robotic Representation for Temporal Action Segmentation |
|
| Sliwowski, Daniel | TU Wien |
| Lee, Dongheui | Technische Universität Wien (TU Wien) |
Keywords: Representation Learning, Deep Learning Methods, Sensor Fusion
Abstract: Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel training strategy that enables the reuse of learned features across multiple TAS models. Our method sets a new state of the art on three robotic datasets: REASSEMBLE, (Im)PerfectPour, and JIGSAWS. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.
|
| |
| 09:00-10:30, Paper ThI1I.209 | Add to My Program |
| Adaptive Physical Human–Robot Interaction Via a Passivity-Aware Model Predictive Variable Admittance Control |
|
| Mahfouz, Dalia M. | German University in Cairo (GUC) |
| Di Lillo, Paolo | University of Cassino and Southern Lazio |
| Shehata, Omar M. | German University in Cairo (GUC) |
| Morgan, Elsayed | German University in Cairo (GUC) |
| Arrichiello, Filippo | University of Cassino and Southern Lazio |
Keywords: Physical Human-Robot Interaction, Compliance and Impedance Control, Robust/Adaptive Control
Abstract: Physical Human–Robot Interaction (pHRI) requires control frameworks that balance accuracy, compliance, and safety under variable human behaviors. This paper proposes a novel Model Predictive Variable Admittance (MPVA) framework that integrates trajectory tracking, interaction force directionality, and passivity constraints into an online real-time optimization scheme. The proposed architecture is implemented on a 7-DoF Kinova Jaco-2 robot and validated experimentally through mixed assistive and resistive modes with multiple subjects performing pHRI tasks. Results, supported by both objective metrics and a subjective NASA-TLX evaluation, show that the MPVA achieves competitive tracking accuracy and reduces physical effort with minimal passivity violations, compared with other algorithmic baselines such as fixed-gain admittance and fuzzy-based adaptive admittance. This demonstrates safe and effective human-robot physical interaction across diverse modes.
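For context, the sketch below shows the standard discrete-time admittance law (M a + D v + K x = F_ext) that variable-admittance schemes like MPVA adapt online; the fixed gains here are placeholders where MPVA's predictive layer would supply time-varying values.

```python
# Sketch: one-dimensional admittance control with semi-implicit Euler updates.
import numpy as np

def admittance_step(x, v, f_ext, M, D, K, dt):
    """Integrate the virtual mass-spring-damper one time step."""
    a = (f_ext - D * v - K * x) / M
    v = v + a * dt
    x = x + v * dt
    return x, v

x, v = 0.0, 0.0                  # deviation from nominal trajectory [m, m/s]
M, D, K, dt = 2.0, 20.0, 50.0, 0.002   # placeholder gains
for f_ext in [5.0] * 500 + [0.0] * 500:   # human pushes, then releases
    x, v = admittance_step(x, v, f_ext, M, D, K, dt)
print(f"final deviation: {x:.4f} m")       # decays back toward zero
```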
|
| |
| 09:00-10:30, Paper ThI1I.210 | Add to My Program |
| Investigating the Role of Implicit Signals in Adaptive User-Aware Human-Robot Interactions |
|
| Gucsi, Bálint | University of Southampton |
| Tuyen, Nguyen Tan Viet | University of Southampton |
| Chu, Bing | University of Southampton |
| Tarapore, Danesh | University of Southampton |
| Tran-Thanh, Long | University of Warwick |
Keywords: Social HRI, Multi-Modal Perception for HRI
Abstract: Our work investigates how social robots can act in a user-aware manner by adapting their behaviour to users' personal characteristics and preferences without unnecessarily exposing them to frustration through the robot's actions. In particular, we investigate how implicit social signals inadvertently exhibited by users (e.g. facial expressions) during interactions can be incorporated into user-aware decision-making models while accounting for the systematic limitations of implicit feedback signals (e.g. inconsistency, noise, and culture- and individual-dependence). In doing so, we develop a user-aware adaptive decision-making and learning framework for human-robot interactions, building on implicit signal processing, cue-based intent inference, and multi-armed bandit learning techniques. To evaluate our approach, we conduct a user study where participants interact with a Pepper robot in a cafeteria-style interaction scenario, with the robot providing recommendations and taking orders while adapting its behaviour to individual users. The experimental results demonstrate our proposed model's success in adapting its behaviour (i.e. conversational style) to users with different personal characteristics, while receiving 80% positive user feedback, and user questionnaire responses reporting higher perceived usefulness than baseline approaches. Questionnaire responses also illustrate positive user impressions of implicit-signal-based approaches while highlighting the importance of accounting for their limitations in learning models. In addition, we provide a dataset of over 5 hours of human and robot behaviour data extracted from multimodal recordings captured as part of our user study.
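One plausible reading of the multi-armed bandit component, sketched below under stated assumptions, is a UCB1 learner choosing among conversational styles from noisy and occasionally missing implicit-feedback rewards; the arm names and reward model are invented purely for illustration.

```python
# Sketch: UCB1 over conversational styles with noisy implicit rewards.
import numpy as np

rng = np.random.default_rng(1)
styles = ["formal", "casual", "humorous"]
true_pref = np.array([0.4, 0.7, 0.5])     # hidden per-style user preference
counts = np.zeros(len(styles))
means = np.zeros(len(styles))

for t in range(1, 501):
    if counts.min() == 0:
        arm = int(np.argmin(counts))       # play every arm once first
    else:
        ucb = means + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    # Implicit signal: noisy, and occasionally missing (fall back to a prior).
    if rng.random() > 0.1:
        r = true_pref[arm] + rng.normal(0.0, 0.2)
    else:
        r = 0.5
    counts[arm] += 1
    means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update

print("estimated best style:", styles[int(np.argmax(means))])
```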
|
| |
| 09:00-10:30, Paper ThI1I.211 | Add to My Program |
| SHAF: Small Language Model Integrated with Motion Modality for Multimodal Interaction |
|
| Ansari, Aamir Ahmad | University of Southampton |
| Tuyen, Nguyen Tan Viet | University of Southampton |
| Ramchurn, Sarvapali | University of Southampton |
Keywords: Multi-Modal Perception for HRI, Gesture, Posture and Facial Expressions, Deep Learning Methods
Abstract: Multimodal interaction plays a vital role in human–AI interaction, enabling robots or AI agents to interpret human input from multiple sensory channels and respond through diverse communication modalities. This paper introduces SHAF, an LLM-based multimodal model capable of handling text, image, and human motion as both input and output modalities across different multi-turn conversational settings. In SHAF, vector quantization is employed to convert images and human motion into an aligned set of tokens, followed by pre-training and instruction fine-tuning of a small Large Language Model (LLM) on our newly created SHAF dataset. Experimental results demonstrate that SHAF achieves competitive performance in text-to-motion and motion-to-text tasks in comparison to relevant works, while handling an additional modality and supporting a broader range of tasks. This research contributes an LLM-based multimodal approach, with the aim of fostering deeper exploration of human motion modality in LLMs within the context of HRI and related domains.
|
| |
| 09:00-10:30, Paper ThI1I.212 | Add to My Program |
| CoVAR: Co-Generation of Video and Action for Robotic Manipulation Via Multi-Modal Diffusion |
|
| Yang, Liudi | University of Freiburg |
| Bai, Yang | Ludwig Maximilian University of Munich |
| Eskandar, George | University of Stuttgart |
| Shen, Fengyi | Technical University of Munich |
| Altillawi, Mohammad | Huawei, Autonomous University of Barcelona, |
| Chen, Dong | Technische Universität München |
| Liu, Ziyuan | Huawei Group |
| Valada, Abhinav | University of Freiburg |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: We present a method to generate video–action pairs that follow text instructions, starting from an initial image observation and the robot’s joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for joint distribution modeling, which cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Experiments on public benchmarks and real-world datasets demonstrate that our method produces higher-quality videos, more accurate actions, and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.
|
| |
| 09:00-10:30, Paper ThI1I.213 | Add to My Program |
| Video-To-BT: Generating Reactive Behavior Trees from Human Demonstration Videos for Robotic Assembly |
|
| Zhao, Xiwei | Technical University of Munich |
| Wang, Yiwei | Technical University of Munich |
| Wu, Yansong | Technische Universität München |
| Wu, Fan | Shanghai University |
| Sun, Teng | Shanghai University |
| Miao, Zhonghua | Shanghai University |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
| Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Task Planning, AI-Enabled Robotics, Intelligent and Flexible Manufacturing
Abstract: Modern manufacturing demands robotic assembly systems with enhanced flexibility and reliability. However, traditional approaches often rely on programming tailored to each product by experts for fixed settings, which are inherently inflexible to product changes and lack the robustness to handle variations. As Behavior Trees (BTs) are increasingly used in robotics for their modularity and reactivity, we propose a novel hierarchical framework, Video-to-BT, that seamlessly integrates high-level cognitive planning with low-level reactive control, with BTs serving both as the structured output of planning and as the governing structure for execution. Our approach leverages a Vision-Language Model (VLM) to decompose human demonstration videos into subtasks, from which Behavior Trees are generated. During the execution, the planned BTs combined with real-time scene interpretation enable the system to operate reactively in the dynamic environment, while VLM-driven replanning is triggered upon execution failure. This closed-loop architecture ensures stability and adaptivity. We validate our framework on real-world assembly tasks through a series of experiments, demonstrating high planning reliability, robust performance in long-horizon assembly tasks, and strong generalization across diverse and perturbed conditions. Project website: https://video2bt.github.io/video2bt_page/
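To make the Behavior Tree layer concrete, the sketch below shows the standard Sequence/Fallback tick semantics that such generated trees follow, with a Fallback node providing the reactive retry-on-failure pattern; the node classes and leaf actions are illustrative assumptions, not the paper's implementation.

```python
# Sketch: minimal behavior-tree nodes with standard tick semantics.
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Sequence:
    """Tick children in order; fail fast, succeed only if all succeed."""
    def __init__(self, children): self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Fallback:
    """Try children in order until one succeeds (reactive retry node)."""
    def __init__(self, children): self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.FAILURE:
                return status
        return Status.FAILURE

class Action:
    def __init__(self, name, fn): self.name, self.fn = name, fn
    def tick(self):
        return Status.SUCCESS if self.fn() else Status.FAILURE

# e.g., one subtask decomposed from a demonstration video (hypothetical leaves):
tree = Sequence([
    Fallback([Action("grasp_peg", lambda: True),
              Action("replan", lambda: True)]),   # replan on failure
    Action("insert_peg", lambda: True),
])
print(tree.tick())
```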
|
| |
| 09:00-10:30, Paper ThI1I.214 | Add to My Program |
| Generative Adversarial Imitation Learning for Robot Swarms: Learning from Human Demonstrations and Trained Policies |
|
| Kraus, Mattes | University of Konstanz |
| Kuckling, Jonas | University of Konstanz |
Keywords: Swarm Robotics, Imitation Learning, Learning from Demonstration
Abstract: In imitation learning, robots are supposed to learn from demonstrations of the desired behavior. Most work on imitation learning for swarm robotics provides the demonstrations as rollouts of an existing policy. In this work, we provide a framework based on generative adversarial imitation learning that aims to learn collective behaviors from human demonstrations. Our framework is evaluated across six different missions, learning both from manual demonstrations and from demonstrations derived from a PPO-trained policy. Results show that the imitation learning process is able to learn qualitatively meaningful behaviors that perform comparably to the provided demonstrations. Additionally, we deploy the learned policies on a swarm of TurtleBot 4 robots in real-robot experiments. The exhibited behaviors preserved their visually recognizable character, and their performance is comparable to that achieved in simulation.
|
| |
| 09:00-10:30, Paper ThI1I.215 | Add to My Program |
| Simple Models, Real Swimming: Digital Twins for Tendon-Driven Underwater Robots |
|
| Michelis, Mike Yan | ETH Zurich |
| Obayashi, Nana | NYU |
| Hughes, Josie | EPFL |
| Katzschmann, Robert Kevin | ETH Zurich |
Keywords: Modeling, Control, and Learning for Soft Robots, Biologically-Inspired Robots, Tendon/Wire Mechanism
Abstract: Mimicking the graceful motion of swimming animals remains a core challenge in soft robotics due to the complexity of fluid-structure interaction and the difficulty of controlling soft, biomimetic bodies. Existing modeling approaches are often computationally expensive and impractical for the complex control or reinforcement learning needed for realistic motions to emerge in robotic systems. In this work, we present a tendon-driven fish robot modeled in an efficient underwater swimmer environment using a simplified, stateless hydrodynamics formulation implemented in the widespread robotics framework MuJoCo. With just two real-world swimming trajectories, we identify five fluid parameters that allow the simulation to match experimental behavior and generalize across a range of actuation frequencies. We show that this stateless fluid model can generalize to unseen actuation and outperform classical analytical models such as the elongated body theory. This simulation environment runs faster than real-time and can easily enable downstream learning algorithms such as reinforcement learning for target tracking, reaching a 93% success rate. Due to the simplicity and ease of use of the model and our open-source simulation environment, our results show that even simple, stateless models --- when carefully matched to physical data --- can serve as effective digital twins for soft underwater robots, opening up new directions for scalable learning and control in aquatic environments.
|
| |
| 09:00-10:30, Paper ThI1I.216 | Add to My Program |
| WideDepth: Millimeter-Accurate Benchmark for Fisheye Depth Estimation |
|
| Indyk, Ilia | Robotics Center |
| Penshin, Ignat | Robotics Center |
| Sosin, Ivan | Robotics Center |
| Monastyrny, Maxim | Robotics Center |
| Valenkov, Aleksei | Robotics Center |
| Makarov, Ilya | AXXX; Trusted AI Research Center, RAS |
Keywords: Data Sets for Robotic Vision, RGB-D Perception, Omnidirectional Vision
Abstract: Fisheye cameras are increasingly adopted in robotics for near-field manipulation, navigation, and immersive perception, yet indoor depth benchmarks with accurate ground truth are still missing. To address this, we introduce WideDepth — the first indoor dataset for fisheye depth estimation, featuring 101 scenes containing 5K high-resolution stereo pairs labeled with millimeter-level ground truth depth and disparity. Our dataset also includes paired pinhole and fisheye samples across varying fields of view and baselines in both horizontal and vertical stereo setups. We further propose a method to adapt pinhole-trained stereo models to fisheye images and introduce a novel stereo fisheye image generation pipeline based on high-resolution LiDAR scans. Leveraging these methods, we thoroughly evaluate state-of-the-art monocular depth, stereo matching, and depth completion models on our benchmark. Additionally, we provide 18K LiDAR-derived sparse depth training samples, achieving up to a 62% performance boost on fisheye data when fine-tuning pinhole-based stereo models. In summary, the high precision and versatility of our benchmark set a strong foundation for advancing research in fisheye depth estimation and robotics perception. Project page: ilyaind.github.io/WideDepth
|
| |
| 09:00-10:30, Paper ThI1I.217 | Add to My Program |
| CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions |
|
| Yang, Lizhi | California Institute of Technology |
| Werner, Blake | California Institute of Technology |
| de Sa, Massimiliano | California Institute of Technology |
| Ames, Aaron | California Institute of Technology |
Keywords: Reinforcement Learning, Robot Safety, Machine Learning for Robot Control
Abstract: Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety—traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy—both enforcing safer actions and biasing towards safer rewards—enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
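For intuition, the sketch below shows the standard closed-form, single-constraint CBF safety filter that CBF-RL builds on, applied to a toy 2D single integrator avoiding a circular obstacle; the dynamics, barrier function, and gain are illustrative assumptions, not the paper's humanoid setup.

```python
# Sketch: minimum-norm CBF filter for x_dot = u with h(x) = ||x - obs||^2 - r^2.
import numpy as np

def cbf_filter(u_nom, x, obs, r, alpha=1.0):
    """min ||u - u_nom||^2  s.t.  dh/dt + alpha*h >= 0 (closed-form solution)."""
    h = np.dot(x - obs, x - obs) - r**2
    grad_h = 2.0 * (x - obs)               # for x_dot = u: dh/dt = grad_h @ u
    psi = grad_h @ u_nom + alpha * h
    if psi >= 0:
        return u_nom                        # nominal action is already safe
    return u_nom - psi * grad_h / (grad_h @ grad_h)

x = np.array([1.0, 0.0])
u = cbf_filter(u_nom=np.array([-1.0, 0.0]), x=x, obs=np.zeros(2), r=0.5)
print(u)   # the unsafe component toward the obstacle is attenuated
```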
|
| |
| 09:00-10:30, Paper ThI1I.218 | Add to My Program |
| Bio-Inspired Tail Oscillation Enables Fast Crawling on Deformable Granular Terrains |
|
| Liu, Shipeng | University of Southern California |
| Sagare, Meghana | University of Southern California |
| Patil, Shubham | University of Southern California |
| Qian, Feifei | University of Southern California |
Keywords: Biologically-Inspired Robots, Field Robots, Humanoid and Bipedal Locomotion
Abstract: Deformable substrates such as sand and mud present significant challenges for terrestrial robots due to complex robot-terrain interactions. Inspired by mudskippers, amphibious animals that naturally adjust their tail morphology and movement jointly to navigate such environments, we investigate how tail design and control can jointly enhance flipper-driven locomotion on granular media. Using a bio-inspired robot modeled after the mudskipper, we experimentally compared locomotion performance between idle and actively oscillating tail configurations and found that tail oscillation increased forward speed by 17% while reducing body drag by 46%. Shear force measurements revealed that this improvement arises from oscillation-induced fluidization of the substrate, which lowers resistive forces acting on the body. Additionally, tail morphology strongly influenced the oscillation strategy: designs with larger horizontal surface areas leveraged the oscillation-induced reduction in shear resistance more effectively by limiting insertion depth. Based on these findings, we present a design principle to inform tail action selection based on substrate strength and tail morphology. Our results offer new insights into tail design and control for improving robot locomotion on deformable substrates, with implications for agricultural robotics, search and rescue, and environmental exploration.
|
| |
| 09:00-10:30, Paper ThI1I.219 | Add to My Program |
| View Synthesis and 6DoF Pose Estimation in mmWave Radar Neural Radiance Fields |
|
| Kamari, Ahmad | George Mason University |
| Kumar, Hemant | George Mason University |
| Wu, Nan | George Mason University |
| Hajrasouliha, Amirreza | George Mason University |
| Han, Bo | George Mason University |
| Pathak, Parth | George Mason University |
Keywords: Localization, Mapping, AI-Based Methods
Abstract: Estimating a device’s 6DoF pose (i.e., location and orientation) within the environment is a fundamental problem in robotics and beyond. Millimeter-wave (mmWave) radars have emerged as an attractive alternative to optical sensors (e.g., RGB cameras) in these tasks due to their ability to operate in poor lighting and adverse conditions such as smoke and fog. This paper presents mmNeRF, a view synthesis and 6DoF pose estimation system based on neural radiance fields (NeRF) designed specifically for mmWave radars. mmNeRF requires only radar range-angle heatmaps collected in a given environment to construct its implicit neural representation, ensuring multi-view consistency and producing high-quality view synthesis. It then builds a 6DoF pose estimation framework that queries the neural model with particle filters to perform scan-matching operations and yield an accurate 6DoF pose. We evaluate mmNeRF using over 50K radar frames collected in six different indoor environments on a handheld rig equipped with a radar. Our results show that mmNeRF achieves median translation and rotation errors of 0.34 m and 17.15° for single-spectrum 6DoF pose estimation and an Absolute Trajectory Error (ATE) of 0.66 m and 0.81 radians for continuous 6DoF pose tracking, considerably outperforming state-of-the-art solutions.
|
| |
| 09:00-10:30, Paper ThI1I.220 | Add to My Program |
| M-VTOP: Modular Visuo-Tactile Object Pose Estimation for High-Precision Robotic Manipulation |
|
| Oller, Miquel | University of Michigan |
| Qian, Qiyang | University of California at Berkeley |
| Corcodel, Radu | Mitsubishi Electric Research Laboratories |
| Jain, Siddarth | Mitsubishi Electric Research Laboratories (MERL) |
Keywords: Computer Vision for Automation, Sensor-based Control, Assembly
Abstract: Accurate object pose estimation is essential for robotic manipulation, particularly in tasks involving small or geometrically intricate objects where high precision is required. Existing vision-, tactile-, and hybrid-based approaches struggle with occlusion, noise, and limited generalization, often requiring extensive retraining or large annotated datasets. In this work, we present M-VTOP, a modular framework for in-hand object pose estimation that integrates vision, tactile, and contact sensing in a flexible manner, allowing robustness against noisy or missing modalities. At the core of the framework is a belief-based particle filter that fuses heterogeneous sensor observations, maintains probabilistic estimates, and continuously refines them toward high-precision convergence in closed-loop robotic control with pose-estimation feedback. A mask-based observation representation unifies visual and tactile signals into geometry-centric inputs, enhancing robustness to texture and lighting variations while supporting zero-shot generalization. The framework requires only an object’s CAD model and avoids task-specific retraining. Experiments show that M-VTOP achieves sub-millimeter accuracy under complex geometries, occlusions, and tight tolerances, demonstrating its promise for high-precision robotic manipulation.
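A schematic version of such a belief-based fusion step is sketched below: per-modality likelihoods multiply into the particle weights, and a missing modality simply drops out of the update. All likelihood models and numbers are placeholder assumptions, not the paper's.

```python
# Sketch: particle-filter measurement update fusing heterogeneous modalities.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
particles = rng.normal(0.0, 0.01, size=(N, 3))   # candidate (x, y, yaw) poses
weights = np.full(N, 1.0 / N)

def gaussian_like(err, sigma):
    return np.exp(-0.5 * (err / sigma) ** 2)

def update(weights, likelihoods):
    for like in likelihoods:           # robust to missing modalities:
        if like is not None:           # skip sensors with no reading
            weights = weights * like
    weights = weights + 1e-300         # avoid total degeneracy
    return weights / weights.sum()

vision_err = np.linalg.norm(particles[:, :2] - np.array([0.004, -0.002]), axis=1)
tactile_err = np.abs(particles[:, 2] - 0.01)
weights = update(weights, [gaussian_like(vision_err, 0.005),
                           gaussian_like(tactile_err, 0.02),
                           None])      # e.g., contact sensor dropped this frame
estimate = weights @ particles         # posterior-mean pose
print(estimate)
```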
|
| |
| 09:00-10:30, Paper ThI1I.221 | Add to My Program |
| DREAM: Domain-Aware Reasoning for Efficient Autonomous Underwater Monitoring |
|
| Wu, Zhenqi | University of South Florida |
| Modi, Abhinav | University of Maryland, College Park |
| Mavrogiannis, Angelos | University of Maryland, College Park |
| Joshi, Kaustubh | University of Maryland College Park |
| Chopra, Nikhil | University of Maryland, College Park |
| Aloimonos, Yiannis | University of Maryland, College Park |
| Karapetyan, Nare | Woods Hole Oceanographic Institution |
| Rekleitis, Ioannis | University of Delaware |
| Lin, Xiaomin | University of South Florida |
Keywords: Marine Robotics, Perception-Action Coupling, Environment Monitoring and Management
Abstract: The ocean is warming and acidifying, increasing the risk of mass mortality events for temperature-sensitive shellfish such as oysters. This motivates the development of long-term monitoring systems. However, human labor is costly and long-duration underwater work is highly hazardous, thus favoring robotic solutions as a safer and more efficient option. To enable underwater robots to make real-time, environment-aware decisions without human intervention, we must equip them with an intelligent “brain.” This highlights the need for persistent, wide-area, and low-cost benthic monitoring. To this end, we present DREAM, a Vision Language Model (VLM)-guided autonomy framework for long-term underwater exploration and habitat monitoring. The results show that our framework is highly efficient in finding and exploring target objects (e.g., oysters, shipwrecks) without prior location information. In the oyster-monitoring task, our framework takes 31.5% less time than the previous baseline for the same number of oysters. Compared to the vanilla VLM, it uses 23% fewer steps while covering 8.88% more oysters. In shipwreck scenes, our framework successfully explores and maps the wreck without collisions, requiring 27.5% fewer steps than the vanilla model and achieving 100% coverage, while the vanilla model achieves 60.23% average coverage in our shipwreck environments.
|
| |
| 09:00-10:30, Paper ThI1I.222 | Add to My Program |
| Balancing Deployment Costs in Multi-Robot Task Assignment |
|
| Wilde, Nils | Dalhousie University |
| Alonso-Mora, Javier | Delft University of Technology |
Keywords: Multi-Robot Systems, Planning, Scheduling and Coordination, Path Planning for Multiple Mobile Robots or Agents
Abstract: Multi-Robot Task Assignment (MRTA) studies the problem of allocating spatially distributed tasks to a fleet of cooperative robots as well as determining the optimal task sequence for each robot. Common objectives include minimizing task waiting times, minimizing robot tour lengths, and maximizing the number of serviced tasks within given time windows. However, this does not consider an equitable distribution of the workload among the fleet. Yet, uneven workloads are often undesirable, since they can yield solutions in which a few robots service most tasks while parts of the fleet remain underused. On the other hand, under fully balanced workloads, robots may insufficiently account for the total operation cost and can thus be deployed in a redundant manner. In this paper, we study MRTA from the viewpoint of multi-objective optimization (MOO), formulating the problem as simultaneously minimizing the costs of individual robot tours. We explore how this treatment allows for attaining more balanced solutions than common formulations using the sum or maximum of tour costs. We present a generalist formulation using a scalar objective and establish theoretical guarantees on the attainable multi-objective trade-offs. Further, we derive an effective heuristic based on a p-norm of tour lengths that is able to find balanced workloads among robots. Our approach is agnostic to the specific choice of MRTA solver, and we provide insights into how it can be incorporated into two state-of-the-art algorithms. We demonstrate our approach in experiments for offline and online MRTA setups, including servicing tasks as well as pickup and delivery, and highlight its advantages with respect to balanced workloads compared to state-of-the-art formulations.
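The balancing effect of a p-norm over tour costs is easy to see on a toy example: p = 1 recovers the total cost, large p approaches the makespan (maximum tour), and intermediate p trades the two off. The candidate allocations below are made-up numbers, purely for illustration.

```python
# Sketch: how the p-norm objective ranks unbalanced vs. balanced allocations.
import numpy as np

def pnorm_cost(tour_costs, p):
    return np.linalg.norm(np.asarray(tour_costs, dtype=float), ord=p)

allocations = {
    "unbalanced": [9.0, 1.0, 1.0, 1.0],   # one robot does almost everything
    "balanced":   [3.5, 3.0, 3.0, 3.0],   # slightly larger total, even load
}
for p in (1, 2, 4, np.inf):
    scores = {k: pnorm_cost(v, p) for k, v in allocations.items()}
    best = min(scores, key=scores.get)
    print(f"p={p}: {scores} -> prefer {best}")
# The sum (p=1) prefers the unbalanced plan; any p > 1 increasingly favors balance.
```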
|
| |
| 09:00-10:30, Paper ThI1I.223 | Add to My Program |
| MINT: A Vision-Based Soft Sensor for Mutual Integration of Normal Interaction Force and Texture Perception |
|
| Rafiee Javazm, Mohammad | University of Texas at Austin |
| Kapuria, Siddhartha | University of Texas at Austin |
| Kara, Ozdemir Can | University of Texas at Austin |
| Kiehler, Sonika | The University of Texas at Austin |
| Hamada, Rami | University of Texas at Austin |
| Ivatury, Joga | The University of Texas at Austin |
| Alambeigi, Farshid | University of Texas at Austin |
Keywords: Medical Robots and Systems, Force and Tactile Sensing, Soft Sensors and Actuators
Abstract: Inspired by the design of Vision-based Tactile Sensors (VTSs) and soft resistive strain sensors, in this paper, we introduce MINT: a vision-based soft sensor for Mutual Integration of Normal interaction force and Texture perception. MINT is a hybrid vision-based tactile sensor that simultaneously integrates normal force measurement with high-resolution texture perception. This unique sensor utilizes a soft resistive strain sensor between the Gel Layer and Mirror Layer of a typical VTS. By combining electrical and visual sensing modalities, MINT overcomes the limitations of existing resistive sensors and VTSs, offering a robust, efficient, and scalable solution for direct measurement of force and texture capture. To evaluate MINT’s functionality, we first propose a unique design and fabrication procedure. Next, we conduct a series of experiments, evaluating its force and texture sensing capabilities through interactions with various rigid objects.
|
| |
| 09:00-10:30, Paper ThI1I.224 | Add to My Program |
| Notes-To-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks |
|
| Haresh, Sanjay | Qualcomm AI Research |
| Dijkman, Daniel | Qualcomm AI Research |
| Bhattacharyya, Apratim | Qualcomm AI Research |
| Roland, Memisevic | Qualcomm AI Research |
Keywords: Deep Learning Methods, Machine Learning for Robot Control
Abstract: Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.
|
| |
| 09:00-10:30, Paper ThI1I.225 | Add to My Program |
| LapSurgie: Humanoid Robots Performing Surgery Via Teleoperated Handheld Laparoscopy |
|
| Liang, Zekai | University of California, San Diego |
| Liang, Xiao | University of California San Diego |
| Atar, Soofiyan | University of California San Diego |
| Das, Sreyan | University of California, San Diego |
| Chiu, Zoe | Cornell University |
| Zhang, Peihan | University of California San Diego |
| Joyce, Calvin | University of California, San Diego |
| Richter, Florian | University of California, San Diego |
| Liu, Shanglei | University of California, San Diego |
| Yip, Michael C. | University of California, San Diego |
Keywords: Surgical Robotics: Laparoscopy, Medical Robots and Systems, Humanoid Robot Systems
Abstract: Robotic laparoscopic surgery has gained increasing attention in recent years for its potential to deliver more efficient and precise minimally invasive procedures. However, adoption of surgical robotic platforms remains largely confined to high-resource medical centers, exacerbating healthcare disparities in rural and low-resource regions. To close this gap, a range of solutions has been explored, from remote mentorship to fully remote telesurgery. Yet, the practical deployment of surgical robotic systems to underserved communities remains an unsolved challenge. Humanoid systems offer a promising path toward deployability, as they can directly operate in environments designed for humans without extensive infrastructure modifications -- including operating rooms. In this work, we introduce LapSurgie, the first humanoid-robot-based laparoscopic teleoperation framework. The system leverages an inverse-mapping strategy for manual-wristed laparoscopic instruments that abides to remote center-of-motion constraints, enabling precise hand-to-tool control of off-the-shelf surgical laparoscopic tools without additional setup requirements. A control console equipped with a stereo vision system provides real-time visual feedback. Finally, a comprehensive user study across platforms demonstrates the effectiveness of the proposed framework and provides initial evidence for the feasibility of deploying humanoid robots in laparoscopic procedures.
|
| |
| 09:00-10:30, Paper ThI1I.226 | Add to My Program |
| UniFField: A Generalizable Unified Neural Feature Field for Visual, Semantic, and Spatial Uncertainties in Any Scene |
|
| Maurer, Christian | Technische Universität Darmstadt |
| Jauhri, Snehal | TU Darmstadt |
| Lueth, Sophie | Technical University of Darmstadt |
| Chalvatzaki, Georgia | Technische Universität Darmstadt |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, RGB-D Perception
Abstract: Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, it is necessary for the robot to evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero-shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating uncertainty estimates. We show that these uncertainty estimates accurately describe the model's prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage spatial and semantic feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.
|
| |
| 09:00-10:30, Paper ThI1I.227 | Add to My Program |
| COMPASS: Cross-embOdiment Mobility Policy Via ResiduAl RL and Skill Synthesis |
|
| Liu, Wei | Nvidia |
| Zhao, Huihua | Georgia Tech |
| Li, Chenran | University of California, Berkeley |
| Deng, Yuchen | Nvidia |
| Biswas, Joydeep | The University of Texas at Austin |
| Chang, Yan | Nvidia |
| Pouya, Soha | Stanford University |
Keywords: Reinforcement Learning, Vision-Based Navigation
Abstract: As robots are increasingly deployed in diverse application domains, enabling robust mobility across different embodiments has become a critical challenge. Classical mobility stacks, though effective on specific platforms, require extensive per-robot tuning and do not scale easily to new embodiments. Learning-based approaches, such as imitation learning (IL), offer alternatives but face a significant limitation: the need for high-quality demonstrations for each embodiment. To address these challenges, we introduce COMPASS, a unified framework that enables scalable cross-embodiment mobility using expert demonstrations from only a single embodiment. We first pre-train a mobility policy on a single robot using IL, combining a world model with a policy model. We then apply residual reinforcement learning (RL) to efficiently adapt this policy to diverse embodiments through corrective refinements. Finally, we distill the specialist policies into a single generalist policy conditioned on an embodiment embedding vector. This design significantly reduces the burden of collecting data while enabling robust generalization across a wide range of robot designs. Our experiments demonstrate that COMPASS scales effectively across diverse robot platforms while maintaining adaptability to various environment configurations, achieving a generalist policy with a success rate approximately 5X higher than the pre-trained IL policy, and further demonstrating zero-shot sim-to-real transfer.
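A minimal sketch of the residual composition at the heart of this recipe, with a frozen base policy and a small embodiment-conditioned correction, is shown below; the module sizes and clipping scale are assumptions, not the paper's architecture.

```python
# Sketch: residual RL on top of a frozen pre-trained IL policy.
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, base_policy: nn.Module, obs_dim: int,
                 embed_dim: int, act_dim: int, scale: float = 0.3):
        super().__init__()
        self.base = base_policy
        for p in self.base.parameters():
            p.requires_grad_(False)           # keep pre-trained IL policy frozen
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + embed_dim + act_dim, 128), nn.Tanh(),
            nn.Linear(128, act_dim), nn.Tanh(),
        )
        self.scale = scale                    # bound the correction magnitude

    def forward(self, obs, embodiment_embed):
        a_base = self.base(obs)
        delta = self.residual(torch.cat([obs, embodiment_embed, a_base], -1))
        return a_base + self.scale * delta    # corrective refinement

base = nn.Sequential(nn.Linear(32, 2), nn.Tanh())    # stand-in IL policy
policy = ResidualPolicy(base, obs_dim=32, embed_dim=8, act_dim=2)
a = policy(torch.randn(4, 32), torch.randn(4, 8))
print(a.shape)   # torch.Size([4, 2])
```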
|
| |
| 09:00-10:30, Paper ThI1I.228 | Add to My Program |
| Preference-Conditioned Reinforcement Learning for Space-Time Efficient Online 3D Bin Packing |
|
| Sarawgi, Nikita | University of Southern California |
| Manyar, Omey Mohan | University of Southern California |
| Wang, Fan | Amazon Robotics |
| Nguyen, Thinh | Amazon Robotics |
| Seita, Daniel | University of Southern California |
| Gupta, Satyandra K. | University of Southern California |
Keywords: Logistics, Industrial Robots, Deep Learning Methods
Abstract: Robotic bin packing is widely deployed in warehouse automation, with current systems achieving robust performance through heuristic and learning-based strategies. These systems must balance compact placement with rapid execution, where selecting alternative items or reorienting them can improve space utilization but introduce additional time. We propose a selection-based formulation that explicitly reasons over this trade-off: at each step, the robot evaluates multiple candidate actions, weighing expected packing benefit against estimated operational time. This enables time-aware strategies that selectively accept increased operational time when it yields meaningful spatial improvements. Our method, STEP (Space-Time Efficient Packing), uses a preference-conditioned, Transformer-based reinforcement learning policy, and allows generalization across candidate set sizes and integration with standard placement modules. It achieves a 44% reduction in operational time without compromising packing density. Additional material is available at https://step-packing.github.io.
|
| |
| 09:00-10:30, Paper ThI1I.229 | Add to My Program |
| Multi-Robot Trajectory Planning Via Constrained Bayesian Optimization and Local Cost Map Learning with STL-Based Conflict Resolution |
|
| Raxit, Sourav | University of New Orleans |
| Redwan Newaz, Abdullah Al | University of New Orleans |
| Fuentes, Jose | Florida International University |
| Padrao, Paulo | Providence College |
| Cavalcanti, Ana | Florida International University |
| Bobadilla, Leonardo | Florida International University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Integrated Planning and Learning, Motion and Path Planning
Abstract: We address multi-robot motion planning under Signal Temporal Logic (STL) specifications with kinodynamic constraints. Exact approaches face scalability bottlenecks and limited adaptability, while conventional sampling-based methods require excessive samples to construct optimal trajectories. We propose a two-stage framework integrating sampling-based online learning with formal STL reasoning. At the single-robot level, our constrained Bayesian Optimization-based Tree search (cBOT) planner uses Gaussian processes as surrogate models to learn local cost maps and feasibility constraints, generating shorter collision-free trajectories with fewer samples. At the multi-robot level, our STL-enhanced Kinodynamic Conflict-Based Search (STL-KCBS) algorithm incorporates STL monitoring into conflict detection and resolution, ensuring specification satisfaction while maintaining scalability and probabilistic completeness. Benchmarking demonstrates improved trajectory efficiency and safety over existing methods. Real-world experiments with autonomous surface vehicles validate robustness and practical applicability in uncertain environments. The STLcBOT Planner will be released as an open-source package, and videos of real-world and simulated experiments are available at https://stlbot.github.io/.
|
| |
| 09:00-10:30, Paper ThI1I.230 | Add to My Program |
| Reachable Predictive Control: A Novel Control Algorithm for Nonlinear Systems with Unknown Dynamics and Its Practical Applications |
|
| Shafa, Taha | University of Illinois Urbana Champaign |
| Meng, Yiming | University of Illinois Urbana-Champaign |
| Ornik, Melkior | University of Illinois Urbana-Champaign |
Keywords: Autonomous Vehicle Navigation, Planning under Uncertainty, Motion and Path Planning
Abstract: This paper proposes an algorithm capable of driving a system to follow a piecewise linear trajectory without prior knowledge of the system dynamics. Motivated by a critical failure scenario in which a system can experience an abrupt change in its dynamics, we demonstrate that it is possible to follow a set of waypoints comprised of states analytically proven to be reachable despite not knowing the system dynamics. The proposed algorithm first applies small perturbations to locally learn the system dynamics around the current state, then computes the set of states that are provably reachable using the locally learned dynamics and their corresponding maximum growth-rate bounds, and finally synthesizes a control action that navigates the system to a guaranteed reachable state.
|
| |
| 09:00-10:30, Paper ThI1I.231 | Add to My Program |
| Feedforward Pressure-Regulated Position Control of Soft Ballooning Actuators Using a Modified Prandtl–Ishlinskii Model |
|
| Sowaruth, Nashil | University of Sussex |
| Herzig, Nicolas | University of Sussex |
Keywords: Modeling, Control, and Learning for Soft Robots, Hydraulic/Pneumatic Actuators, Calibration and Identification
Abstract: Thanks to their compliance and adaptability, soft actuators are promising devices for medical applications and the exploration of unstructured environments. However, their nonlinear behaviour, including strong hysteresis effects, presents challenges for accurate position control. This work investigates two data-driven feedforward control strategies for controlling the position of a Hyper-Elastic Ballooning Membrane Actuator (HBMA): a baseline single polynomial fit model and a hysteresis-aware Modified Prandtl-Ishlinskii (MPI) model. Comparative experiments demonstrate that hysteresis-aware control substantially improves accuracy. Specifically, incorporating hysteresis improved overall accuracy by 71% when the HBMA was inflated up to 20.5 mm. During partial inflation-deflation cycles, the MPI controller achieved a mean error of 0.685 mm, corresponding to 9.8% of the 7 mm displacement range. These results highlight the limitations of using feedforward control alone in soft robotic actuation while emphasising the benefits of hysteresis-aware modelling. The findings contribute to the ongoing effort to develop effective control strategies for soft robotic systems.
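For reference, the classical Prandtl-Ishlinskii model superimposes weighted backlash ("play") operators, as sketched below; the thresholds and weights here are invented, whereas in practice they would be identified from inflation-deflation data, and the MPI variant further modifies the operator structure.

```python
# Sketch: classical Prandtl-Ishlinskii hysteresis via play operators.
import numpy as np

def pi_model(u, thresholds, weights, y0=0.0):
    """Return output y[t] for input u[t] as a weighted sum of play operators."""
    states = np.full(len(thresholds), y0)
    y = np.empty_like(u)
    for t, ut in enumerate(u):
        # Each play operator tracks u inside a dead-band of radius r.
        states = np.clip(states, ut - thresholds, ut + thresholds)
        y[t] = weights @ states
    return y

u = np.concatenate([np.linspace(0, 1, 100), np.linspace(1, 0.3, 70),
                    np.linspace(0.3, 1, 70)])        # partial cycle input
r = np.array([0.0, 0.1, 0.2, 0.4])                    # play radii (assumed)
w = np.array([0.5, 0.3, 0.15, 0.05])                  # operator weights (assumed)
y = pi_model(u, r, w)
# Hysteresis: different outputs near the same input, depending on the branch.
print(f"unloading at u~0.65: y={y[135]:.3f}, reloading at u~0.65: y={y[205]:.3f}")
```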
|
| |
| 09:00-10:30, Paper ThI1I.232 | Add to My Program |
| IntuFly: Intuitive Continuous Hand–Gaze Control for UAVs |
|
| Xu, Junsheng | Southeast University |
| Ma, Ke | Southeast University |
| Li, Xinde | Southeast University |
| Yu, Chengxiang | Southeast University |
| Zhang, Zeyu | Southeast University |
| Zhang, Zhentong | Southeast University |
Keywords: Human-Centered Robotics, Gesture, Posture and Facial Expressions, Aerial Systems: Perception and Autonomy
Abstract: Operating Unmanned Aerial Vehicles (UAVs) remains challenging for non-experts because single-modality interfaces distort intent: gesture-only systems depend on discrete vocabularies and mode switches that break continuity and raise cognitive load, while gaze-only control offers limited dimensionality and is vulnerable to Midas-touch and saccadic jitter. We present IntuFly, an intuition-driven hand-gaze framework in which the hands draw the path, giving continuous 3D translation, and the eyes set heading and lock targets, preserving intent continuity and reducing effort. To overcome cross-stream asynchrony and noise, our deployment-oriented fusion layer performs timestamp-consistent late fusion with stale-frame dropping and lightweight stabilization, yielding stable closed-loop operation at more than 25 Hz on commodity hardware. In simulation racing, novices fly faster on shorter paths than with a remote-controller (RC) baseline, and intermediates select shorter, smoother, yet more conservative lines; subjective scales indicate lower workload and higher usability. In mobile target tracking, adding gaze produces faster responses with near-complete line-of-sight (LOS) coverage under identical limits. The same perception-control stack runs stably on an indoor DJI Tello platform with behavior consistent with simulation, demonstrating sim-to-real feasibility. These results show that IntuFly lowers the learning barrier for non-expert users while preserving fine control and stability, offering a deployable path toward intuitive, continuous human-UAV cooperative flight. Our code is publicly available at https://github.com/Crotonbee/IntuFly.
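A minimal sketch of timestamp-consistent late fusion with stale-frame dropping, as described for the fusion layer, might look as follows; the message format, staleness threshold, and smoothing constant are assumptions for illustration.

```python
# Sketch: fuse hand and gaze streams only when their frames are fresh.
import time

MAX_AGE = 0.10      # drop frames older than 100 ms (assumed threshold)
ALPHA = 0.3         # exponential smoothing to damp saccadic jitter (assumed)

class LateFusion:
    def __init__(self):
        self.latest = {"hand": None, "gaze": None}   # stream -> (stamp, value)
        self.smoothed_yaw = 0.0

    def push(self, stream, stamp, value):
        self.latest[stream] = (stamp, value)

    def step(self, now):
        """Fuse only fresh streams; return None when the hand stream is stale."""
        hand, gaze = self.latest["hand"], self.latest["gaze"]
        if hand is None or now - hand[0] > MAX_AGE:
            return None                    # no continuous translation input
        vx, vy, vz = hand[1]               # hands draw the 3D path
        if gaze is not None and now - gaze[0] <= MAX_AGE:
            # eyes set heading; smooth the yaw command against jitter
            self.smoothed_yaw = (1 - ALPHA) * self.smoothed_yaw + ALPHA * gaze[1]
        return vx, vy, vz, self.smoothed_yaw

fusion = LateFusion()
now = time.monotonic()
fusion.push("hand", now - 0.02, (0.2, 0.0, 0.1))   # fresh hand frame: used
fusion.push("gaze", now - 0.30, 0.8)               # stale gaze frame: ignored
print(fusion.step(now))
```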
|
| |
| 09:00-10:30, Paper ThI1I.233 | Add to My Program |
| GIFT: Geometry-Induced Functional Transfer for Category-Level Object Manipulation |
|
| de Farias, Cristiana | TU Darmstadt |
| Figueredo, Luis | University of Nottingham (UoN) |
| Laha, Riddhiman | Technical University of Munich |
| Adjigble, Komlan Jean Maxime | University of Birmingham |
| Tamadazte, Brahim | CNRS |
| Stolkin, Rustam | University of Birmingham |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
| Marturi, Naresh | University of Birmingham |
Keywords: Grasping, Task and Motion Planning, Learning from Demonstration
Abstract: Robotic manipulation of unfamiliar objects in new environments is challenging due to limited generalisation capabilities. We propose a new skill transfer framework, GIFT (Geometry-Induced Functional Transfer), which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions. By leveraging the Functional Maps (FMC) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates screw interpolation (ScLERP) for generating smooth, geometrically-aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.
|
| |
| 09:00-10:30, Paper ThI1I.234 | Add to My Program |
| A Compact Rotary Series Elastic Actuator with Wide Deflection Range and Linear Torque Response for pHRI Applications |
|
| Eraky, Mohamed | Stevens Institute of Technology |
| Li, Andy | Stevens Institute of Technology |
| Gebre, Biruk | Stevens Institute of Technology |
| Pochiraju, Kishore | Stevens Institute of Technology |
| Zanotto, Damiano | Stevens Institute of Technology |
Keywords: Physical Human-Robot Interaction, Wearable Robotics, Prosthetics and Exoskeletons
Abstract: This paper presents the design and characterization of a new series elastic actuator (SEA) for physical human-robot interaction (pHRI) featuring a compact spring mechanism. The spring mechanism consists of ten compression springs, fitted on output rotors and arranged in a curved formation. The compression springs are enclosed in spring chambers featured in input rotors. This design reduces frictional losses and enables all springs to bear load bidirectionally with minimal preload relative to conventional designs that rely on antagonistic spring arrangements, thereby enhancing deflection range and torque capacity. We introduce the SEA design and experimentally characterize the passive torque-deflection curve and the closed-loop torque tracking bandwidth. Bench testing demonstrates a torque capacity of 18 Nm and a maximum stiffness of 43.8 Nm/rad. As a representative application, the SEA is integrated into an ankle exoskeleton, with the spring mechanism co-located at the ankle joint. Treadmill walking tests with the exoskeleton indicate good torque tracking performance, with a root-mean-square error of 1.48 Nm when applying 12% assistance, corresponding to a peak torque of 17.6 Nm.
|
| |
| 09:00-10:30, Paper ThI1I.235 | Add to My Program |
| SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video |
|
| Huang, Howard | Nokia Bell Labs |
| Surianarayanan, Bharath | Nokia Bell Labs |
| Lee, Keifer | Nokia Bell Labs |
| Wang, Chenyu | ShanghaiTech University |
| Feng, Chen | New York University |
Keywords: Mapping, Manufacturing, Maintenance and Supply Chains, SLAM
Abstract: Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55m by 7m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8cm with respect to ground-truth.
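The Manhattan-grid prior mentioned above can be pictured as a soft constraint: snap each reconstructed structure point to the nearest axis-aligned grid position and penalize the deviation inside the structure-from-motion optimization. A minimal sketch with assumed grid spacing and function names:

```python
import numpy as np

def snap_to_manhattan_grid(points, spacing):
    """Nearest grid position for each (n, 3) structure point, e.g. shelf corners."""
    s = np.asarray(spacing, dtype=float)   # (sx, sy, sz) grid pitch per axis
    return np.round(points / s) * s

def manhattan_residual(points, spacing):
    """Deviation from the grid; usable as a soft penalty in bundle adjustment."""
    return points - snap_to_manhattan_grid(points, spacing)
```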
|
| |
| 09:00-10:30, Paper ThI1I.236 | Add to My Program |
| Built Different: Tactile Perception to Overcome Cross-Embodiment Capability Differences in Collaborative Manipulation |
|
| van den Bogert, William | University of Michigan |
| Iyengar, Madhavan | Carnegie Mellon University |
| Fazeli, Nima | University of Michigan |
Keywords: Force and Tactile Sensing, Representation Learning, Hardware-Software Integration in Robotics
Abstract: Tactile sensing is a widely studied means of implicit communication between robot and human. In this paper, we investigate how tactile sensing can help bridge differences between robotic embodiments in the context of collaborative manipulation. For a robot, learning and executing force-rich collaboration require compliance to human interaction. While compliance is often achieved with admittance control, many commercial robots lack the joint torque monitoring needed for such control. To address this challenge, we present an approach that uses tactile sensors and behavior cloning to transfer policies from robots with these capabilities to those without. We train a single policy that demonstrates positive transfer across embodiments, including robots without torque sensing. We demonstrate this positive transfer on four different tactile-enabled embodiments using the same policy trained on force-controlled robot data. Across multiple proposed metrics, the best performance came from a decomposed tactile shear-field representation combined with a pre-trained encoder, which improved success rates over alternative representations.
|
| |
| 09:00-10:30, Paper ThI1I.237 | Add to My Program |
| Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization |
|
| Ngo, Quang Long Ho | École Polytechnique Fédérale De Lausanne |
| Xia, Zimin | École Polytechnique Fédérale De Lausanne (EPFL) |
| Alahi, Alexandre | EPFL |
Keywords: Localization, Computer Vision for Transportation
Abstract: Current cross-view localization methods predominantly rely on satellite imagery as the aerial modality. Although recent work explores planimetric maps (e.g., OpenStreetMap tiles), these approaches often lag in performance. Yet both modalities are widely available and possess complementary properties. Satellite images are closer to ground-level camera imagery, offering finer detail, whereas planimetric maps contain annotated objects (e.g., streetlamps) and remain informative in areas where the ground is occluded, such as by foliage. Despite this, only one prior work provides an end-to-end method to fuse the two modalities, and it does not demonstrate their potential within state-of-the-art methods. To combine the strengths of both modalities, we propose a new fusion module that augments standard encoders and demonstrates that integrating satellite imagery with planimetric maps improves state-of-the-art single-modality methods. The module comprises (i) cross-modal conditioning, which processes each modality’s encoding with awareness of the other, and (ii) a patch-level fusion rule that controls the granularity of information exchange. We achieve state-of-the-art results, reducing the mean localization error by 30.13%. Qualitatively, the fusion adaptively selects the more informative modality, improving overall accuracy.
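One plausible shape for such a fusion module, sketched in PyTorch: a learned per-patch gate mixes the two modality encodings, and a conditioning layer lets each encoding see the other. All module and dimension names here are invented for illustration; the paper's architecture may differ.

```python
import torch
import torch.nn as nn

class PatchGatedFusion(nn.Module):
    """Patch-level fusion of satellite and planimetric-map embeddings."""

    def __init__(self, dim):
        super().__init__()
        self.condition = nn.Linear(2 * dim, dim)    # cross-modal conditioning
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, sat, osm):                    # (batch, patches, dim) each
        joint = torch.cat([sat, osm], dim=-1)
        g = self.gate(joint)                        # per-patch trust in satellite
        fused = g * sat + (1.0 - g) * osm           # adaptive modality mixing
        return fused + self.condition(joint)        # conditioned refinement
```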
|
| |
| 09:00-10:30, Paper ThI1I.238 | Add to My Program |
| COBALT: Crowdsourcing Robot Learning Via Cloud-Based Teleoperation with Smartphones |
|
| Agarwal, Ayush | Georgia Institute of Technology |
| Gandhi, Ansh | University of California, Berkeley |
| Collins, Jeremy | Georgia Institute of Technology |
| Rayyan, Omar | University of California, Los Angeles |
| Sarswat, Aryan | Georgia Institute of Technology |
| Koushik, Ranjani | Georgia Institute of Technology |
| Moghani, Masoud | University of Toronto |
| Mandlekar, Ajay Uday | NVIDIA |
| Garg, Animesh | Georgia Institute of Technology |
Keywords: Telerobotics and Teleoperation, Data Sets for Robot Learning, Imitation Learning
Abstract: The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency. We demonstrate concurrent support for 256 clients across 8 GPUs, underscoring the system’s ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit cobalt-teleop.github.io for more details.
|
| |
| 09:00-10:30, Paper ThI1I.239 | Add to My Program |
| Relevance for Human Robot Collaboration |
|
| Zhang, Xiaotong | Massachusetts Institute of Technology |
| Huang, Dingcheng | Massachusetts Institute of Technology |
| Youcef-Toumi, Kamal | Massachusetts Institute of Technology |
Keywords: Service Robotics, AI-Enabled Robotics
Abstract: Inspired by the human ability to selectively focus on relevant information, this paper introduces relevance, a novel dimensionality reduction process that enables robots to identify the relevant elements in a scene and generate responses that are seamless, fast, and accurate. To accurately and efficiently quantify relevance, we developed an event-based framework that maintains a continuous perception of the scene, evaluates cue sufficiency within the scene, and selectively triggers relevance determination. Within this framework, we developed a probabilistic methodology that considers various factors and is built on a novel structured scene representation. Both simulations and experimental results demonstrate the effectiveness of our relevance concept, as well as the proposed framework and methods for relevance quantification. Simulation results demonstrate that the relevance framework and methodology accurately predict relevance in a general Human Robot Collaboration (HRC) setup, achieving a precision of 0.99, a recall of 0.94, an F1 score of 0.96, and an object ratio of 0.94. Relevance demonstrates broad benefits across multiple aspects of HRC, yielding a 79.56% reduction in task planning time compared with a state-of-the-art (SOTA) task planner for a cereal task, a 26.53% decrease in perception latency for object detection, an improvement of up to 13.50% in HRC safety, and an 80.84% reduction in the number of inquiries required during collaboration. A real-world demonstration highlights the effectiveness of the relevance framework, together with its modules, in providing intelligent and seamless assistance to humans during everyday tasks.
|
| |
| 09:00-10:30, Paper ThI1I.240 | Add to My Program |
| From Language to Deployment: Offline Optimization and Ontology-Guided Behavior Tree Generation for Transparent Robot Applications |
|
| Wu, Ruichao | Fraunhofer IPA |
| Pan, Jiwei | University of Stuttgart |
| Youssef, Mohamed | University of Stuttgart |
| Kahl, Bjoern | Fraunhofer IPA |
| Kraus, Werner | Fraunhofer IPA |
| Morozov, Andrey | University of Stuttgart |
Keywords: Control Architectures and Programming, Software Architecture for Robotic and Automation, Optimization and Optimal Control
Abstract: Automatically generating robot applications from natural language promises to lower the barrier to automation, but remains difficult in domains that demand reliability and transparency, such as industrial assembly or collaborative manipulation. End-to-end policies and large language model (LLM)-based planners can map instructions to robot behaviors, but they often lack interpretability and provide limited assurance of correctness. We present a framework that composes applications from modular, application-independent atomic skills expressed as Behavior Trees (BTs). BTs are constructed and validated against an ontology-level dual graph to enforce control-flow and data-flow consistency before execution, ensuring transparency and structural correctness. Application-level parameters are optimized offline in simulation using Monte Carlo Tree Search guided by LLM-derived priors. Rather than serving as a runtime optimizer, this process systematically explores interdependent parameters, producing a dataset of reliable parameterizations that can support future gating mechanisms for online adaptation. The framework is validated in a physical robotic setup, demonstrating transparent and consistent offline generation of deployable applications, and laying the foundation for adaptive, real-time systems.
|
| |
| 09:00-10:30, Paper ThI1I.241 | Add to My Program |
| RL-Augmented Adaptive Model Predictive Control for Bipedal Locomotion Over Challenging Terrain |
|
| Kamohara, Junnosuke | Georgia Institute of Technology |
| Wu, Feiyang | Georgia Institute of Technology |
| Wamorkar, Chinmayee | Georgia Institute of Technology |
| Hutchinson, Seth | Northeastern University |
| Zhao, Ye | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Optimization and Optimal Control
Abstract: Model predictive control (MPC) has demonstrated effectiveness for humanoid bipedal locomotion; however, its applicability in challenging environments, such as rough and slippery terrain, is limited by the difficulty of modeling terrain interactions. In contrast, reinforcement learning (RL) has achieved notable success in training robust locomotion policies over diverse terrain, yet it lacks guarantees of constraint satisfaction and often requires substantial reward shaping. Recent efforts in combining MPC and RL have shown promise of taking the best of both worlds, but they are primarily restricted to flat terrain or quadrupedal robots. In this work, we propose an RL-augmented MPC framework tailored for bipedal locomotion over rough and slippery terrain. Our method parametrizes three key components of single-rigid-body-dynamics-based MPC: system dynamics, swing leg controller, and gait frequency. We validate our approach through bipedal robot simulations in NVIDIA IsaacLab across various terrains, including stairs, stepping stones, and low-friction surfaces. Experimental results demonstrate that our RL-augmented MPC framework produces significantly more adaptive and robust behaviors compared to baseline MPC and RL. Project page: https://rl-augmented-mpc.github.io/rlaugmentedmpc/
|
| |
| 09:00-10:30, Paper ThI1I.242 | Add to My Program |
| Find the Fruit: Zero-Shot Sim2Real RL for Occlusion-Aware Plant Manipulation |
|
| Subedi, Nitesh | Iowa State University |
| Yang, Hsin-Jung | Iowa State University |
| Jha, Devesh K. | Mitsubishi Electric Research Laboratories |
| Sarkar, Soumik | Iowa State University |
Keywords: Agricultural Automation, Manipulation Planning, Reinforcement Learning
Abstract: Autonomous harvesting in the open presents a complex manipulation problem. In most scenarios, an autonomous system has to deal with significant occlusion and must interact in the presence of large structural uncertainties (every plant is different). Perceptual and modeling uncertainty make the design of reliable manipulation controllers for harvesting challenging, resulting in poor performance during deployment. We present a sim2real reinforcement learning (RL) framework for occlusion-aware plant manipulation, where a policy is learned entirely in simulation to reposition stems and leaves to reveal target fruit(s). In our proposed approach, we decouple high-level kinematic planning from low-level compliant control which simplifies the sim2real transfer. This decomposition allows the learned policy to generalize across multiple plants with different stiffness and morphology. In experiments with multiple real-world plant setups, our system achieves up to 86.7% success in exposing target fruits, demonstrating robustness to occlusion variation and structural uncertainty.
|
| |
| 09:00-10:30, Paper ThI1I.243 | Add to My Program |
| Centralized Periodic Planning under Asynchronous Communication for Multi-Agent Monitoring |
|
| Fornos, David | Texas A&M University |
| Rossi, Federico | Jet Propulsion Laboratory - California Institute of Technology |
| Shell, Dylan | Texas A&M University |
| Selva, Daniel | Texas A&M University |
Keywords: Planning under Uncertainty, Motion and Path Planning, Cooperating Robots
Abstract: This paper examines the problem of coordinating the observations of multiple agents constrained to periodic trajectories that communicate asynchronously with a central planner. We are motivated by settings such as active monitoring missions tracking stochastic and spatially spreading events like wildfires or flooding, where a rapid response is essential and the spatial extent can be large. In such cases, "always-on" networking may be infeasible and continuous coordination may be prohibitively costly. Periodic trajectories are a natural constraint for relevant classes of systems, e.g., UAV swarms that cycle around recharging stations or Earth observation satellite constellations; moreover, these lead to recurring communication opportunities with compute-capable infrastructure. We introduce the Multi-Agent Asynchronous Periodic Partially Observable MDP (MA-APPOMDP), a new planning framework that formalizes asynchronous check-in times and centralized but delayed information flow. We propose two algorithms tailored to this new model: the Asynchronous Belief Branching Algorithm (ABBA), which performs exact belief branching over unknown observations, and SB-ABBA, a sampling-based approximation where scalability is prioritized over exactness. Empirical results on different wildfire event monitoring problems show that our methods consistently achieve higher event coverage and lower detection delay than several heuristic and planning baselines, with SB-ABBA scaling to larger problem instances.
|
| |
| 09:00-10:30, Paper ThI1I.244 | Add to My Program |
| In-The-Wild Compliant Manipulation with UMI-FT |
|
| Choi, Hojung | Stanford University |
| Hou, Yifan | Stanford University |
| Pan, Chuer | Stanford University |
| Hong, Seongheon | Stanford University |
| Patel, Austin | Stanford University |
| Xu, Xiaomeng | Stanford University |
| Cutkosky, Mark | Stanford University |
| Song, Shuran | Stanford University |
Keywords: Learning from Demonstration, Deep Learning in Grasping and Manipulation, Dexterous Manipulation
Abstract: Many manipulation tasks require careful force modulation. With insufficient force the task may fail, while excessive force could cause damage. The high cost, bulky size and fragility of commercial force/torque (F/T) sensors have limited large-scale, force-aware policy learning. We introduce UMI-FT, a handheld data-collection platform that mounts compact, six-axis force/torque sensors on each finger, enabling finger-level wrench measurements alongside RGB, depth, and pose. Using the multimodal data collected from this device, we train an adaptive compliance policy that predicts position targets, grasp force, and stiffness for execution on standard compliance controllers. In evaluations on three contact-rich, force-sensitive tasks (whiteboard wiping, skewering zucchini, and lightbulb insertion), UMI-FT enables policies that reliably regulate external contact forces and internal grasp forces, outperforming baselines that lack compliance or force sensing. UMI-FT offers a scalable path to learning compliant manipulation from in-the-wild demonstrations. We open-source the hardware and software to facilitate broader adoption at: https://umi-ft.github.io/
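On the execution side, a policy that outputs position targets, grasp force, and stiffness plugs into a standard Cartesian compliance law. A minimal sketch assuming critically damped per-axis gains; the paper's controller details may differ:

```python
import numpy as np

def compliance_wrench(x, x_target, v, stiffness, damping_ratio=1.0):
    """Commanded force from a policy-supplied target and stiffness."""
    k = np.asarray(stiffness, dtype=float)   # per-axis stiffness (N/m)
    d = 2.0 * damping_ratio * np.sqrt(k)     # critically damped by default
    return k * (x_target - x) - d * v

# Example: a soft downward press along z while wiping a whiteboard.
f_cmd = compliance_wrench(x=np.zeros(3),
                          x_target=np.array([0.0, 0.0, -0.01]),
                          v=np.zeros(3),
                          stiffness=np.array([400.0, 400.0, 150.0]))
```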
|
| |
| 09:00-10:30, Paper ThI1I.245 | Add to My Program |
| Two-Time-Scale Composite Learning Online Identification and Control for Compliant-Joint Robots |
|
| Shi, Tian | Southeast University |
| Liu, Lin | Sun Yat-Sen University |
| Wang, Qian | Sun Yat-Sen University |
| Su, Jinya | Southeast University |
| Li, Shihua | Southeast University |
| Pan, Yongping | Southeast University |
Keywords: Robust/Adaptive Control, Compliant Joints and Mechanisms, Calibration and Identification
Abstract: Singular perturbation (SP)-based synthesis yields two-time-scale control that allows compliant-joint robots to achieve high-quality tracking at low implementation cost. Composite learning enables exact online identification and control of robots without the stringent condition known as persistent excitation (PE). However, to achieve exact online identification for compliant-joint robots, the parameter update derived from SP-based synthesis and composite learning requires physically unavailable states. This paper presents a novel SP-based composite learning robot control (SP-CLRC) strategy for compliant-joint robots that achieves exact online identification and control without requiring access to physically unavailable states. In the proposed method, link-side and actuator-side parameters are estimated separately, enabling exact online identification using available robot states. A two-time-scale composite learning method is proposed to guarantee practical exponential stability of the closed-loop system with parameter convergence under interval excitation, a condition strictly weaker than PE. Experiments on a two-degree-of-freedom robot driven by series elastic actuators have shown that the proposed SP-CLRC significantly outperforms the baseline in online identification and tracking accuracy.
|
| |
| 09:00-10:30, Paper ThI1I.246 | Add to My Program |
| Visual-Auditory Proprioception of Soft Finger Shape and Contact |
|
| Guo, Qinsong | Carnegie Mellon University |
| Yang, Ke | New York University |
| Zhao, Hanwen | New York University |
| Fang, Haohan | New York University |
| Wang, Haoxuan | New York University |
| Feng, Chen | New York University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Sensors and Actuators, Deep Learning for Visual Perception
Abstract: Soft robotic fingers require precise proprioception of both global deformation and local contact to enable safe and dexterous manipulation. Vision-based methods can reconstruct overall shape but struggle under severe occlusion, while audio-only approaches provide complementary cues but lack spatial detail. We present DeepCoFi, a lightweight multimodal proprioception framework that fuses internal camera images with acoustic spectrograms to jointly recover finger geometry and contact. The framework leverages the complementary strengths of vision and acoustics and employs a FoldingNet-based two-stage decoder that first reconstructs global bending and then refines local contact deformations. To support this integration, we introduce a soft finger design that incorporates an exoskeleton-mounted camera and microphone in a single molding step, preserving compliance while enabling multimodal sensing. Experiments on a comprehensive dataset and real-world grasping tasks show that DeepCoFi achieves robust proprioception under occlusion and generalizes effectively to unseen deformations and contact conditions. Open-source resources and project updates are available at https://ai4ce.github.io/DeepCoFi/.
|
| |
| 09:00-10:30, Paper ThI1I.247 | Add to My Program |
| Tiny-DroNeRF: Tiny Neural Radiance Fields Aboard Federated Learning-Enabled Nano-Drones |
|
| Carboni, Ilenia | IDSIA, USI-SUPSI |
| Cereda, Elia | USI and SUPSI |
| Lamberti, Lorenzo | ETH Zurich & IDSIA, USI-SUPSI |
| Malpetti, Daniele | IDSIA, USI-SUPSI |
| Conti, Francesco | University of Bologna |
| Palossi, Daniele | ETH Zurich & IDSIA, USI-SUPSI |
Keywords: Embedded Systems for Robotic and Automation, Deep Learning for Visual Perception, Aerial Systems: Perception and Autonomy
Abstract: Sub-30 g nano-sized aerial robots can leverage their agility and form factor to explore cluttered and narrow environments, like in industrial inspection and search and rescue missions. However, the price of their tiny size is severely limited resources, i.e., sub-100 mW microcontroller units (MCUs) delivering ~100 GOps/s at best, and memory budgets well below 100 MB. Despite these strict constraints, we aim to enable complex vision-based tasks aboard nano-drones, such as dense 3D scene reconstruction: a key robotic task underlying fundamental capabilities like spatial awareness and motion planning. Top-performing 3D reconstruction methods leverage neural radiance field (NeRF) models, which require GBs of memory and massive computation, usually delivered by high-end GPUs consuming hundreds of watts. Our work introduces Tiny-DroNeRF, a lightweight NeRF model based on Instant-NGP and optimized for running on a GAP9 ultra-low-power (ULP) MCU aboard our nano-drones. Then, we further empower our Tiny-DroNeRF by leveraging a collaborative federated learning scheme, which distributes the model training among multiple nano-drones. Our experimental results show a 95% reduction in Tiny-DroNeRF's memory footprint compared to Instant-NGP with only a 17% drop in reconstruction accuracy. Finally, our federated learning scheme allows Tiny-DroNeRF to train with an amount of data otherwise impossible to keep in a single drone's memory, increasing the overall reconstruction accuracy. Ultimately, our work combines, for the first time, NeRF training on a ULP MCU with federated learning on nano-drones.
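The federated scheme presumably aggregates per-drone parameters in a FedAvg-like manner. A minimal sketch assuming sample-count weighting; the aggregation rule actually used aboard the GAP9 fleet may differ:

```python
import numpy as np

def federated_average(client_params, client_samples):
    """FedAvg-style aggregation across drones.

    client_params:  one list of np.ndarray parameter tensors per drone
    client_samples: number of training samples contributed by each drone
    """
    n = np.asarray(client_samples, dtype=float)
    mix = n / n.sum()                                   # data-weighted mixing
    return [sum(p[i] * w for p, w in zip(client_params, mix))
            for i in range(len(client_params[0]))]
```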
|
| |
| 09:00-10:30, Paper ThI1I.248 | Add to My Program |
| S3LAM: Surfel Splatting SLAM for Geometrically Accurate Tracking and Mapping |
|
| Fan, Ruoyu | Tsinghua University |
| Wen, Yu-Hui | Beijing Jiaotong University |
| Zhang, Tao | Pudu Technology Ltd |
| Zeng, Long | Tsinghua University |
| Liu, Yong-Jin | Tsinghua University |
Keywords: SLAM, Localization, Mapping
Abstract: We propose S3LAM, a novel RGB-D SLAM system that leverages 2D surfel splatting to achieve geometrically accurate scene representations for simultaneous tracking and mapping. Unlike existing 3DGS-based SLAM approaches that rely on 3D Gaussian ellipsoids, we utilize 2D Gaussian surfels as primitives for more efficient scene representation. By focusing on the surfaces of objects in the scene, this design enables S3LAM to reconstruct high-quality geometry, benefiting both mapping and tracking. To address inherent SLAM challenges including real-time optimization under limited viewpoints, we introduce a novel adaptive surface rendering strategy that improves mapping accuracy while maintaining computational efficiency. We further derive camera pose Jacobians directly from 2D surfel splatting formulation, highlighting the importance of our geometrically accurate representation that improves tracking convergence. Extensive experiments on both synthetic and real-world datasets demonstrate that S3LAM achieves state-of-the-art performance. Our code is available at https://github.com/FanryZ/S3LAM.
|
| |
| 09:00-10:30, Paper ThI1I.249 | Add to My Program |
| Formal Safety Verification and Refinement for Generative Motion Planners Via Certified Local Stabilization |
|
| Nath, Devesh | Georgia Institute of Technology |
| Yin, Haoran | Georgia Institute of Technology |
| Chou, Glen | Georgia Institute of Technology |
Keywords: Robot Safety, Formal Methods in Robotics and Automation, Machine Learning for Robot Control
Abstract: We present a method for formal safety verification of learning-based generative motion planners. Generative motion planners (GMPs) offer advantages over traditional planners, but verifying the safety and dynamic feasibility of their outputs is difficult since neural network verification (NNV) tools scale only to a few hundred neurons, while GMPs often contain millions. To preserve GMP expressiveness while enabling verification, our key insight is to imitate the GMP by stabilizing references sampled from the GMP with a small neural tracking controller and then applying NNV to the closed-loop dynamics. This yields reachable sets that rigorously certify closed-loop safety, while the controller enforces dynamic feasibility. Building on this, we construct a library of verified GMP references and deploy them online in a way that imitates the original GMP distribution whenever it is safe to do so, improving safety without retraining. We evaluate across diverse planners, including diffusion, flow matching, and vision-language models, improving safety in simulation (on ground robots and quadcopters) and on hardware (differential-drive robot).
|
| |
| 09:00-10:30, Paper ThI1I.250 | Add to My Program |
| PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models |
|
| Zhu, Wang Bill | University of Southern California |
| Chai, Miaosen | University of Southern California |
| Singh, Ishika | University of Southern California |
| Jia, Robin | University of Southern California |
| Thomason, Jesse | University of Southern California |
Keywords: Integrated Planning and Learning, Task Planning, Planning under Uncertainty
Abstract: We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)-based symbolic planners. However, these approaches have primarily focused on text-based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree-structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM-V increases the plan success rate from 37% (Claude-3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and OverCooked-AI, show that PSALM-V improves step efficiency and succeeds in domain induction in multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions for real-world robot BlocksWorld tasks, despite low-level manipulation failures from the robot. Videos and resources at https://psalmv.github.io/.
|
| |
| 09:00-10:30, Paper ThI1I.251 | Add to My Program |
| Kungfubot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control |
|
| Han, Jinrui | Shanghai Jiao Tong University |
| Xie, Weiji | Shanghai Jiao Tong University |
| Zheng, Jiakun | East China University of Science and Technology |
| Shi, Jiyuan | Institute of Artificial Intelligence (TeleAI), China Telecom |
| Zhang, Weinan | Shanghai Jiao Tong University |
| Xiao, Ting | East China University of Science and Technology |
| Bai, Chenjia | Institute of Artificial Intelligence (TeleAI), China Telecom |
Keywords: Humanoid and Bipedal Locomotion, Whole-Body Motion Planning and Control, Reinforcement Learning
Abstract: Learning versatile whole-body skills by tracking various human motions is a fundamental step toward general-purpose humanoid robots. This task is particularly challenging because a single policy must master a broad repertoire of motion skills while ensuring stability over long-horizon sequences. To this end, we present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy. Our framework integrates a hybrid tracking objective that balances local motion fidelity with global trajectory consistency, and an Orthogonal Mixture-of-Experts (OMoE) architecture that encourages skill specialization while enhancing generalization across motions. A segment-level tracking reward is further introduced to relax rigid step-wise matching, enhancing robustness when handling global displacements and transient inaccuracies. We validate VMS extensively in both simulation and real-world experiments, demonstrating accurate imitation of dynamic skills, stable performance over minute-long sequences, and strong generalization to unseen motions. These results highlight the potential of VMS as a scalable foundation for versatile humanoid whole-body control. The project page is available at kungfubot2-humanoid.github.io.
|
| |
| 09:00-10:30, Paper ThI1I.252 | Add to My Program |
| SLNet: A Super-Lightweight Geometry-Adaptive Network for 3D Point Cloud Recognition |
|
| Mohammad, Saeid | Sirjan University of Technology |
| Salarpour, Amir | Clemson University |
| MohajerAnsari, Pedram | Clemson University |
| Pesé, Mert D. | Clemson University |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Vision-Based Navigation
Abstract: We present SLNet, a lightweight backbone for 3D point cloud recognition designed to achieve strong performance without the computational cost of many recent attention-, graph-, and deep-MLP-based models. The model is built on two simple ideas: NAPE (Nonparametric Adaptive Point Embedding), which captures spatial structure using a combination of Gaussian RBF and cosine bases with input-adaptive bandwidth and blending, and GMU (Geometric Modulation Unit), a per-channel affine modulator that adds only 2D learnable parameters (a scale and a shift per channel). These components are used within a four-stage hierarchical encoder with FPS+kNN grouping, nonparametric normalization, and shared residual MLPs. In experiments, SLNet shows that a very small model can still remain highly competitive across several 3D recognition tasks. On ModelNet40, SLNet-S with 0.14M parameters and 0.31 GFLOPs achieves 93.64% overall accuracy, outperforming PointMLP-elite with 5× fewer parameters, while SLNet-M with 0.55M parameters and 1.22 GFLOPs reaches 93.92%, exceeding PointMLP with 24× fewer parameters. On ScanObjectNN, SLNet-M achieves 84.25% overall accuracy, within 1.2 percentage points of PointMLP, while using 28× fewer parameters. For large-scale scene segmentation, SLNet-T extends the backbone with local Point Transformer attention and reaches 58.2% mIoU on S3DIS Area 5 with only 2.5M parameters, more than 17× fewer than Point Transformer V3. We also introduce NetScore+, which extends NetScore by incorporating latency and peak memory so that efficiency can be evaluated in a more deployment-oriented way. Across multiple benchmarks and hardware settings, SLNet delivers a strong overall balance between accuracy and efficiency. Code is available at: https://github.com/m-saeid/SLNet.
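Both building blocks admit compact sketches. The anchors, frequencies, and fixed blend below are illustrative stand-ins for the paper's input-adaptive versions:

```python
import numpy as np

def nape_embed(points, centers, sigma, freqs, alpha=0.5):
    """Point embedding mixing Gaussian RBF and cosine bases.

    points (n, 3), centers (k, 3), freqs (k, 3); alpha blends the bases
    (fixed here, input-adaptive in the paper).
    """
    d2 = ((points[:, None, :] - centers[None]) ** 2).sum(-1)   # (n, k) sq. dists
    rbf = np.exp(-d2 / (2.0 * sigma ** 2))
    cos = np.cos(points @ freqs.T)                             # (n, k)
    return alpha * rbf + (1.0 - alpha) * cos

def gmu(features, scale, shift):
    """Per-channel affine modulation: two learnable parameters per channel."""
    return features * scale + shift
```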
|
| |
| 09:00-10:30, Paper ThI1I.253 | Add to My Program |
| Dual Quaternion Based Compliant Movement Primitives for Deformable Object Manipulation |
|
| Samai, Amir | University of Le Havre |
| Thomas, John | Institut Pascal |
| Alkhatib, Mohammad | Université Clermont Auvergne |
| Ozgur, Erol | SIGMA-Clermont / Institut Pascal |
| Mezouar, Youcef | Institut Pascal |
Keywords: Learning from Demonstration, Bimanual Manipulation
Abstract: Learning from demonstration effectively transfers human manipulation skills to robots. It can be especially useful for imitating industrial manipulation tasks which are performed by humans and are difficult to model such as deformable object manipulation. Manipulation of deformable objects often requires not only accurate tracking of the demonstration trajectory using a robot end-effector, but also the accommodation of interaction forces. Precise tracking of such trajectories while ignoring these interaction forces leads to overly stiff, unsafe, or unsuccessful executions. We address this problem by proposing Dual Quaternion based Compliant Movement Primitives (DQ-CMP). DQ-CMP couples a dual-quaternion based Dynamic Movement Primitive for compact 6-DoF pose encoding with learnable wrench primitives. This combination reproduces synchronized motion and force behaviors directly at the end-effector. The method is robot-agnostic and singularity-free at the representation-level, as it operates in operational space using dual quaternions. From a few demonstrations, required wrenches for unseen initial configurations are predicted using Gaussian process regression defined on the pose manifold. This enables generalization of the learned wrenches across different starting poses. We validate the method on real-robot experiments including a shoe-sole detachment for recycling and bending of stiff foam inside a box. Results show compliant, safe task execution and successful generalization to new initial poses.
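The generalization step amounts to regressing wrench parameters from the starting pose. A minimal scikit-learn sketch with placeholder data; note the paper defines the GP on the pose manifold rather than on flat vectors as done here:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_demo = rng.random((5, 8))   # 5 demos: dual-quaternion initial poses (flattened)
y_demo = rng.random((5, 6))   # associated wrench parameters (Fx, Fy, Fz, Tx, Ty, Tz)

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_demo, y_demo)
wrench_pred, wrench_std = gpr.predict(rng.random((1, 8)), return_std=True)
```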
|
| |
| 09:00-10:30, Paper ThI1I.254 | Add to My Program |
| Behavior Cloning-Enhanced Deep Reinforcement Learning for Robot Navigation Via Dynamic Reward and Adaptive Replanning |
|
| Chi, Jianning | Northeastern University |
| Li, Fusheng | Northeastern University (China) |
| Zhang, Wenjun | University of Saskatchewan |
| Yang, Yongming | Shenyang Institute of Automation |
Keywords: Foundations of Automation, Planning, Scheduling and Coordination, Discrete Event Dynamic Automation Systems
Abstract: Deep reinforcement learning (DRL) is a core technology for mobile robot navigation in diverse environments, yet existing Behavior Cloning (BC)-enhanced DRL methods suffer from two critical challenges: fixed imitation constraints suppress autonomous exploration in late training stages despite stabilizing early learning, and goal-obstacle avoidance task conflicts impede robust action selection during navigation. To address these issues, this paper proposes an Adaptive Strategy Deep Reinforcement Learning (ADRL) method, which reformulates BC as a progressively released transitional constraint and builds a stage-aware transition framework for robot navigation. Specifically, ADRL dynamically fuses Twin Delayed Deep Deterministic Policy Gradient (TD3) with BC via a value-driven imitation scheduling mechanism, which adaptively modulates the expert-online data mixing ratio and BC regularization strength based on critic feedback to accelerate convergence and realize a smooth shift from imitation-dominant to exploration-driven learning. A phase-aligned dynamic weight composite reward function is designed, which embeds motion constraints and stage-aware priority adjustment to mitigate reward sparsity and align learning objectives with policy maturity. Additionally, a lightweight adaptive replanning mechanism is developed as an evaluation stabilizer, which generates obstacle-avoiding waypoints by obstacle density when the robot stagnates, resolving goal-obstacle avoidance conflicts without al
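The imitation-to-exploration transition can be pictured as a TD3+BC-style actor objective whose behavior-cloning weight is scheduled from critic feedback. A hedged sketch assuming generic actor/critic callables; the scheduling of bc_weight happens elsewhere:

```python
import torch

def adrl_actor_loss(actor, critic, states, expert_actions, bc_weight):
    """TD3+BC-flavored objective: imitation dominates early, RL late."""
    actions = actor(states)
    q = critic(states, actions)
    lam = 1.0 / q.abs().mean().detach()              # Q-scale normalization
    bc_term = ((actions - expert_actions) ** 2).mean()
    return -lam * q.mean() + bc_weight * bc_term     # anneal bc_weight over training
```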
|
| |
| 09:00-10:30, Paper ThI1I.255 | Add to My Program |
| BridgeTA: Bridging the Representation Gap in Knowledge Distillation Via Teacher Assistant for Bird’s Eye View Map Segmentation |
|
| Kim, Beomjun | Yonsei University |
| Woo, Suhan | Yonsei University |
| Heo, Sejong | Hyundai Motor Company |
| Kim, Euntai | Yonsei University |
Keywords: Computer Vision for Automation, Computer Vision for Transportation, Recognition
Abstract: Bird’s Eye View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher’s architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost‑effective distillation framework to bridge the representation gap between LC fusion and Camera‑only models through a Teacher Assistant (TA) network while keeping the student’s architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young’s Inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods. The code will be released upon acceptance.
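One plausible form of the Young's Inequality step, writing T, A, and S for teacher, TA, and student features: for any \varepsilon > 0,

```latex
\|T - S\|^2 \;\le\; (1+\varepsilon)\,\|T - A\|^2 \;+\; \Bigl(1+\tfrac{1}{\varepsilon}\Bigr)\,\|A - S\|^2 ,
```

so minimizing the teacher-TA and TA-student terms upper-bounds the direct teacher-student distillation loss.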
|
| |
| 09:00-10:30, Paper ThI1I.256 | Add to My Program |
| TacTip-Based Dynamic Contact Force Estimation with Sequential Tactile Images and Its Applications to Robotic Force Tracking |
|
| Xie, Wantong | Shanxi University |
| Lu, Zhenyu | South China University of Technology |
| Liu, Jingyang | Shanxi University |
| Yang, Jialong | South China University of Technology; Peng Cheng Laboratory |
| Chen, Lu | Shanxi University |
| Yang, Chenguang | University of Liverpool |
Keywords: Force and Tactile Sensing, Human-Robot Collaboration
Abstract: Force estimation is crucial for robotics, human-machine interaction, and industrial automation. However, traditional methods are often hindered by high cost, mechanical wear, and limited accuracy in dynamic scenarios. Vision-based tactile sensing provides a promising alternative, yet existing approaches commonly rely on static calibration and degrade under dynamic interactions such as slip. To overcome these limitations, we present a novel force prediction framework for TacTip sensors, termed the Frame-stack Force Prediction Method (FFPM). The framework integrates a Dynamic Tactile Flow Encoder to capture spatiotemporal features, enabling accurate modeling of dynamic force variations. An Exponentially Weighted Residual Correction strategy is further introduced to refine predictions by leveraging historical residuals, yielding smoother and more reliable force estimation. The predicted forces are incorporated into a force-tracking impedance control scheme, achieving precise tracking during slip interactions. Experiments on our constructed dataset demonstrate state-of-the-art performance, reducing MAPE to 12.54%, and further validate the effectiveness of the proposed framework in real-world dynamic force estimation and control.
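The residual-correction idea admits a compact sketch: blend recent prediction errors with exponentially decaying weights and add the result to the raw prediction. The smoothing factor and normalization below are assumptions, not the paper's exact rule:

```python
import numpy as np

def ew_residual_correction(pred, residuals, beta=0.3):
    """Correct a raw force prediction with exponentially weighted residuals.

    residuals: history of (measured - predicted) errors, newest last;
    beta: larger values trust recent residuals more.
    """
    r = np.asarray(residuals, dtype=float)
    ages = np.arange(len(r))[::-1]          # newest sample has age 0
    w = beta * (1.0 - beta) ** ages
    return pred + float(w @ r) / max(w.sum(), 1e-9)
```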
|
| |
| 09:00-10:30, Paper ThI1I.257 | Add to My Program |
| Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation |
|
| Gong, Zimu | University of Michigan-Ann Arbor |
| Zhang, Brian Zhaoning | University of Waterloo |
| Zhang, Chris | Waabi / University of Toronto |
| Wong, Kelvin | University of Toronto |
| Urtasun, Raquel | University of Toronto |
Keywords: Autonomous Vehicle Navigation, Learning from Demonstration, Simulation and Animation
Abstract: Safety-critical scenarios are essential for the development of autonomous vehicles (AVs) but are rare in real-world driving data. While simulation offers a way to generate such scenarios, manually designed test cases lack scalability, and adversarial optimization often produces unrealistic behaviors. In this work, we introduce a conditional latent flow matching approach for scalable and realistic safety-critical scenario generation. Our method uses distribution matching to transform nominal scenes into safety-critical rollouts. Furthermore, we demonstrate that incorporating both simulation and real-world data enables our framework to efficiently generate diverse, data-driven scenarios. Experimental results highlight that our approach is able to more consistently and realistically generate novel safety-critical scenarios, making it a valuable tool for training and benchmarking AV systems.
|
| |
| 09:00-10:30, Paper ThI1I.258 | Add to My Program |
| Sonar–GPS Fusion for Seabed Mapping in Turbid Shallow Waters with an Autonomous Surface Vehicle |
|
| Zhang, Yisheng | University of Maryland, College Park |
| Xu, Michael | University of Maryland, College Park |
| Williams, Alan | University of Maryland Center for Environmental Science |
| Gray, Matthew | University of Maryland Center for Environmental Science |
| Karapetyan, Nare | Woods Hole Oceanographic Institution |
| Yu, Miao | University of Maryland, College Park |
Keywords: Marine Robotics, Mapping, Robotics and Automation in Agriculture and Forestry
Abstract: Accurate seabed mapping is essential for habitat monitoring and infrastructure inspection. In turbid, shallow coastal waters, such as shellfish aquaculture farms, the effectiveness of traditional optical methods is limited. Autonomous surface vehicles (ASVs) equipped with forward-looking sonar (FLS) offer a promising alternative. However, existing sonar-based systems struggle to achieve fine-resolution mapping due to low-resolution positioning measurements and drift that accumulates over long trajectories. In this paper, we present a drift-resilient seabed mapping framework that integrates local FLS frame alignment using the Fourier–Mellin transform (FMT) with global trajectory optimization based on an extended Kalman filter (EKF) that fuses global positioning system (GPS), inertial measurement unit (IMU), and compass data. A variance-based image blending strategy is used to further reduce visual artifacts in overlapping regions. Field trials on a structured oyster farm site show that our framework reduces drift, lowering RMSE by 9.5% relative to the FMT-only baseline. This framework also enables sub-meter reconstruction accuracy and preservation of high-resolution textures needed for oyster inventory estimation within the mapped areas.
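The translational core of Fourier-Mellin registration is plain phase correlation; the full FMT additionally resamples the spectra to log-polar coordinates so rotation and scale become translations recovered the same way. A minimal NumPy sketch under that simplification:

```python
import numpy as np

def phase_correlation(frame_a, frame_b):
    """Integer-pixel shift between two equally sized sonar frames."""
    Fa, Fb = np.fft.fft2(frame_a), np.fft.fft2(frame_b)
    cps = Fa * np.conj(Fb)
    cps /= np.maximum(np.abs(cps), 1e-12)          # normalized cross-power spectrum
    corr = np.abs(np.fft.ifft2(cps))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = frame_a.shape                           # unwrap shifts past half-size
    return (dy - h if dy > h // 2 else dy,
            dx - w if dx > w // 2 else dx)
```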
|
| |
| 09:00-10:30, Paper ThI1I.259 | Add to My Program |
| PMG: Parameterized Motion Generator for Human-Like Locomotion Control |
|
| Han, Chenxi | Tsinghua University |
| Min, Yuheng | Tsinghua University |
| Huang, Zihao | Zerith |
| Hong, Ao | Tsinghua University |
| Liu, Hang | University of Michigan |
| Cheng, Yi | Tsinghua University |
| Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Keywords: Humanoid and Bipedal Locomotion, Human and Humanoid Motion Analysis and Synthesis, Whole-Body Motion Planning and Control
Abstract: Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain. In particular, while low-level motion tracking and trajectory-following controllers are mature, whole-body reference–guided methods are difficult to adapt to higher-level command interfaces and diverse task contexts: they require large, high-quality datasets, are brittle across speed and pose regimes, and are sensitive to robot-specific calibration. To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameterized motion data together with high-dimensional control commands. Combined with an imitation-learning pipeline and an optimization-based sim-to-real motor parameter identification module, we validate the complete approach on our humanoid prototype ZERITH Z1 and show that, within a single integrated system, PMG produces natural, human-like locomotion, responds precisely to high-dimensional control inputs—including VR-based teleoperation—and enables efficient, verifiable sim-to-real transfer. Together, these results establish a practical, experimentally validated pathway toward natural and deployable humanoid control. Website: https://pmg-icra26.github.io/
|
| |
| 09:00-10:30, Paper ThI1I.260 | Add to My Program |
| SafeNet: A Neural-Symbolic Network for Safe Planning in Robotic Systems Using Formal Method-Guided LLM Fine-Tuning |
|
| Wang, Zifan | Syracuse University |
| Fan, Jialiang | University of Notre Dame |
| Zuo, Rui | Syracuse University |
| Qiu, Qinru | Syracuse University |
| Kong, Fanxin | University of Notre Dame |
Keywords: AI-Enabled Robotics, Task Planning, Robot Safety
Abstract: Robotic systems present unique safety challenges due to their complex integration of computational and physical processes and direct interaction with humans and environments. Traditional approaches to robot safety planning either rely on conventional methods, which struggle with the complexity of modern robotic systems, or on pure machine learning techniques, which lack formal safety guarantees. While recent advances in Large Language Models (LLMs) offer promising capabilities, pre-trained LLMs alone lack the specific domain expertise required for effective robotic safety planning. This paper introduces SafeNet, a novel neural-symbolic network architecture that enhances LLMs' safety planning capabilities through formal method-guided fine-tuning for robotic applications. Our approach integrates formal logical knowledge and reward machines into pre-trained LLMs by carefully designed fine-tuning, creating a neural-symbolic approach that combines the flexibility of neural networks with the precision of formal methods for robot trajectory generation and task planning. Experimental results demonstrate significant improvements in safe trajectory generation for robotic systems, with planning success rates increasing from 1.17% to 91.60% for the block manipulation task and from 7.23% to 90.63% for the robotic path planning task.
|
| |
| 09:00-10:30, Paper ThI1I.261 | Add to My Program |
| An Entropy-Based Hybrid Local-Global Algorithm to Navigate Information-Sparse Environments |
|
| Carley, Bennett | Texas A&M University |
| O'Kane, Jason | Texas A&M University |
Keywords: Reactive and Sensor-Based Planning, Planning under Uncertainty
Abstract: We explore a navigation problem for a simple robot with extremely noisy sensing and significant movement uncertainty. We are particularly interested in environments containing large regions in which relatively little distinguishing sensor information is available to assist with localization. This paper proposes a navigation algorithm for this setting that strategically directs the robot through such regions when possible, but with a careful view of the need to regain relatively accurate localization at certain points in the execution. Reasoning directly about the robot's uncertainty, the approach utilizes a local entropy metric to identify regions where sensors have strong informative value. This metric informs the selection of coarse global paths that guide a more precise local planner. We discuss an implementation of this algorithm, and provide simulation results demonstrating its effectiveness in spite of large errors in both sensing and actuation.
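One plausible reading of the local entropy metric: histogram the readings a sensor would produce across a small neighborhood and compute Shannon entropy, so information-sparse (near-uniform) regions score low. The binning is an illustrative choice:

```python
import numpy as np

def neighborhood_entropy(expected_readings, bins=16):
    """Entropy of expected sensor readings over a local neighborhood.

    High entropy means readings vary nearby, so they can discriminate
    positions; low entropy marks regions of weak informative value.
    """
    hist, _ = np.histogram(expected_readings, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())
```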
|
| |
| 09:00-10:30, Paper ThI1I.262 | Add to My Program |
| Step Placement Swing Control for Powered Knee-Ankle Prostheses |
|
| Feldkamp, Michael | University of Minnesota |
| Humann, Rachel Gehlhar | University of Minnesota |
Keywords: Prosthetics and Exoskeletons, Motion Control, Modeling and Simulating Humans
Abstract: Humans engage in alternating locomotion patterns in daily life by continuously adjusting step placement. Step placement control in powered prostheses could benefit prosthesis users by supporting speed-adaptation and improving gait stability. This paper uses a data-driven predictive step placement model and a task-space swing controller to achieve human-like step placement patterns on a powered prosthesis platform in simulation. We designed the predictive model to estimate future desired step placement from current user-prosthesis states by analyzing biological gait patterns from a motion-capture dataset. We also present a novel 3D human-prosthesis simulation for evaluating prosthesis controllers with inputs from human walking experiments. In this simulation, we demonstrate our step placement controller with 22 subject models, each with 28 steady-state and 35 non-steady-state walking conditions. Simulation results show that this speed-adaptive control framework achieves human-like step placement and Margin of Stability patterns with respect to walking speed.
|
| |
| 09:00-10:30, Paper ThI1I.263 | Add to My Program |
| IVISION-2DCD: A Long-Term Change Detection Dataset for Large-Scale Outdoor Construction Monitoring |
|
| Mao, Dayou | National Research Council of Canada (NRC) |
| Lin, Yuchen | University of Waterloo |
| Ebadi, Ashkan | National Research Council of Canada (NRC) |
| Zelek, John S. | University of Waterloo |
| Wong, Alexander | University of Waterloo |
| Chen, Yuhao | University of Waterloo |
Keywords: Robotics and Automation in Construction, Data Sets for Robotic Vision, Computer Vision for Automation
Abstract: Automation in construction is essential for reducing costs and human errors in large-scale projects. We approach construction progress monitoring by detecting changes at construction sites. As buildings continue to evolve in geometry and appearance over time, change detection needs to be performed from arbitrary camera viewpoints. This necessitates developing 2D Change Detection (2DCD) algorithms that operate robustly across diverse camera perspectives at construction sites. While developing and evaluating such systems is data-intensive, no open-source benchmark dataset exists at the intersection of 2D change detection and construction automation research. Data collection using Unmanned Aerial Vehicles (UAVs) is gaining popularity in outdoor large-scale surveying. However, at active construction sites, conducting drone missions with high-end sensors raises safety concerns, and flight trajectories and collected camera viewpoints can be significantly limited. To address this critical gap, we introduce iVISION-2DCD, a large-scale dataset synthetically generated from dense LiDAR point clouds with photorealistic input images and accurate ground-truth annotations. Our dataset formally defines the problem of viewpoint-robust 2DCD at construction sites and captures the inherent complexities of real-world deployment. In this paper, we present our systematic methodology for synthetic data generation, developing novel view synthesis techniques to overcome bi-temporal alignment and viewpoint diversity challenges, and implementing semi-automated semantic segmentation with change label generation while preserving challenging real-world cases. Benchmark evaluations using state-of-the-art 2DCD algorithms demonstrate that iVISION-2DCD poses novel research challenges for the computer vision and robotics communities.
|
| |
| 09:00-10:30, Paper ThI1I.264 | Add to My Program |
| KeySG: Hierarchical Keyframe-Based 3D Scene Graphs |
|
| Werby, Abdelrhman | University of Stuttgart |
| Rotondi, Dennis | University of Stuttgart |
| Scaparro, Fabio | University of Stuttgart |
| Arras, Kai Oliver | University of Stuttgart |
Keywords: Semantic Scene Understanding, Mapping, RGB-D Perception
Abstract: In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM’s context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLMs to extract scene information, alleviating the need to explicitly model relationship edges between objects, enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical multi-modal retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across three distinct benchmarks (3D object semantic segmentation, functional element segmentation, and complex query retrieval), KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.
|
| |
| 09:00-10:30, Paper ThI1I.265 | Add to My Program |
| Graph Neural Model Predictive Control for High-Dimensional Systems |
|
| Benito Eberhard, Patrick | ETH Zurich |
| Pabon, Luis | Stanford University |
| Gammelli, Daniele | Stanford |
| Buurmeijer, Hugo | Stanford University |
| Lahr, Amon | ETH Zürich |
| Leone, Mark | Stanford University |
| Carron, Andrea | ETH Zurich |
| Pavone, Marco | Stanford University |
Keywords: Machine Learning for Robot Control, Modeling, Control, and Learning for Soft Robots, Optimization and Optimal Control
Abstract: The control of high-dimensional systems, such as soft robots, requires models that faithfully capture complex dynamics while remaining computationally tractable. This work presents a framework that integrates Graph Neural Network (GNN)-based dynamics models with structure-exploiting Model Predictive Control to enable real-time control of high-dimensional systems. By representing the system as a graph with localized interactions, the GNN preserves sparsity, while a tailored condensing algorithm eliminates state variables from the control problem, ensuring efficient computation. The complexity of our condensing algorithm scales linearly with the number of system nodes, and leverages Graphics Processing Unit (GPU) parallelization to achieve real-time performance. The proposed approach is validated in simulation and experimentally on a physical soft robotic trunk. Results show that our method scales to systems with up to 1,000 nodes at 100 Hz in closed-loop, and demonstrates real-time reference tracking on hardware with sub-centimeter accuracy, outperforming baselines by 63.6%. Finally, we show the capability of our method to achieve effective full-body obstacle avoidance.
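Classical condensing for linear(ized) dynamics illustrates what eliminating state variables means: stack the predictions as X = Sx x0 + Su U and optimize over inputs only. The dense loop below ignores the paper's sparsity- and GPU-exploiting structure and is purely illustrative:

```python
import numpy as np

def condense(A, B, N):
    """Prediction matrices for x_{k+1} = A x_k + B u_k over horizon N."""
    n, m = B.shape
    Sx = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(N)])
    Su = np.zeros((N * n, N * m))
    for r in range(N):                      # block row r predicts x_{r+1}
        for c in range(r + 1):
            Su[r*n:(r+1)*n, c*m:(c+1)*m] = np.linalg.matrix_power(A, r - c) @ B
    return Sx, Su                           # X = Sx @ x0 + Su @ U
```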
|
| |
| 09:00-10:30, Paper ThI1I.266 | Add to My Program |
| SPARR: Simulation-Based Policies with Asymmetric Real-World Residuals for Assembly |
|
| Guo, Yijie | NVIDIA |
| Akinola, Iretiayo | NVIDIA |
| Johannsmeier, Lars | NVIDIA |
| Hadfield, Hugo | NVIDIA |
| Gupta, Abhishek | University of Washington |
| Narang, Yashraj | NVIDIA |
Keywords: Reinforcement Learning, Assembly
Abstract: Robotic assembly presents a long-standing challenge due to its requirement for precise, contact-rich manipulation. While simulation-based learning has enabled the development of robust assembly policies, their performance often degrades when deployed in real-world settings due to the sim-to-real gap. Conversely, real-world reinforcement learning (RL) methods avoid the sim-to-real gap, but rely heavily on human supervision and lack the ability to generalize to environmental changes. In this work, we propose a hybrid approach that combines a simulation-trained base policy with a real-world residual policy to efficiently adapt to real-world variations. The base policy, trained in simulation using low-level state observations and dense rewards, provides strong priors for initial behavior. The residual policy, learned in the real world using visual observations and sparse rewards, compensates for discrepancies in dynamics and sensor noise. Extensive real-world experiments demonstrate that our method, SPARR, achieves near-perfect success rates across diverse two-part assembly tasks. Compared to state-of-the-art zero-shot sim-to-real methods, SPARR improves success rates by 38.4% while reducing cycle time by 29.7%. Moreover, SPARR requires no human expertise, in contrast to state-of-the-art real-world RL approaches that depend heavily on human supervision. Please visit the project webpage at https://research.nvidia.com/labs/srl/projects/sparr/
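The base-plus-residual composition can be summarized in a few lines. In this hedged sketch, the network interfaces, the tanh bound, and the residual scale are assumed for illustration; they are not taken from SPARR itself.

    # Sketch of base-plus-residual action composition.
    import numpy as np

    def residual_action(base_policy, residual_policy, state, image, residual_scale=0.1):
        # Base policy (sim-trained, low-level state) proposes an action;
        # the real-world residual (visual obs) applies a small correction.
        a_base = base_policy(state)                      # strong prior from simulation
        a_res = residual_policy(image, a_base)           # learned real-world correction
        return a_base + residual_scale * np.tanh(a_res)  # bounded residual

    # Dummy stand-ins to make the sketch executable:
    base = lambda s: np.zeros(6)
    residual = lambda img, a: np.ones(6)
    print(residual_action(base, residual, np.zeros(12), np.zeros((64, 64, 3))))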
|
| |
| 09:00-10:30, Paper ThI1I.267 | Add to My Program |
| U2E: Uncertainty-Aware Modeling and Uncertainty-Guided Exploration with Deep Ensemble for Quadrupedal Robot |
|
| Bai, Zitong | Beihang University |
| Gao, Yince | Beihang University |
| Liu, Naiyuan | Beihang University |
| Huang, Yiming | Zhejiang Sci-Tech University |
| Yu, Xiaolong | Hangzhou Innovation Institute, Beihang University |
| Wang, Wei | Beihang University |
Keywords: Legged Robots, Reinforcement Learning
Abstract: Reinforcement learning has facilitated agile locomotion in quadrupedal robots. However, most works remain highly dependent on the accuracy of simulation models in describing real-world robot dynamics. Consequently, policy transfer from simulation to hardware is still hindered by the well-known sim-to-real gap, which typically arises from modeling errors and the challenges of efficiently obtaining informative data in large state-action spaces. To address these challenges, this work proposes U2E, an innovative framework that integrates Uncertainty-aware actuator modeling with an Uncertainty-guided Exploration policy. The actuator model leverages a deep ensemble of neural networks to provide both precise predictions and uncertainty estimates, allowing for the assessment of model confidence and the identification of regions with inadequate data coverage. The exploration strategy then actively guides data collection to autonomously acquire informative real-world samples and refine actuator models, thereby enhancing compensation for simulation discrepancies. Experiments on quadrupedal locomotion tasks, including jumping and trajectory tracking, demonstrate that our approach reduces the sim-to-real gap and improves performance without depending on manually designed trajectories.
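A deep ensemble yields both a prediction and a disagreement signal in a few lines. The sketch below uses the standard deviation of the member predictions as the uncertainty that could guide exploration; this disagreement measure and the stand-in models are assumptions, not the paper's exact formulation.

    # Minimal deep-ensemble sketch: the ensemble mean serves as the actuator
    # prediction and the spread as the uncertainty that guides exploration.
    import numpy as np

    class DeepEnsemble:
        def __init__(self, models):
            self.models = models  # independently trained actuator nets

        def predict(self, x):
            preds = np.stack([m(x) for m in self.models])
            return preds.mean(axis=0), preds.std(axis=0)  # prediction, uncertainty

    ensemble = DeepEnsemble([lambda x, b=b: x * 0.9 + b for b in (0.0, 0.05, -0.05)])
    torque, sigma = ensemble.predict(np.array([1.0]))
    # High sigma flags poorly covered state-action regions worth exploring.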
|
| |
| 09:00-10:30, Paper ThI1I.268 | Add to My Program |
| PA-BiCoop: A Primary-Auxiliary Cooperative Framework for General Bimanual Manipulation |
|
| Bai, Qicheng | SGIT AI Lab, SGCC |
| Wang, Ziru | SGIT AI Lab, SGCC |
| Ma, Teli | The Hong Kong University of Science and Technology, Guangzhou |
| Dai, Guang | SGIT AI Lab, SGCC |
| Wang, Jingdong | Baidu |
| Wang, Mengmeng | Zhejiang University of Technology, Hangzhou |
Keywords: Bimanual Manipulation, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Bimanual manipulation is essential for advanced robotic systems because it offers higher efficiency and flexibility compared to single-arm configurations. However, existing approaches either lack inter-arm interaction or ignore the need for a dynamic division of labor, treating the arms as functionally equivalent. To address these limitations, this paper draws inspiration from human bimanual manipulation, where one arm handles core operations and the other provides auxiliary support, and proposes PA-BiCoop, a new single-model bimanual cooperation framework with dynamic primary-auxiliary arm differentiation. PA-BiCoop categorizes robotic arms into primary and auxiliary arms with adaptively adjustable roles across task stages, and employs two specialized decoders that share a global feature encoder: the primary decoder generates the primary arm’s base-coordinate pose and core-task affordance heatmaps, and the auxiliary decoder outputs the auxiliary arm’s relative pose in the primary arm’s coordinate system. Moreover, we design a dynamic role assignment module to automatically map roles to left/right arms without manual pre-definition. This design facilitates inter-arm knowledge sharing and coordinated manipulation. Extensive experiments demonstrate that our PA-BiCoop achieves superior performance: it outperforms state-of-the-art baselines by 48% on average in RLBench2 simulation tasks and by over 50% on average in real-world tasks, thereby verifying its effectiveness and advantages in bimanual manipulation.
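The shared-encoder, dual-decoder layout can be sketched as follows in PyTorch. Dimensions and the pose parameterization are placeholders; the affordance heatmaps and the dynamic role assignment module are omitted.

    # Sketch of a shared encoder feeding primary and auxiliary decoders.
    import torch
    import torch.nn as nn

    class PrimaryAuxiliarySketch(nn.Module):
        def __init__(self, feat=128, pose=7):  # pose: position + quaternion (assumed)
            super().__init__()
            self.encoder = nn.Linear(512, feat)                 # shared global features
            self.primary_head = nn.Linear(feat, pose)           # base-frame pose
            self.auxiliary_head = nn.Linear(feat + pose, pose)  # relative pose

        def forward(self, obs):
            z = torch.relu(self.encoder(obs))
            p = self.primary_head(z)                        # primary arm pose
            a = self.auxiliary_head(torch.cat([z, p], -1))  # in the primary arm's frame
            return p, a

    model = PrimaryAuxiliarySketch()
    primary, auxiliary = model(torch.randn(1, 512))

Conditioning the auxiliary head on the primary prediction mirrors the paper's idea that the supporting arm's pose is expressed relative to the primary arm.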
|
| |
| 09:00-10:30, Paper ThI1I.269 | Add to My Program |
| FineCycle: Towards a Full-Cycle Management Paradigm for Robotic Deployment and Development |
|
| Wang, Haolin | Shanghai Jiao Tong University |
| Xi, Wang | Shanghai Jiao Tong University |
| Zhu, Zhiyuan | Shanghai Jiao Tong University |
| Fang, Chongrong | Shanghai Jiao Tong University |
| He, Jianping | Shanghai Jiao Tong University |
Keywords: Computer Architecture for Robotic and Automation, Software, Middleware and Programming Environments
Abstract: Typical robotic workflows involve deploying applications from servers to robots for testing or distributing validated applications across a fleet to unify capabilities. Because these processes are often slowed by tedious environment configurations, improving deployment and development efficiency is essential. Existing solutions typically simplify deployment through containerization; however, they often lack integrated development environments, necessitating repetitive packaging for code modifications and hindering iterative efficiency. This paper proposes FINECYCLE, a management paradigm that encapsulates the entire software stack into a robotic image. By facilitating a complete "deploy-develop-store" cycle, FINECYCLE streamlines the transition between cross-host deployment and iterative refinement. Additionally, we open-source image templates compatible with this paradigm to reduce time costs for researchers and foster collaborative progress in the robotics community.
|
| |
| 09:00-10:30, Paper ThI1I.270 | Add to My Program |
| LLM-Guided Semantic Stereo Adaptive Visual Servoing for Precise Peg-In-Hole |
|
| Dong, Xiyue | Hong Kong Center for Logistics Robotics |
| Sun, Guangli | The Chinese University of Hong Kong |
| Hu, Jinfei | Hong Kong Centre for Logistic Robotics |
| Huang, Tianyu | Hong Kong Centre for Logistic Robotics |
| Chen, Wei | The Chinese University of Hong Kong |
| Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Robust/Adaptive Control, Visual Servoing, AI-Enabled Robotics
Abstract: Precision assembly tasks like peg-in-hole remain challenging for robotic manipulation. While visual servoing offers a robust framework, it depends heavily on accurate calibration and manual feature engineering. Learning-based methods, including vision-language models (VLMs), provide strong semantic understanding but often lack the precision needed for high-tolerance, contact-rich insertions. This paper introduces a novel framework that combines the semantic reasoning of large language models (LLMs) with adaptive visual servoing to bridge this gap. Our approach uses an LLM as a semantic feature extractor and correspondence engine for stereo visual servoing. The LLM processes generic point features from uncalibrated stereo images along with a task description in natural language, leveraging its spatial understanding to identify and correspond optimal features across views. These features drive a stereo adaptive visual servoing controller that estimates unknown calibration parameters online, enabling precise, calibration-free positioning. Extensive evaluations on cylindrical, square, and hexagonal peg-in-hole tasks across three trials demonstrate average success rates above 90% with steady-state errors of 1.8--2.8 pixels, closely comparable to calibrated methods (1.2--2.5 pixels). This is achieved without requiring prior models, calibration, or task-specific training, thereby advancing flexible and precise robotic assembly.
|
| |
| 09:00-10:30, Paper ThI1I.271 | Add to My Program |
| Learning Affordances at Inference-Time for Vision-Language-Action Models |
|
| Shah, Ameesh | UC Berkeley |
| Chen, William | UC Berkeley |
| Godbole, Adwait | University of California Berkeley |
| Mora, Federico | University of California, Berkeley |
| Seshia, Sanjit A. | University of California Berkeley |
| Levine, Sergey | UC Berkeley |
Keywords: Learning from Experience, Continual Learning, AI-Enabled Robotics
Abstract: Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong, and change our strategy accordingly to avoid making the same mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but lack the ability to contextually and dynamically readjust behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution and draws useful conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires careful treatment during assessment. Our experimental results demonstrate that LITEN is able to effectively learn from past experience to generate plans that use high-affordance instructions to accomplish long-horizon tasks.
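The reason/assess loop reads naturally as pseudocode. In the Python sketch below, the function names, memory format, and retry budget are assumptions chosen for illustration rather than LITEN's actual API.

    # Pseudocode-style sketch of an iterated reason/assess loop.
    def inference_time_learning(vlm_reason, vla_execute, vlm_assess, task, memory,
                                max_tries=5):
        for _ in range(max_tries):
            plan = vlm_reason(task, memory)        # condition on past experience
            video = vla_execute(plan)              # low-level VLA rollout (raw video)
            lesson, done = vlm_assess(task, plan, video)  # structured reflection
            memory.append(lesson)                  # grows the in-context experience
            if done:
                return plan, memory
        return None, memory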
|
| |
| 09:00-10:30, Paper ThI1I.272 | Add to My Program |
| SaferPath: Hierarchical Visual Navigation with Learned Guidance and Safety-Constrained Control |
|
| Zhang, Lingjie | The Hong Kong University of Science and Technology (Guangzhou) |
| Jiang, Zeyu | The Hong Kong University of Science and Technology (Guangzhou) |
| Chen, Changhao | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Integrated Planning and Learning, Integrated Planning and Control, Vision-Based Navigation
Abstract: Visual navigation is a core capability for mobile robots, yet end-to-end learning-based methods often struggle with generalization and safety in unseen, cluttered, or narrow environments. These limitations are especially pronounced in dense indoor settings, where collisions are likely and end-to-end models frequently fail. To address this, we propose SaferPath, a hierarchical visual navigation framework that leverages learned guidance from existing end-to-end models and refines it through a safety-constrained optimization-control module. SaferPath transforms visual observations into a traversable-area map and refines guidance trajectories using Model Predictive Stein Variational Evolution Strategy (MP-SVES), efficiently generating safe trajectories in only a few iterations. The refined trajectories are tracked by an MPC controller, ensuring robust navigation in complex environments. Extensive experiments in scenarios with unseen obstacles, dense unstructured spaces, and narrow corridors demonstrate that SaferPath consistently improves success rates and reduces collisions, outperforming representative baselines such as ViNT and NoMaD, and enabling safe navigation in challenging real-world settings.
|
| |
| 09:00-10:30, Paper ThI1I.273 | Add to My Program |
| SafeDMPs: Integrating Formal Safety with DMPs for Adaptive HRI |
|
| Tiwari, Pranav | Indian Institute of Science |
| Nath, Soumyodipta | Indian Institute of Science |
| Prakash, Ravi | Indian Institute of Science |
Keywords: Safety in HRI, Imitation Learning, Robust/Adaptive Control
Abstract: Robots operating in human-centric environments must be both robust to disturbances and provably safe from collisions. Achieving these properties simultaneously and efficiently remains a central challenge. While Dynamic Movement Primitives (DMPs) offer inherent stability and generalization from single demonstrations, they lack formal safety guarantees. Conversely, formal methods like Control Barrier Functions (CBFs) provide provable safety but often rely on computationally expensive, real-time optimization, hindering their use in high-frequency control. This paper introduces SafeDMPs, a novel framework that resolves this trade-off. We integrate the closed-form efficiency and dynamic robustness of DMPs with a provably safe, non-optimization-based control law derived from Spatio-Temporal Tubes (STTs). This synergy allows us to generate motions that are not only robust to perturbations and adaptable to new goals, but also guaranteed to avoid static and dynamic obstacles. Our approach achieves a closed-form solution for a problem that traditionally requires online optimization. Experimental results on a 7-DOF robot manipulator demonstrate that SafeDMPs is orders of magnitude faster and more accurate than optimization-based baselines, making it an ideal solution for real-time, safe, and collaborative robotics.
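For readers unfamiliar with DMPs, the sketch below rolls out a standard one-DOF Ijspeert-style transformation system, which converges to the goal in closed form without online optimization. The gains, the zero forcing term, and the omission of the paper's Spatio-Temporal Tube safety layer are all simplifying assumptions.

    # Minimal one-DOF DMP rollout (standard transformation system).
    import numpy as np

    def dmp_rollout(y0, g, T=1.0, dt=0.01, alpha=25.0, beta=6.25, alpha_x=3.0,
                    forcing=lambda x: 0.0):
        y, z, x = y0, 0.0, 1.0
        traj = [y]
        for _ in range(int(T / dt)):
            f = forcing(x) * x * (g - y0)                  # phase-gated forcing term
            z += dt * (alpha * (beta * (g - y) - z) + f)   # spring-damper toward goal
            y += dt * z
            x += dt * (-alpha_x * x)                       # canonical phase decay
            traj.append(y)
        return np.array(traj)

    print(dmp_rollout(0.0, 1.0)[-1])  # converges toward the goal g

A learned forcing term shapes the path from a demonstration, while the spring-damper structure guarantees convergence; SafeDMPs adds the tube-based safety constraint on top of this.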
|
| |
| 09:00-10:30, Paper ThI1I.274 | Add to My Program |
| GeVI-SLAM: Gravity-Enhanced Stereo VI SLAM for Underwater Robots |
|
| Shen, Yuan | The Chinese University of Hong Kong, Shenzhen |
| Hong, Yuze | The Chinese University of Hong Kong, Shenzhen |
| Zeng, Guangyang | The Chinese University of Hong Kong, Shenzhen |
| Zhang, Tengfei | The Chinese University of Hong Kong, Shenzhen |
| Chui, Pui Yi | The Chinese University of Hong Kong |
| Hong, Ziyang | Heriot-Watt University |
| Wu, Junfeng | The Chinese University of Hong Kong, Shenzhen |
Keywords: Marine Robotics, Visual-Inertial SLAM, Localization
Abstract: Accurate visual–inertial simultaneous localization and mapping (VI SLAM) for underwater robots remains a significant challenge due to frequent visual degeneracy and insufficient inertial measurement unit (IMU) motion excitation. In this paper, we present GeVI-SLAM, a gravity-enhanced stereo VI SLAM system designed to address these issues. By leveraging the stereo camera's direct depth estimation ability, we eliminate the need to estimate scale during IMU initialization, enabling stable operation even under low-acceleration dynamics. With precise gravity initialization, we decouple the pitch and roll from the pose estimation and solve a 4 degrees of freedom (DOF) Perspective-n-Point (PnP) problem for pose tracking. This allows the use of a minimal 3-point solver, which significantly reduces computational time to reject outliers within a Random Sample Consensus (RANSAC) framework. We further propose a bias-eliminated 4-DOF PnP estimator with provable consistency, ensuring the relative pose converges to the true value as the feature number increases. To handle dynamic motion, we refine the full 6-DOF pose while jointly estimating the IMU covariance, enabling adaptive weighting of the gravity prior. Extensive experiments on simulated and real-world data demonstrate that GeVI-SLAM achieves higher accuracy and greater stability compared to state-of-the-art methods.
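The gravity-decoupling step can be illustrated with a few lines of NumPy: once roll and pitch are fixed by the gravity estimate, the unknown rotation reduces to a single yaw angle. The Rodrigues-style tilt construction below is a standard alignment used here for illustration, not necessarily the paper's derivation.

    # With gravity known, only yaw (and translation) remain to be estimated.
    import numpy as np

    def tilt_from_gravity(g_body):
        # Rotation aligning the measured body gravity to the world -z axis.
        g = g_body / np.linalg.norm(g_body)
        z = np.array([0.0, 0.0, -1.0])
        v, c = np.cross(g, z), float(g @ z)
        vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        return np.eye(3) + vx + vx @ vx / (1.0 + c)  # Rodrigues; assumes c != -1

    def pose_rotation(yaw, g_body):
        cz, sz = np.cos(yaw), np.sin(yaw)
        Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
        return Rz @ tilt_from_gravity(g_body)  # 1 rotational DOF + 3 translational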
|
| |
| 09:00-10:30, Paper ThI1I.275 | Add to My Program |
| GIFT: Generalizing Intent for Flexible Test-Time Rewards |
|
| Amin, Fin | North Carolina State University |
| Dennler, Nathaniel | MIT |
| Bobu, Andreea | MIT |
Keywords: Human-Centered Automation, Learning from Experience, Human-Centered Robotics
Abstract: Robots learn reward functions from user demonstrations, but these rewards often fail to generalize to new environments. This failure occurs because learned rewards latch onto spurious correlations in training data rather than the underlying human intent that demonstrations represent. Existing methods leverage visual or semantic similarity to improve robustness, yet these surface-level cues often diverge from what humans actually care about. We present Generalizing Intent for Flexible Test-Time rewards (GIFT), a framework that grounds reward generalization in human intent rather than surface cues. GIFT leverages language models to infer high-level intent from user demonstrations by contrasting preferred with non-preferred behaviors. At deployment, GIFT maps novel test states to behaviorally equivalent training states via intent-conditioned similarity, enabling learned rewards to generalize across distribution shifts without retraining. We evaluate GIFT on tabletop manipulation tasks with new objects and layouts. Across four simulated tasks with over 50 unseen objects, GIFT consistently outperforms visual and semantic similarity baselines in test-time pairwise win rate and state-alignment F1 score. Real-world experiments on a 7-DoF Franka Panda robot demonstrate that GIFT reliably transfers to physical settings. Further discussion can be found at https://mit-clear-lab.github.io/GIFT/
|
| |
| 09:00-10:30, Paper ThI1I.276 | Add to My Program |
| CATALYST: Cognitive-To-Autonomy-Inspired Two-Stage Training Data Generation with Local-System-Aware Selection Technique |
|
| Kim, Taehoon | DGIST (Daegu Gyeongbuk Institute of Science & Technology) |
| Oh, Sehoon | DGIST |
Keywords: Data Sets for Robot Learning, Integrated Planning and Learning, Task and Motion Planning
Abstract: In conventional learning-based robotic dynamics modeling, physical information is mostly incorporated into the model or loss function, while the design of training data often relies on random sampling or uniform coverage, which can limit performance. To address this gap, this paper proposes the CATALYST framework, which generates optimal training data based on physics priors and the modeling structure of the chosen learning model. Stage 1 uses the CAD-derived inertia matrix M(q) to approximate the joint distribution of [q, M] with a PLM, thereby identifying the optimal locations for the local model centers (mu_k^{opt}). Stage 2 then optimizes an Operating-Point-Centered Excitation Trajectory (OPCET). This optimization simultaneously (i) aligns the trajectory with the target operating points (l_m), (ii) enforces range-of-motion (RoM) constraints (l_r), and (iii) achieves desirable velocity–acceleration statistics (large volume, isotropy, low correlation, captured by l_s). We validate the approach in simulation using a 3-DoF yaw–pitch–pitch manipulator, which allows visual demonstration of the process and outcomes. We then analyze the framework step by step. Results show that each stage meets its objective. A PLM trained on data generated by the proposed trajectories outperforms baselines (Spread/RoM, ill-centered, Tukey-windowed chirp, and cubic) in both torque regression and control. Thus, CATALYST yields more accurate regression and more reliable feedforward control than conventional designs.
|
| |
| 09:00-10:30, Paper ThI1I.277 | Add to My Program |
| CDF-Glove: A Cable-Driven Force Feedback Glove for Dexterous Teleoperation |
|
| Liang, Huayue | Tsinghua University |
| Li, Ruochong | PsiBot |
| Chen, Yuanpei | South China University of Technology |
| Yang, Yaodong | Peking University |
| Zeng, Long | Tsinghua University |
| Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School at Shenzhen, Tsinghua University, 518055 Shenzhen, China |
|
|
| |
| 09:00-10:30, Paper ThI1I.278 | Add to My Program |
| Geometric Backstepping Control of Omnidirectional Tiltrotors Incorporating Servo–Rotor Dynamics for Robustness against Sudden Disturbances |
|
| Lee, Jaewoo | Seoul National University |
| Lee, Dongjae | Carnegie Mellon University |
| Lee, Jinwoo | Seoul National University |
| Lee, Hyungyu | University of Illinois Urbana-Champaign |
| Kim, Yeonjoon | Seoul National University |
| Kim, H. Jin | Seoul National University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Motion Control
Abstract: This work presents a geometric backstepping controller for a variable-tilt omnidirectional multirotor that explicitly accounts for both servo and rotor dynamics. Considering actuator dynamics is essential for more effective and reliable operation, particularly during aggressive flight maneuvers or recovery from sudden disturbances. While prior studies have investigated actuator-aware control for conventional and fixed-tilt multirotors, these approaches rely on linear relationships between actuator input and wrench, which cannot capture the nonlinearities induced by variable tilt angles. In this work, we exploit the cascade structure between the rigid-body dynamics of the multirotor and its nonlinear actuator dynamics to design the proposed backstepping controller and establish exponential stability of the overall system. Furthermore, we reveal parametric uncertainty in the actuator model through experiments, and we demonstrate that the proposed controller remains robust against such uncertainty. The controller was compared against a baseline that does not account for actuator dynamics across three experimental scenarios: fast translational tracking, rapid rotational tracking, and recovery from sudden disturbance. The proposed method consistently achieved better tracking performance, and notably, while the baseline diverged and crashed during the fastest translational trajectory tracking and the recovery experiment, the proposed controller maintained stability and successfully completed the tasks, thereby demonstrating its effectiveness.
|
| |
| 09:00-10:30, Paper ThI1I.279 | Add to My Program |
| Designing Latent Safety Filters Using Pre-Trained Vision Models |
|
| Tabbara, Ihab | Washington University in St. Louis |
| Yang, Yuxuan | Washington University in Saint Louis |
| Hamzeh, Ahmad | Washington University in St. Louis |
| Astafyev, Maxwell | Washington University in St. Louis |
| Sibai, Hussein | Washington University in St. Louis |
Keywords: Robot Safety, Collision Avoidance, Deep Learning for Visual Perception
Abstract: Ensuring safety of vision-based control systems remains a major challenge hindering their deployment in critical settings. Safety filters have gained increased interest as effective tools for ensuring the safety of classical control systems, but their applications in vision-based control settings have so far been limited. Pre-trained vision representations (PVRs) have been shown to be effective perception backbones for control in various robotics domains. In this paper, we are interested in examining their effectiveness when used for designing vision-based safety filters. We use them as backbones for classifiers defining failure sets, for Hamilton–Jacobi (HJ) reachability-based value functions, and for latent world models. We discuss the trade-offs between training from scratch, fine-tuning the PVRs, and freezing the PVRs when training the models they are backbones for. We also evaluate whether one of the PVRs is superior across all tasks, evaluate whether learned world models or Q-functions are better for switching decisions to safe policies, and discuss practical considerations for deploying these PVRs on resource-constrained devices. Our experiments show that compared to training representations from scratch, using PVRs as perception backbones for vision-based safety filters can reduce violation rates by 12.2%, and fine-tuning PVRs to the task can reduce them by 73.7%, while maintaining or improving task performance. Code is available at https://github.com/tabz23/Latent-Safety-Filters.
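A value-based safety filter reduces to a simple switching rule at runtime. In the sketch below, the sign convention of the learned HJ value, the margin, and the function interfaces are assumptions for illustration.

    # Sketch of latent-space safety switching with a learned value function.
    def filtered_action(encode, task_policy, safe_policy, value, obs, margin=0.0):
        z = encode(obs)                      # frozen or fine-tuned PVR backbone
        a_task = task_policy(z)
        if value(z, a_task) > margin:        # predicted to stay outside the failure set
            return a_task
        return safe_policy(z)                # fall back to the safety controller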
|
| |
| 09:00-10:30, Paper ThI1I.280 | Add to My Program |
| Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning |
|
| Musumeci, Emanuele | Sapienza University of Rome |
| Brienza, Michele | Sapienza University of Rome |
| Argenziano, Francesco | Sapienza University of Rome |
| Drid, Abdel Hakim | Mohamed Khider University of Biskra |
| Suriani, Vincenzo | University of Basilicata |
| Nardi, Daniele | Sapienza University of Rome |
| Bloisi, Domenico | International University of Rome UNINT |
Keywords: Task Planning, Semantic Scene Understanding, AI-Enabled Robotics
Abstract: Embodied agents need to plan and act reliably in real and complex 3D environments. Classical planning (e.g., PDDL) offers structure and guarantees, but in practice it fails under noisy perception and incorrect predicate grounding. On the other hand, Large Language Models (LLMs)-based planners leverage commonsense reasoning, yet frequently propose actions that are unfeasible or unsafe. Following recent works that combine the two approaches, we introduce ContextMatters, a framework that fuses LLMs and classical planning to perform hierarchical goal relaxation: the LLM helps ground symbols to the scene and, when the target is unreachable, it proposes functionally equivalent goals that progressively relax constraints, adapting the goal to the context of the agent's environment. Operating on 3D Scene Graphs, this mechanism turns many nominally unfeasible tasks into tractable plans and enables context-aware partial achievement when full completion is not achievable. Our experimental results show a +52.45% Success Rate improvement over the state-of-the-art LLMs+PDDL baseline, demonstrating the effectiveness of our approach. Moreover, we validate the execution of ContextMatters in a real-world scenario by deploying it on a TIAGo robot. Anonymized code, dataset, and supplementary materials are available to the community at https://anonymous.4open.science/r/context-matters-13ED.
|
| |
| 09:00-10:30, Paper ThI1I.281 | Add to My Program |
| Compositional Context Fine-Tuning Vision-Language Model for Complex Assembly Action Understanding from Videos |
|
| Zheng, Hao | New York University Abu Dhabi |
| Huang, Jinyi | The University of Auckland |
| Zheng, Tiantian | New York University Abu Dhabi |
| Xu, Xun | University of Auckland |
| Alhanai, Tuka | New York University Abu Dhabi |
Keywords: Computer Vision for Manufacturing, Recognition, Assembly
Abstract: Assembly action understanding is a key enabler for effective human-robot collaborative assembly, yet it remains challenging due to subtle motions and fine-grained hand–object interactions. We adapt vision-language models (VLMs) to this challenging domain with Compositional Context Fine-Tuning (CCFT), a method that decomposes assembly actions into semantic elements (Verb, Object, Tool) and fine-tunes VLMs to recognize each action element using templated question-answering pairs. This approach ensures near-deterministic outputs. To enable efficient and effective multi-task learning under limited data, a Layer-Partitioned Alternating Training (LP-AT) method is presented, which assigns distinct model layers to recognize specific action elements through element-specific low-rank adapters. LP-AT alternates weight updates across element-specific adapters, reducing cross-task interference while enabling per-adapter hyperparameter optimization. Furthermore, we create HA-ViD-VQA and IKEA-ASM-VQA datasets from existing assembly video datasets. Extensive experiments on these datasets demonstrate that our method consistently outperforms strong action recognition baselines while providing interpretable element-level predictions that can support diverse downstream applications. Code and dataset are released at https://github.com/x-labs-xyz/CCFT.
|
| |
| 09:00-10:30, Paper ThI1I.282 | Add to My Program |
| PPGuide: Steering Diffusion Policies with Performance Predictive Guidance |
|
| Wang, Zixing | Purdue University |
| Jha, Devesh | Mitsubishi Electric Research Laboratories |
| Qureshi, Ahmed H. | Purdue University |
| Romeres, Diego | Mitsubishi Electric Research Laboratories |
Keywords: Imitation Learning, Sensorimotor Learning, Representation Learning
Abstract: Diffusion policies have been shown to be highly effective at learning complex, multi-modal behaviors for robotic manipulation. However, errors in generated action sequences can compound over time, potentially leading to failure. Some approaches mitigate this by augmenting datasets with expert demonstrations or learning predictive world models, which can be computationally expensive. We introduce Performance Predictive Guidance (PPGuide), a lightweight, classifier-based framework that steers a pre-trained diffusion policy away from failure modes at inference time. PPGuide makes use of a novel self-supervised process: it uses attention-based multiple instance learning to automatically estimate which observation-action chunks from the policy's rollouts are relevant to success or failure. We then train a performance predictor on this self-labeled data. During inference, this predictor provides a real-time gradient to guide the policy toward more robust actions. We validated our proposed PPGuide across a diverse set of tasks from the Robomimic and MimicGen benchmarks, demonstrating consistent improvements in performance.
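Classifier-based guidance of a diffusion sampler typically amounts to adding the predictor's gradient to each denoising step. The sketch below shows that general pattern; the guidance scale, schedule, and interfaces are assumptions rather than PPGuide's exact sampler.

    # Sketch of classifier-guided denoising: nudge each intermediate action
    # chunk along the gradient of a learned performance predictor.
    import torch

    def guided_denoise_step(denoiser, predictor, a_t, obs, t, guide_scale=1.0):
        a_t = a_t.detach().requires_grad_(True)
        score = predictor(obs, a_t, t).sum()      # predicted success logit
        grad = torch.autograd.grad(score, a_t)[0]
        a_prev = denoiser(a_t, obs, t)            # ordinary reverse-diffusion step
        return a_prev + guide_scale * grad        # steer toward high-success actions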
|
| |
| 09:00-10:30, Paper ThI1I.283 | Add to My Program |
| Manipulator Generative Design Optimization for Orchard Environments |
|
| Rosette, Marcus | Oregon State University |
| Wang, Tianhai | The University of Tokyo |
| Burridge, James | The University of Tokyo |
| Guo, Wei | The University of Tokyo |
| Davidson, Joseph | Oregon State University |
Keywords: Evolutionary Robotics, Agricultural Automation, Kinematics
Abstract: Manipulators are essential for advancing orchard robotics tasks such as pruning and harvesting, which require precise, dexterous motion in cluttered and unstructured environments. Off-the-shelf industrial arms, while readily available, often lack the reach and dexterity required for these settings. In this paper we present a simulation-driven, multi-objective optimization framework for task-specific manipulator kinematics, leveraging the NSGA-II evolutionary algorithm and physics-based evaluation. Candidate designs are encoded with high-level parameters -- joint type, axis orientation, link length, and joint count -- then automatically generated as URDF models and evaluated in simulation for reachability, manipulability, torque demand, and motion planning cost. Trade-offs are revealed on a Pareto front, enabling exploration across diverse designs. The framework is demonstrated on a real-world tree pruning task, using collected 3D scans of expert-pruned trees and an automated prune point identification pipeline to generate target points to guide the optimization. Results show that the proposed approach produces task-specific manipulator designs with improved workspaces and reduced operational constraints compared to a commercial industrial arm, offering a viable pathway toward deployable agricultural manipulation systems.
|
| |
| 09:00-10:30, Paper ThI1I.284 | Add to My Program |
| Observer–Actor: Active Vision Imitation Learning with Sparse-View Gaussian Splatting |
|
| Wang, Yilong | Imperial College London |
| Qian, Cheng | TUM |
| Fan, Ruomeng | Imperial College London |
| Johns, Edward | Imperial College London |
Keywords: Imitation Learning, Dual Arm Manipulation, Perception for Grasping and Manipulation
Abstract: We propose Observer-Actor (ObAct), a novel framework for active vision imitation learning in which the observer moves to optimal visual observations for the actor. We study ObAct on a dual-arm robotic system equipped with wrist-mounted cameras. At test time, ObAct dynamically assigns observer and actor roles: the observer arm constructs a 3D Gaussian Splatting (3DGS) representation from three images, virtually explores this representation to find an optimal camera pose, then moves to this pose; the actor arm then executes a policy using the observer’s observations. This formulation enhances the clarity and visibility of both the object and the gripper in the policy's observations. As a result, we enable the training of ambidextrous policies on observations that remain closer to the occlusion-free training distribution, leading to more robust policies. We study this formulation with two existing imitation learning methods -- trajectory transfer and behavior cloning -- and experiments show that ObAct significantly outperforms static-camera setups: trajectory transfer improves by 145% without occlusion and 233% with occlusion, while behavior cloning improves by 75% and 143%, respectively. Videos are available at: https://obact.github.io.
|
| |
| 09:00-10:30, Paper ThI1I.285 | Add to My Program |
| Prepare before You Act: Learning from Humans to Rearrange Initial States |
|
| Dai, Yinlong | Virginia Tech |
| Keyser, Andre | Virginia Tech |
| Losey, Dylan | Virginia Tech |
Keywords: Imitation Learning, Visual Learning, Learning from Demonstration
Abstract: Imitation learning (IL) has proven effective across a wide range of manipulation tasks. However, IL policies often struggle when faced with out-of-distribution observations; for instance, when the target object is in a previously unseen position or occluded by other objects. In these cases, extensive demonstrations are needed for current IL methods to reach robust and generalizable behaviors. But when humans are faced with these sorts of atypical initial states, we often rearrange the environment for more favorable task execution. For example, a person might rotate a coffee cup so that it is easier to grasp the handle, or push a box out of the way so they can directly grasp their target object. In this work we seek to equip robot learners with the same capability: enabling robots to prepare the environment before executing their given policy. We propose ReSET, an algorithm that takes initial states --- which are outside the policy's distribution --- and autonomously modifies object poses so that the restructured scene is similar to training data. Theoretically, we show that this two step process (rearranging the environment before rolling out the given policy) reduces the generalization gap. Practically, our ReSET algorithm combines action-agnostic human videos with task-agnostic teleoperation data to i) decide when to modify the scene, ii) predict what simplifying actions a human would take, and iii) map those predictions into robot action primitives. Comparisons with diffusion policies, VLAs, and other baselines show that using ReSET to prepare the environment enables more robust task execution with equal amounts of total training data.
|
| |
| 09:00-10:30, Paper ThI1I.286 | Add to My Program |
| Feasible Rolling Trajectory Generation and Control for Tensegrity Robots |
|
| Liu, Songyuan | Beijing Institute of Technology |
| Yang, Qingkai | Beijing Institute of Technology |
| Tao, Zichen | Beijing Institute of Technology |
| Gui, Yun | Beijing Institute of Technology |
| Shi, Jiaxu | Beijing Institute of Technology |
| Fang, Hao | Beijing Institute of Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Motion Control, Dynamics
Abstract: Due to the multi-node and multi-contact motion characteristics of tensegrity robots, existing methods fail to generate feasible reference rolling trajectories, and controllers are also limited to open-loop approaches. To address this issue, we utilize motion decomposition to extract the motion phase that should be the primary focus. Subsequently, we propose a method combining form-finding-based critical configuration search and polynomial trajectories to generate feasible trajectories. Then, an iLQR controller that accounts for reducing actuator load is designed for trajectory tracking control. A key distinction from existing methods is that our approach eliminates the need for reset operations after each rolling cycle. The results of simulations and physical experiments demonstrate that the robot achieves continuous rolling, with an 18.3% improvement in speed and a 34.4% reduction in actuation load compared to existing works.
|
| |
| 09:00-10:30, Paper ThI1I.287 | Add to My Program |
| SuckTac: Camera-Based Tactile Sucker for Unstructured Surface Perception and Interaction |
|
| Yuan, Ruiyong | Shanghai Jiao Tong University |
| Ren, Jieji | Shanghai Jiao Tong University |
| Peng, Zhanxuan | Shanghai Jiao Tong University |
| Guo, Qianyu | Shanghai Jiao Tong University School of Medicine |
| Chen, Feifei | Shanghai Jiao Tong University |
| Gu, Guoying | Shanghai Jiao Tong University |
Keywords: Soft Sensors and Actuators, Biomimetics, Grasping
Abstract: Suckers are significant for robots in picking, transferring, manipulation and locomotion on diverse surfaces. However, conventional suckers lack high-fidelity tactile perception, which impedes them from resolving the fine-grained geometric features and interaction status of the target surface. This limits their robust performance with irregular objects and in complex, unstructured environments. Inspired by the adaptive structure and high-performance sensory capabilities of cephalopod suckers, we propose a novel, intelligent sucker, named SuckTac, that integrates a camera-based tactile sensor directly within its optimized structure to provide high-density perception and robust suction. Specifically, through joint structural optimization and a multi-material integrated casting technique, a camera and light source are embedded into the sucker, which enables in-situ, high-density perception of fine details such as surface shape, texture, and roughness. To further enhance robustness and adaptability, the sucker's mechanical design is also optimized by refining its profile, adding a compliant lip, and incorporating surface microstructure. Extensive experiments, including challenging tasks such as robotic cloth manipulation and soft mobile robot inspection, demonstrate the superior performance and broad applicability of the proposed system.
|
| |
| 09:00-10:30, Paper ThI1I.288 | Add to My Program |
| Stability-Aware Banked Turn Maneuver Control and Command Augmentation for 2-DOF Pendulum-Driven Spherical Robots |
|
| Pravecek, Derek | Texas A&M University |
| Jangale, Rishi | Texas A&M University |
| Villanueva, Aaron | Texas A&M University |
| Ambrose, Robert | Texas A&M University |
Keywords: Nonholonomic Mechanisms and Systems, Constrained Motion Planning, Field Robots
Abstract: Executing banked turns at elevated speeds poses significant dynamic challenges for 2-DOF pendulum-driven spherical robots. A steady-state torque balance reveals that centripetal loading at high speeds limits feasible roll angles and demands increasingly aggressive pendulum actuation. We derive a closed-form expression for the required pendulum angle and integrate this into a bank-aware Command Augmentation System (CAS) and control law that automatically alters infeasible commands. Experimental tests on Texas A&M RAD Lab's RoboBall II platform demonstrate that the CAS-equipped bank controller enables stable bank maneuvers at speeds up to 6 rad/s (1.83 m/s), where previous controllers fail, by dynamically limiting roll commands based on velocity and internal pressure.
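A back-of-envelope point-mass balance shows why centripetal loading limits feasible roll at speed: for a coordinated turn, tan(phi) = v * omega_turn / g, so the required roll grows quickly with forward speed and turn rate. This is a simplification of the paper's full torque balance, which also involves pendulum geometry; the yaw rate below is an assumed value, not a figure from the abstract.

    # Simplified coordinated-turn roll requirement (point-mass assumption).
    import math

    v, omega_turn, g = 1.83, 2.0, 9.81     # forward speed (m/s), assumed yaw rate
                                           # (rad/s), gravity (m/s^2)
    phi = math.atan2(v * omega_turn, g)
    print(f"required roll ~ {math.degrees(phi):.1f} deg")  # ~20 deg in this example

A command augmentation system can invert this relation online: given the current speed, it clamps commanded roll to the angle the pendulum can actually sustain.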
|
| |
| 09:00-10:30, Paper ThI1I.289 | Add to My Program |
| Obstacle-Aware IBVS Target Tracking Via Feature-Space Projection and Virtual Imaging Guidance with ADP-Shaped Terminal Cost |
|
| Li, Mingcong | Beijing Institute of Technology |
| Chen, Zhen | Beijing Institute of Technology |
| Liu, Xiangdong | Beijing Institute of Technology |
Keywords: Visual Servoing, Nonholonomic Mechanisms and Systems, Collision Avoidance
Abstract: We propose a visual-servoing and obstacle-avoidance controller for a wheeled mobile robot (WMR) with a two-axis gimbal camera that operates without mapping, using only vision and lightweight forward sensing. A task-allocation MPC with online terminal-cost iteration is introduced. Specifically, task projection in the image-feature space mitigates underactuation and coupling-induced local optima; Virtual Imaging Constraint Guidance (VICG) yields a visibility-preserving heading reference that steers the trajectory around obstacles; and an Approximate Dynamic Programming (ADP) module learns a context-aware terminal cost online, providing long-horizon guidance for mid-horizon prediction. Relying solely on image feedback plus lightweight ranging, the method coordinates the WMR and gimbal to accomplish obstacle avoidance and visual-servo tracking jointly. Hardware experiments validate the feasibility and effectiveness of the proposed approach.
|
| |
| 09:00-10:30, Paper ThI1I.290 | Add to My Program |
| Aerial Manipulation with Contact-Aware Onboard Perception and Hybrid Control |
|
| Zhan, Yuanzhu | Pennsylvania State University |
| Jiang, Yufei | Pennsylvania State University |
| Cao, Muqing | Carnegie Mellon University |
| Geng, Junyi | Pennsylvania State University |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Visual Servoing
Abstract: Aerial manipulation (AM) promises to move Unmanned Aerial Vehicles (UAVs) beyond passive inspection to contact-rich tasks such as grasping, assembly, and in-situ maintenance. Most prior AM demonstrations rely on external motion capture (MoCap) and emphasize position control for coarse interactions, limiting deployability. We present a fully onboard perception–control pipeline for contact-rich AM that achieves accurate motion tracking and regulated contact wrenches without MoCap. The main components are (1) an augmented visual–inertial odometry (VIO) estimator with contact-consistency factors that activate only during interaction, tightening uncertainty around the contact frame and reducing drift, and (2) image-based visual servoing (IBVS) to mitigate perception–control coupling, together with a hybrid force–motion controller that regulates contact wrenches and lateral motion for stable contact. Experiments show that our approach closes the perception-to-wrench loop using only onboard sensing, yielding a velocity estimation improvement of 66.01% at contact, reliable target approach, and stable force holding, pointing toward deployable, in-the-wild aerial manipulation.
|
| |
| 09:00-10:30, Paper ThI1I.291 | Add to My Program |
| CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding |
|
| Fang, Lihuang | Southern University of Science and Technology |
| Hu, Xiao | IDEA |
| Zou, Yuchen | Xi'an Jiaotong University |
| Zhang, Hong | SUSTech |
Keywords: Deep Learning for Visual Perception, SLAM, Visual Learning
Abstract: Deep stereo matching has advanced significantly on benchmark datasets through fine-tuning but falls short of the zero-shot generalization seen in foundation models in other vision tasks. We introduce CogStereo, a novel framework that addresses challenging regions, such as occlusions or weak textures, without relying on dataset-specific priors. CogStereo embeds implicit spatial cognition into the refinement process by using monocular depth features as priors, capturing holistic scene understanding beyond local correspondences. This approach ensures structurally coherent disparity estimation, even in areas where geometry alone is inadequate. CogStereo employs a dual-conditional refinement mechanism that combines pixel-wise uncertainty with cognition-guided features for consistent global correction of mismatches. Extensive experiments on Scene Flow, KITTI, Middlebury, ETH3D, EuRoC, and real-world data demonstrate that CogStereo not only achieves state-of-the-art results but also excels in cross-domain generalization, shifting stereo vision towards a cognition-driven approach. More details are available at https://github.com/lhfang228/CogStereo.
|
| |
| 09:00-10:30, Paper ThI1I.292 | Add to My Program |
| Structured Interfaces for Automated Reasoning with 3D Scene Graphs |
|
| Ray, Aaron | Massachusetts Institute of Technology |
| Arkin, Jacob | Massachusetts Institute of Technology |
| Biggie, Harel | Massachusetts Institute of Technology |
| Fan, Chuchu | Massachusetts Institute of Technology |
| Carlone, Luca | Massachusetts Institute of Technology |
| Roy, Nicholas | Massachusetts Institute of Technology |
Keywords: Semantic Scene Understanding, AI-Based Methods, Human-Robot Teaming
Abstract: In order to provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world, respectively. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text and insert it into the LLM's context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. Through evaluations on scene question-answering, instruction grounding, and scene graph updating tasks, we compare our approach to existing context window-based methods and a novel code generation method. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs, leading to large improvements in grounded language tasks while also substantially reducing the token count of the scene graph content.
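An illustrative retrieval in the spirit described above, issued through the standard Neo4j Python driver (pip install neo4j). The node labels, relationship type, properties, and connection details are hypothetical; the paper's actual 3DSG schema may differ.

    # Query a hypothetical 3DSG stored in Neo4j via Cypher.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    MATCH (r:Room)-[:CONTAINS]->(o:Object)
    WHERE o.category = $category
    RETURN r.name AS room, o.name AS object, o.position AS position
    LIMIT 25
    """

    with driver.session() as session:
        for record in session.run(query, category="mug"):
            print(record["room"], record["object"], record["position"])

The point of the interface is that the LLM emits only a small query like this and receives back only the matching subgraph, rather than the whole serialized scene graph.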
|
| |
| 09:00-10:30, Paper ThI1I.293 | Add to My Program |
| FineNav: A Versatile Framework Enhancing Ground Robot Navigation in Unstructured Environment |
|
| Wang, Jinghui | Shanghai Jiao Tong University |
| Wang, Chenyang | Shanghai Jiao Tong University |
| Cao, Yuxuan | Shanghai Jiao Tong University |
| Sun, Zelong | China University of Geosciences, Wuhan |
| Xi, Wang | Shanghai Jiao Tong University |
| He, Jianping | Shanghai Jiao Tong University |
Keywords: Software Architecture for Robotic and Automation, Software, Middleware and Programming Environments
Abstract: Autonomous navigation of ground robots in unstructured 3D environments remains a fundamental challenge, as it requires accommodating dynamic obstacles, non-planar ground, and multi-story structures within a unified framework. In this paper, we propose a versatile navigation framework named FineNav. It features a novel hierarchical mapping system that couples a high-rate local voxel grid for real-time perception with a scalable global octree for persistent storage. This design balances low-latency performance with large-scale mapping capabilities, enabling reliable navigation in unstructured environments. Moreover, the entire navigation pipeline is refactored into modular and reusable components, while maintaining compatibility with existing 2D navigation ecosystems. We validate FineNav on a wheeled robot, demonstrating its versatility across diverse scenarios. FineNav is released as open-source software for the community.
|
| |
| 09:00-10:30, Paper ThI1I.294 | Add to My Program |
| Controllable Collision Scenario Generation Via Collision Pattern Prediction |
|
| Chen, Pin-Lun | National Yang Ming Chiao Tung University |
| Kung, Chi-Hsi | National Tsing Hua University |
| Chang, Che-Han | MediaTek Inc |
| Chiu, Wei-Chen | National Chiao Tung University |
| Chen, Yi-Ting | National Yang Ming Chiao Tung University |
Keywords: Intelligent Transportation Systems, Collision Avoidance, Motion and Path Planning
Abstract: Evaluating the safety of autonomous vehicles (AVs) requires diverse, safety-critical scenarios, with collisions being especially important yet rare and unsafe to collect in the real world. Therefore, the community has been focusing on generating safety-critical scenarios in simulation. However, controlling attributes such as collision type and time-to-accident (TTA) remains challenging. We introduce a new task called controllable collision scenario generation, where the goal is to produce trajectories that realize a user-specified collision type and TTA, to investigate the feasibility of automatically generating desired collision scenarios. To support this task, we present COLLIDE, a large-scale collision scenario dataset constructed by transforming real-world driving logs into diverse collisions, balanced across five representative collision types and different TTA intervals. We propose a framework that predicts a Collision Pattern, a compact and interpretable representation that captures the spatial configuration of the ego and adversarial vehicles at impact, before rolling out full adversarial trajectories. Experiments show that our approach outperforms strong baselines in both collision rate and controllability. Furthermore, generated scenarios consistently induce higher planner failure rates, revealing limitations of existing planners. We demonstrate that fine-tuning planners on these scenarios improves their robustness, contributing to safer AV deployment across different collision scenarios.
|
| |
| 09:00-10:30, Paper ThI1I.295 | Add to My Program |
| Robust 3D Multi-Object Tracking for Autonomous Driving with Adaptive LiDAR-Visual Fusion and Multilevel Data Association |
|
| Jiang, Chao | University of South China |
| Wang, Chao | University of South China |
| Nie, Liang | University of South China |
| Zhang, Mingyue | University of South China |
| Zhou, Yuting | Hunan University of Science and Engineering |
Keywords: Visual Tracking, Object Detection, Segmentation and Categorization, Human Detection and Tracking
Abstract: To increase the safety and reliability of autonomous driving systems in complex traffic environments, this paper proposes a novel 3D multi-object tracking (MOT) method that integrates center-plane adaptive multi-sensor fusion, motion compensation, and multilevel data association. Unlike traditional methods, our approach employs a center-plane adaptive fusion strategy to align LiDAR and visual data precisely, mitigating errors in the target width caused by pose variations and improving tracking accuracy. To address vehicle motion-induced association errors in dynamic scenarios, we incorporate IMU and GPS data for high-frequency vehicle pose estimation and compensation, ensuring stable and robust target association. Additionally, a rotational geometric distance intersection-over-union (RGDIoU) cost function is introduced, combined with multilevel spatial indexing, to optimize data association efficiency and accuracy. The experimental results on benchmark datasets, including KITTI and nuScenes, demonstrate that our method achieves state-of-the-art (SOTA) performance across multiple tracking metrics, including HOTA and sAMOTA, while maintaining real-time performance at 90 FPS. Specifically, our method improves sAMOTA tracking accuracy by 13% over the best existing methods and achieves a HOTA score of 50.24%, surpassing all compared methods.
|
| |
| 09:00-10:30, Paper ThI1I.296 | Add to My Program |
| Optimal Prioritized Dissipation and Closed-Form Damping Limitation under Actuator Constraints for Haptic Interfaces |
|
| Celli, Camilla | Scuola Superiore Sant'Anna |
| Bini, Andrea | Scuola Superiore Sant'Anna |
| Novelli, Valerio | Scuola Superiore Sant'Anna |
| Filippeschi, Alessandro | Scuola Superiore Sant'Anna |
| Porcini, Francesco | PERCRO Laboratory, TeCIP Institute, Sant’Anna School of Advanced Studies, Pisa |
| Frisoli, Antonio | Scuola Superiore Sant'Anna |
|
|
| |
| 09:00-10:30, Paper ThI1I.297 | Add to My Program |
| AddTraX: Traction Enhancement of Tractors Using Additive Motor-Integrated Driving Wheel Units |
|
| Izumi, Takuten | The University of Tokyo |
| Nanakubo, Moe | The University of Tokyo |
| Honda, Koki | The University of Tokyo |
| Sasabe, Takashi | Kubota Corporation |
| Sakano, Tomoyoshi | Kubota Corporation |
| Iwami, Kenichi | Kubota Corporation |
| Fukui, Rui | The University of Tokyo |
Keywords: Field Robots, Cellular and Modular Robots, Mechanism Design
Abstract: This study proposes a novel concept of an electric tractor that can flexibly respond to tasks with different traction force requirements. The key idea is to attach motor-integrated additive driving wheel units (AddTraX) to the rear wheel of the tractor according to the required traction force. The required functions for the driving wheel units are to allow manual attachment of the driving wheel units by the operator, and to control the height of the driving wheel units while running so that all the wheels maintain contact with the ground. Driving experiments have been conducted using a single-side 1/4 scale model of the proposed driving wheel units and simplified models of several road conditions. On a paved road, attaching the additional driving wheel units enhances the traction force by 1.9 times, and wheel height control is unnecessary. On a soft unpaved road, traction force is increased by controlling the height of the driving wheel units when the vehicle weight is low. Furthermore, the experiments also confirm that the additional driving wheel units can help the vehicle overcome steps on uneven roads.
|
| |
| 09:00-10:30, Paper ThI1I.298 | Add to My Program |
| A Manta Ray Robot with Tunable Two-Dimensional Wing Stiffness |
|
| Fu, Sicheng | University of Wisconsin, Madison |
| Wang, Wei | University of Wisconsin-Madison |
Keywords: Biologically-Inspired Robots
Abstract: Manta rays achieve efficient and maneuverable swimming through flapping of their large pectoral fins, where stiffness plays a critical role in hydrodynamic performance. Most existing manta-ray robots employ fixed or one-dimensional compliance, limiting their ability to replicate the two-dimensional stiffness variation essential for traveling wave propulsion. This paper presents a manta ray–inspired robot equipped with an active stiffness control mechanism that enables reconfigurable, two-dimensional stiffness distributions in its pectoral fins. The design integrates a cable-driven actuation system with anisotropic disks, providing multiple distinct stiffness states that can be locked during operation. Mechanical characterization confirms periodic stiffness variation, with spanwise stiffness increasing by more than 30% and chordwise stiffness decreasing by about 10% as the disk rotates from 0° to 90°, then recovering from 90° to 180°. Robot experiments evaluate the influence of stiffness on fin kinematics, thrust generation, and free-swimming performance. Thrust tests demonstrate that stiffness substantially affects steady-state thrust; under certain conditions, the optimal setting produces up to five times more thrust than the least effective one. Free-swimming trials further reveal that stiffness alters swimming speed, with up to 20% variation observed in low-frequency, large-amplitude flapping. These results highlight the potential of active stiffness control to enhance the performance of bio-inspired underwater robots and provide new insights into the role of structural compliance in aquatic locomotion.
|
| |
| 09:00-10:30, Paper ThI1I.299 | Add to My Program |
| Rainbow-DemoRL: Combining Improvements in Demonstration-Augmented Reinforcement Learning |
|
| Bhatt, Dwait | University of California, San Diego |
| Chou, Shih-Chieh | National Sun Yat-Sen University |
| Atanasov, Nikolay | University of California, San Diego |
Keywords: Reinforcement Learning, Learning from Demonstration, Machine Learning for Robot Control
Abstract: Several approaches have been proposed to improve the sample efficiency of online reinforcement learning (RL) by leveraging demonstrations collected offline. The offline data can be used directly as transitions to optimize RL objectives, or offline policy and value functions can first be learned from the data and then used for online finetuning or to provide reference actions. While each of these strategies has shown compelling results, it is unclear which method has the most impact on sample efficiency, whether these approaches can be combined, and if there are cumulative benefits. We classify existing demonstration-augmented RL approaches into three categories and perform an extensive empirical study of their strengths, weaknesses, and combinations to isolate the contribution of each strategy and determine effective hybrid combinations for sample-efficient online RL. Our analysis reveals that directly reusing offline data and initializing with behavior cloning consistently outperform more complex offline RL pretraining methods for improving online sample efficiency.
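Two of the strategies compared in the study, direct reuse of offline transitions and behavior-cloning initialization, combine in a few lines. The sketch below uses illustrative stand-in interfaces, not the authors' code, to show how a replay buffer is seeded and a policy warm-started before online RL.

    # Sketch of combining demonstration-augmented RL strategies:
    # (i) seed the online replay buffer with demonstration transitions, and
    # (ii) initialize the policy by behavior cloning. Interfaces are assumed.
    def prepare_agent(agent, demos, bc_epochs=10):
        # (ii) behavior-cloning warm start
        for _ in range(bc_epochs):
            for obs, act, *_ in demos:
                agent.bc_update(obs, act)          # supervised action regression
        # (i) direct reuse of offline transitions during online RL
        for transition in demos:                   # (obs, act, rew, next_obs, done)
            agent.replay_buffer.add(*transition)
        return agent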
|
| |
| 09:00-10:30, Paper ThI1I.300 | Add to My Program |
| A Tactile Rubbing Gripper for Reliable Fabric Separation |
|
| Ling, Zhengrong | The Hong Kong University of Science and Technology |
| Huang, Zhenghao | The Hong Kong University of Science and Technology |
| Shen, Yajing | The Hong Kong University of Science and Technology |
Keywords: Grippers and Other End-Effectors, Grasping, Industrial Robots
Abstract: Automated fabric manipulation offers great potential for reducing labor requirements in textile manufacturing and domestic services. Yet, even the basic task of separating a single fabric layer poses substantial challenges for robots. Adhesive-based end-effectors suffer from limited material compatibility and environmental adaptability, while gripper-based designs, which primarily target crease grasping and rely on passive separation, are frequently unreliable. Current vision and tactile systems fail to detect the fabric separation surface. Given these mechanical and sensing constraints, existing separation solutions lack the ability to adjust the number of layers post-grasping, relying solely on single-attempt success. In this work, we propose a novel tactile-enhanced gripper capable of human-like rubbing motion for reliable cloth separation, which integrates a magnetic sensing system to monitor the separation process. Based on these components, we further develop a pipeline to realize rubbing-based separation. Extensive experiments show our gripper achieves a 96.67% separation success rate across 15 fabrics with varying weaving patterns, and the tactile system reaches 87.00% accuracy in sliding surface detection. Our work provides a novel mechanism for fabric layer separation, facilitating subsequent cloth manipulation.
|
| |
| 09:00-10:30, Paper ThI1I.301 | Add to My Program |
| Open-World Object Manipulation with Vision-Language-Action Models Via Synthetic Multi-Modal Data |
|
| Chen, Yefei | East China Normal University |
| Wen, Junjie | East China Normal University |
| Li, Jinming | Shanghai University |
| Zhou, Zhongyi | East China Normal University |
| Peng, Yaxin | Shanghai University |
| Shen, Chaomin | East China Normal University |
| Xu, Yi | Midea |
| Zhu, Yichen | Midea Group |
Keywords: Imitation Learning, Data Sets for Robot Learning
Abstract: Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of robot data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization—where a robot trained to perform a task with one object, such as "hand over the apple," struggles to transfer its skills to a semantically similar but visually different object, such as "hand over the peach." This gap in generalization to new objects beyond those in the same category has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as ObjectVLA. We design a lightweight image-text-data-synthesis pipeline, Search2Scene, which enables robots to generalize learned skills to novel objects without requiring explicit human demonstrations for each new target object. By leveraging vision-language pair data, our method provides a lightweight and scalable way to inject knowledge about the target object, establishing an implicit link between the object and the desired action. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate in selecting objects not seen during training. These results highlight the effectiveness of our approach in enabling object-level generalization and reducing the need for extensive human demonstrations, paving the way for more flexible and scalable robotic learning systems.
|
| |
| 09:00-10:30, Paper ThI1I.302 | Add to My Program |
| Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-To-Real Manipulation |
|
| Wang, Maggie | Stanford University |
| Tian, Stephen | Stanford University |
| Swann, Aiden | Stanford University |
| Shorinwa, Ola | Princeton University |
| Wu, Jiajun | Stanford University |
| Schwager, Mac | Stanford University |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Deep Learning in Grasping and Manipulation
Abstract: Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% for the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/.
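The uncertainty-aware fusion step can be pictured as precision-weighted averaging of two Gaussian beliefs: the VLM prior over a physical parameter and the online ensemble estimate. The sketch below is our own reading of the abstract, with illustrative values; the variable names and the ensemble-variance heuristic are assumptions, not the authors' implementation.

```python
# Minimal sketch: fuse a VLM prior with online ensemble estimates of a
# physical parameter (e.g., center-of-mass offset) by precision weighting.
import numpy as np

def fuse_estimates(prior_mean, prior_var, ensemble_preds):
    """Fuse a Gaussian VLM prior with an ensemble of online estimates."""
    online_mean = np.mean(ensemble_preds)
    online_var = np.var(ensemble_preds) + 1e-8   # ensemble spread as uncertainty
    w_prior, w_online = 1.0 / prior_var, 1.0 / online_var
    fused_mean = (w_prior * prior_mean + w_online * online_mean) / (w_prior + w_online)
    fused_var = 1.0 / (w_prior + w_online)
    return fused_mean, fused_var

# The VLM guesses ~0.10 m with high variance; interaction data sharpens it.
mean, var = fuse_estimates(0.10, 0.05, np.array([0.031, 0.028, 0.035, 0.030]))
print(f"fused estimate: {mean:.3f} ± {var ** 0.5:.3f} m")
```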
|
| |
| 09:00-10:30, Paper ThI1I.303 | Add to My Program |
| Look, Focus, Act: Efficient and Robust Robot Learning Via Human Gaze and Foveated Vision Transformers |
|
| Chuang, Ian | University of California, Davis |
| Zou, Jinyu | Tongji University |
| Lee, Andrew | University of California, Davis |
| Gao, Dechen | University of California, Davis |
| Soltani, Iman | University of California, Davis |
Keywords: Imitation Learning, Bimanual Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation, dramatically reducing visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement, and gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work in foveated image segmentation and given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation. For this purpose, we explore two approaches to gaze estimation: the first is a two-stage model that predicts gaze independently, using it to guide foveation and, subsequently, action prediction. The second integrates gaze into the action space, allowing the policy to jointly estimate gaze and actions end-to-end. Our results show that our method for foveated robot vision drastically reduces computational overhead and enhances robustness to background distractors. Notably, on certain high-precision tasks, foveated vision also improves performance, as reflected in higher success rates. Together, these findings suggest that human-inspired foveated visual processing offers untapped potential and should be further considered as a useful inductive bias in robotic vision systems. https://soltanilara.github.io/giava
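The token saving from foveated patch tokenization can be illustrated with a toy two-level scheme: fine patches near the gaze point, coarse patches in the periphery. The sketch below is our own simplification, not the released tokenizer; the patch sizes and foveal radius are assumed values.

```python
# Toy foveated tokenization: fine patches inside a foveal radius around the
# gaze point, coarse patches elsewhere, yielding far fewer tokens overall.
import numpy as np

def foveated_tokens(image, gaze_xy, fine=16, coarse=64, radius=96):
    """Return a list of (y, x, size) patch tokens for an H x W image."""
    H, W = image.shape[:2]
    tokens = []
    for y in range(0, H, coarse):
        for x in range(0, W, coarse):
            cy, cx = y + coarse / 2, x + coarse / 2
            if np.hypot(cy - gaze_xy[1], cx - gaze_xy[0]) < radius:
                # Foveal region: split this coarse cell into fine patches.
                tokens += [(yy, xx, fine)
                           for yy in range(y, min(y + coarse, H), fine)
                           for xx in range(x, min(x + coarse, W), fine)]
            else:
                tokens.append((y, x, coarse))   # periphery: one coarse token
    return tokens

img = np.zeros((224, 224, 3))
print(len(foveated_tokens(img, gaze_xy=(112, 112))),
      "tokens vs", (224 // 16) ** 2, "for uniform 16px patches")
```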
|
| |
| 09:00-10:30, Paper ThI1I.304 | Add to My Program |
| AIM-SLAM: Dense Monocular SLAM Via Adaptive and Informative Multi-View Keyframe Prioritization with Foundation Model |
|
| Jeon, Jinwoo | KAIST |
| Seo, Dong-Uk | Korea Advanced Institute of Science and Technology |
| Lee, Eungchang Mason | Korea Advanced Institute of Science and Technology |
| Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: SLAM, Mapping, Deep Learning for Visual Perception
Abstract: Recent advances in geometric foundation models have emerged as a promising alternative for addressing the challenge of dense reconstruction in monocular visual simultaneous localization and mapping (SLAM). Although geometric foundation models enable SLAM to leverage variable input views, previous methods remain confined to two-view pairs or fixed-length inputs without sufficient deliberation of geometric context for view selection. To tackle this problem, we propose AIM-SLAM, a dense monocular SLAM framework that exploits an adaptive and informative multi-view keyframe prioritization with dense pointmap predictions from the visual geometry grounded transformer (VGGT). Specifically, we introduce the selective information and geometric-aware multi-view adaptation (SIGMA) module, which employs voxel overlap and information gain to retrieve a candidate set of keyframes and adaptively determine its size. Furthermore, we formulate a joint multi-view Sim(3) optimization that enforces consistent alignment across selected views, substantially improving pose estimation accuracy. The effectiveness of AIM-SLAM is demonstrated on real-world datasets, where it achieves state-of-the-art pose estimation performance and accurate dense reconstruction results. Our system supports ROS integration, with code available at https://aimslam.github.io/.
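A rough sketch of the voxel-overlap ranking behind such keyframe prioritization follows. The voxel size, overlap threshold, and new-voxel gain heuristic are our assumptions; the actual SIGMA module additionally sizes the candidate set adaptively from information gain in a more principled way.

```python
# Sketch: rank candidate keyframes by how many new voxels they contribute,
# keeping only those with sufficient overlap with the current frame.
import numpy as np

def voxelize(points, voxel=0.05):
    """Map an (N, 3) point cloud to a set of integer voxel coordinates."""
    return set(map(tuple, np.floor(points / voxel).astype(int)))

def select_keyframes(current_pts, keyframe_pts, min_overlap=0.2, max_k=8):
    cur = voxelize(current_pts)
    scored = []
    for idx, pts in enumerate(keyframe_pts):
        vox = voxelize(pts)
        overlap = len(cur & vox) / max(len(cur), 1)
        gain = len(vox - cur)                    # voxels the keyframe would add
        if overlap >= min_overlap:
            scored.append((gain, idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:max_k]]    # adaptive: may return < max_k

rng = np.random.default_rng(0)
frames = [rng.uniform(0, 1, (500, 3)) + 0.1 * i for i in range(5)]
print(select_keyframes(frames[0], frames[1:]))
```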
|
| |
| 09:00-10:30, Paper ThI1I.305 | Add to My Program |
| A Modular, Wireless and Wearable Biosignal Acquisition Platform |
|
| Doukakis, Antonios | Tech Hive Labs |
| Smyrli, Aikaterini | Athena Research Center |
| Livadas, Makis | Brunel University London |
| De Melo Ribeiro, Henrique | Brunel University London |
| Alingal Meethal, Shadiya | University of Hertfordshire |
| Shahabian Alashti, Mohamad Reza | University of Hertfordshire |
| Lakatos, Gabriella | University of Hertfordshire |
| Holthaus, Patrick | University of Hertfordshire |
| Amirabdollahian, Farshid | The University of Hertfordshire |
Keywords: Wearable Robotics, Human Detection and Tracking, Human and Humanoid Motion Analysis and Synthesis
Abstract: We present a modular, wireless biosignal acquisition platform designed to enable scalable electromyography (EMG) and inertial measurement unit (IMU) sensing for wearable robotics applications. The system supports up to 64 EMG channels and integrates a 9-axis IMU, leveraging a distributed Leader-Follower board architecture. In this work, we demonstrate synchronised acquisition of 32 EMG channels together with IMU motion data in a fully wireless setup. The embedded firmware ensures low-latency, high-fidelity streaming at 1.4 kHz over a 2.4-GHz industrial, scientific and medical (ISM) band link. Benchmarking shows that the platform maintains uniformly strong performance across noise, power, footprint, bandwidth, and scalability, in contrast to existing designs that optimize only a single metric. Experimental demonstrations confirm reliable acquisition of high-density EMG and IMU signals across functional activities, highlighting the device’s robustness and wearability. The proposed system provides a compact and flexible solution for intent-aware wearable technologies, with applications in assistive exosuits, rehabilitation, and human–robot interaction.
|
| |
| 09:00-10:30, Paper ThI1I.306 | Add to My Program |
| Decoding Multi-Finger Motions and Grasp Types with Grasp-Specific Models and Lightmyography Based Muscle-Machine Interfaces |
|
| Wang, Zhe | University of Auckland |
| Guan, Bonnie | University of Auckland |
| Duan, Shifei | University of Auckland |
| Aw, Kean C. | The University of Auckland |
| Liarokapis, Minas | National Technical University of Athens |
Keywords: Intention Recognition, Gesture, Posture and Facial Expressions
Abstract: Efficiently decoding human movement and/or intention is essential for controlling advanced prosthetic and robotic systems. Various muscle-machine interfaces have been researched for this purpose, including electromyography and lightmyography based interfaces. However, the decoding effectiveness of lightmyography signals for multi-finger hand motions remains insufficiently explored. This study investigates the decoding of human multi-finger movements using different machine learning methods. Lightmyography and finger motion data were collected from six participants grasping five common objects. Data were preprocessed using the sliding window method and decoded using three machine learning algorithms: random forest, convolutional neural networks, and multi-layer perceptron. Moreover, models were trained in a grasp-specific manner, increasing decoding accuracy. Finally, statistical analysis demonstrated that the random forest model significantly outperformed the other methods, establishing it as the most suitable technique for decoding multi-finger motions from lightmyography signals.
|
| |
| 09:00-10:30, Paper ThI1I.307 | Add to My Program |
| CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation |
|
| Kim, Taeyun | KAIST |
| Choi, Alvin Jinsung | Korea Advanced Institute of Science and Technology |
| Hong, Dasol | KAIST |
| Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Autonomous Agents, Semantic Scene Understanding, AI-Enabled Robotics
Abstract: Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target’s association with room types using the LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target’s ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.
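The adaptive weighting of room versus object cues can be expressed as a convex combination of the two value maps, steered by the LLM-estimated room association of the target. This is a schematic formulation of our own, not the paper's exact map construction.

```python
# Sketch: fuse room-cue and object-cue value maps, weighted by how strongly
# the LLM ties the target object to a room type.
import numpy as np

def fused_value_map(room_map, object_map, room_association):
    """room_association in [0, 1]: 1 = target strongly tied to a room type."""
    w = float(np.clip(room_association, 0.0, 1.0))
    return w * room_map + (1.0 - w) * object_map

room_map = np.random.default_rng(0).uniform(size=(64, 64))
object_map = np.random.default_rng(1).uniform(size=(64, 64))
# "toilet" is room-predictable; "book" is better predicted by nearby objects.
frontier_values = fused_value_map(room_map, object_map, room_association=0.9)
print(np.unravel_index(frontier_values.argmax(), frontier_values.shape))
```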
|
| |
| 09:00-10:30, Paper ThI1I.308 | Add to My Program |
| Vision-Centric 4D Occupancy Forecasting and Planning Via Implicit Residual World Models |
|
| Mei, Jianbiao | Zhejiang University |
| Yang, Yu | Zhejiang University |
| Yang, Xuemeng | Shanghai Artificial Intelligence Laboratory |
| Wen, Licheng | Shanghai Artificial Intelligence Laboratory |
| Lv, Jiajun | Zhejiang University |
| Shi, Botian | Shanghai Artificial Intelligence Laboratory |
| Liu, Yong | Zhejiang University |
Keywords: Visual Learning, Computer Vision for Transportation, Motion and Path Planning
Abstract: End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common inefficiency in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle’s actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting–planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
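A minimal PyTorch sketch of the residual scheme, as we read it from the abstract: the previous BEV features serve as a temporal prior, and the network predicts only the action-conditioned change. The layer sizes, action broadcast, and two-layer head are illustrative assumptions, not the released model.

```python
# Sketch: predict next BEV state as previous state plus a learned residual.
import torch
import torch.nn as nn

class ResidualWorldModel(nn.Module):
    def __init__(self, bev_channels=64, action_dim=2):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, bev_channels)
        self.residual_head = nn.Sequential(
            nn.Conv2d(2 * bev_channels, bev_channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1),
        )

    def forward(self, prev_bev, action):
        # Broadcast the ego action over the BEV grid, then predict the change.
        B, C, H, W = prev_bev.shape
        act = self.action_proj(action).view(B, C, 1, 1).expand(B, C, H, W)
        residual = self.residual_head(torch.cat([prev_bev, act], dim=1))
        return prev_bev + residual        # previous state is a strong prior

model = ResidualWorldModel()
next_bev = model(torch.randn(1, 64, 50, 50), torch.randn(1, 2))
print(next_bev.shape)  # torch.Size([1, 64, 50, 50])
```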
|
| |
| 09:00-10:30, Paper ThI1I.309 | Add to My Program |
| Floating-Base Deep Lagrangian Networks |
|
| Schulze, Lucas | Technische Universität Darmstadt |
| Negri, Juliano | University of São Paulo |
| Barasuol, Victor | Istituto Italiano Di Tecnologia |
| Suzano Medeiros, Vivian | University of São Paulo |
| Becker, Marcelo | USP |
| Peters, Jan | Technische Universität Darmstadt |
| Arenz, Oleg | TU Darmstadt |
Keywords: Legged Robots, Model Learning for Control, Dynamics
Abstract: Grey-box methods for system identification combine deep learning with physics-informed constraints, capturing complex dependencies while improving out-of-distribution generalization. Despite the growing importance of floating-base systems such as humanoids and quadrupeds, current grey-box models ignore their specific physical constraints. For instance, the inertia matrix is not only positive definite but also exhibits branch-induced sparsity and input independence. Moreover, the 6×6 composite spatial inertia of the floating base inherits properties of single-rigid-body inertia matrices. As we show, this includes the triangle inequality on the eigenvalues of the composite rotational inertia. To address the lack of physical consistency in deep learning models of floating-base systems, we introduce a parameterization of inertia matrices that satisfies all these constraints. Inspired by Deep Lagrangian Networks (DeLaN), we train neural networks to predict physically plausible inertia matrices that minimize inverse dynamics error under Lagrangian mechanics. For evaluation, we collected and released a dataset on multiple quadrupeds and humanoids. In these experiments, our Floating-Base Deep Lagrangian Networks (FeLaN) achieve better overall performance on both simulated and real robots, while providing greater physical interpretability.
|
| |
| 09:00-10:30, Paper ThI1I.310 | Add to My Program |
| Simulation-Augmented Hysteresis Compensation in Continuum Surgical Robots Via Residual Learning |
|
| Jiang, Haolin | SDU Robotics, Maersk Mc-Kinney Moller Institute, University of Southern Denmark |
| Savarimuthu, Thiusius Rajeeth | University of Southern Denmark |
| Wu, Di | University of Southern Denmark |
|
|
| |
| 09:00-10:30, Paper ThI1I.311 | Add to My Program |
| Wearable, Fabric-Embedded Acoustic Waveguides for Meter-Scale Contact Localization and Force Sensing |
|
| Mason, Wilfred | McGill University |
| Ashkar, Jad | McGill University |
| Brenken, David | McGill University |
| Sedal, Audrey | McGill University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Wearable Robotics
Abstract: Large-area tactile sensing remains a key challenge for wearable and robotic applications, where solutions must balance resolution and complexity, manufacturability, and conformability to various geometries. While acoustic waveguides have been used for contact localization and force estimation at the centimeter scale, scaling this technology to limb-scale wearable devices is unexplored. In this work, we introduce a soft, wearable tactile sleeve based on wrapped, meter-length acoustic waveguides. By patterning waveguides on a sleeve, one-dimensional time-of-flight measurements are mapped to two-dimensional contact locations. This enables conformable coverage with sparse transducers, while preserving mechanical robustness by placing rigid electronics away from the contact surface. We contribute the design and fabrication of the waveguide-based tactile sensor, provide an in-depth characterization of sensor response and evaluate frameworks for contact localization and force estimation, and demonstrate system performance on a human arm. Results show that the time-of-flight-based localization approach generalizes across contact sizes and curved geometries. However, more work is required to achieve sensitive and reliable force estimates. This work establishes acoustic waveguides as a manufacturable and reconfigurable modality for wearable tactile skins.
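The core 1D-to-2D mapping can be sketched in a few lines: a measured time of flight gives an arc length along the waveguide, and the known wrapping pattern converts that arc length to coordinates on the sleeve. The helical layout and the propagation speed below are our illustrative assumptions, not the paper's calibrated values.

```python
# Sketch: time of flight -> arc length along the waveguide -> 2D sleeve
# coordinates, assuming a simple helical wrap around the limb.
SPEED_OF_SOUND = 340.0      # m/s in the waveguide (illustrative value)

def tof_to_arc_length(tof_s):
    """One-way time of flight (s) to distance along the waveguide (m)."""
    return SPEED_OF_SOUND * tof_s

def arc_length_to_sleeve_xy(s, circumference=0.30, pitch=0.02):
    """Helical wrap: map arc length s to (around-limb, along-limb) coords."""
    turns = s / circumference
    around = s % circumference               # position around the limb
    along = turns * pitch                    # axial advance per full turn
    return around, along

s = tof_to_arc_length(2.5e-3)                # 2.5 ms -> 0.85 m of waveguide
print(arc_length_to_sleeve_xy(s))
```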
|
| |
| 09:00-10:30, Paper ThI1I.312 | Add to My Program |
| COLA: Learning Human-Humanoid Coordination for Collaborative Object Carrying |
|
| Du, Yushi | The University of Hong Kong |
| Li, Yixuan | Beijing Institute of Technology |
| Jia, Baoxiong | Beijing Institute for General Artificial Intelligence |
| Lin, Yutang | Peking University |
| Zhou, Pei | The University of Hong Kong |
| Liang, Wei | Beijing Institute of Technology |
| Yang, Yanchao | Stanford University |
| Huang, Siyuan | Beijing Institute for General Artificial Intelligence |
Keywords: Human-Robot Collaboration, Legged Robots, Human Factors and Human-in-the-Loop
Abstract: Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids’ complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulation and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7% compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baselines. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
|
| |
| 09:00-10:30, Paper ThI1I.313 | Add to My Program |
| CMAR-Search: Commonsense and Memory Augmented Reasoning for Object Search in Dynamic Interactive Environments |
|
| Liao, Kaiyao | Beihang University |
| Li, Qingfeng | Beihang University |
| Zhang, Xinlei | Beihang University |
| Chen, Chen | Hangzhou Innovation Institute of Beihang University |
| Sun, Qing | Hangzhou Innovation Institute of Beihang University |
| Niu, Jianwei | Beihang University |
Keywords: Semantic Scene Understanding, Task and Motion Planning, Mobile Manipulation
Abstract: Dynamic interactive object search in large-scale human environments presents substantial challenges for existing methods. Current scene representations like 3D Scene Graphs (3DSG) only provide coarse-grained spatial segmentation and cannot identify functional areas such as storage or leisure areas. Without functional area understanding, existing methods are constrained to exhaustive sequential exploration at large scales, resulting in inefficient search behaviors—particularly in open-layout environments with numerous interactive objects such as drawers and cabinets. Moreover, these methods lack adaptability to environmental dynamics such as object relocations. To address these limitations, this paper proposes CMAR-Search, a novel framework built upon Commonsense and Memory Augmented Reasoning (CMAR). Our approach leverages commonsense about area functionalities and aggregates environmental memory to construct a Functional 3D Scene Graph (F3DSG), which organizes the environment into functional areas with their associated containers. Through this structured representation, CMAR enables hierarchical action planning at both macro-area and micro-container levels, empowering the system to efficiently identify and inspect semantically relevant areas for effective object search. Notably, CMAR continuously integrates real-time perception, accumulated memory, and commonsense to dynamically relocalize objects in changing environments. Extensive experiments in simulation and real-world settings demonstrate that CMAR-Search significantly surpasses state-of-the-art baselines in both success rate and search efficiency for object search in dynamic interactive environments.
|
| |
| 09:00-10:30, Paper ThI1I.314 | Add to My Program |
| Reimagination with Test-Time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control |
|
| Chen, Yuxin | University of California, Berkeley |
| Wei, Jianglan | Univerisity of California, Berkeley |
| Xu, Chenfeng | University of California, Berkeley |
| Li, Boyi | UC Berkeley |
| Tomizuka, Masayoshi | University of California |
| Bajcsy, Andrea | Carnegie Mellon University |
| Tian, Thomas | University of California, Berkeley |
Keywords: Robot Safety, Machine Learning for Robot Control, Model Learning for Control
Abstract: World models enable robots to “imagine” future observations given current observations and planned actions, and have been increasingly adopted as generalized dynamics models to facilitate robot learning. Despite their promise, these models remain brittle when encountering novel visual distractors such as objects and background elements rarely seen during training. Specifically, novel distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. In this work, we propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes in open-world scenarios where novel and unanticipated visual distractors are inevitable. Given the current robot observation, ReOI first detects visual distractors by identifying which elements of the scene degrade in physically implausible ways during world model prediction. Then, it modifies the current observation to remove these distractors and bring the observation closer to the training distribution. Finally, ReOI “reimagines” future outcomes with the modified observation and reintroduces the distractors post-hoc to preserve visual consistency for downstream planning and verification. We validate our approach on a suite of robotic manipulation tasks in the context of action verification, where the verifier needs to select desired action plans based on predictions from a world model. Our results show that ReOI is robust to both in-distribution and out-of-distribution visual distractors. Notably, it improves task success rates by up to 3× in the presence of novel distractors, significantly outperforming action verification that relies on world model predictions without imagination interventions.
|
| |
| 09:00-10:30, Paper ThI1I.315 | Add to My Program |
| NavMoE: Hybrid Model and Learning-Based Traversability Estimation for Local Navigation Via Mixture of Experts |
|
| He, Botao | University of Maryland |
| Shahidzadeh, Amir Hossein | University of Maryland |
| Chen, Yu | Carnegie Mellon University |
| Wu, Jiayi | University of Maryland, College Park |
| Guan, Tianrui | University of Maryland |
| Chen, Guofei | Carnegie Mellon University |
| Manocha, Dinesh | University of Maryland |
| Choset, Howie | Carnegie Mellon University |
| Chou, Glen | Georgia Institute of Technology |
| Fermüller, Cornelia | University of Maryland |
| Aloimonos, Yiannis | University of Maryland |
Keywords: Motion and Path Planning, Integrated Planning and Learning, Vision-Based Navigation
Abstract: This paper explores traversability estimation for robot navigation. A key bottleneck in traversability estimation lies in efficiently achieving reliable and robust predictions while accurately encoding both geometric and semantic information across diverse environments. We introduce Navigation via Mixture of Experts (NAVMOE), a hierarchical and modular approach for traversability estimation and local navigation. NAVMOE combines multiple specialized models for specific terrain types, each of which can be either a classical model-based or a learning-based approach that predicts traversability for specific terrain types. NAVMOE dynamically weights the contributions of different models based on the input environment through a gating network. Overall, our approach offers three advantages: First, NAVMOE enables traversability estimation to adaptively leverage specialized approaches for different terrains, which enhances generalization across diverse and unseen environments. Second, our approach significantly improves efficiency with negligible cost of solution quality by introducing a training-free lazy gating mechanism, which is designed to minimize the number of activated experts during inference. Third, our approach uses a two-stage training strategy that enables the training for the gating networks within the hybrid MoE method that contains nondifferentiable modules. Extensive experiments show that NAVMOE delivers a better efficiency and performance balance than any individual expert or full ensemble across different domains, improving cross-domain generalization and reducing average computational cost by 81.2% via lazy gating, with less than a 2% loss in path quality.
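The training-free lazy gating can be sketched as thresholded softmax gating: experts whose gate weight falls below a cutoff are never evaluated, and the surviving weights are renormalized. The expert functions, threshold, and logits below are stand-ins of ours, not the paper's models.

```python
# Sketch: mixture-of-experts traversability with lazy gating, so weakly
# weighted experts are skipped entirely at inference time.
import numpy as np

def lazy_moe_traversability(features, experts, gate_logits, tau=0.15):
    """Combine terrain experts, skipping those the gate weights below tau."""
    w = np.exp(gate_logits - gate_logits.max())
    w /= w.sum()                                  # softmax gate weights
    active = w >= tau                             # lazy gating: prune experts
    if not active.any():                          # keep at least the top expert
        active[np.argmax(w)] = True
    w = np.where(active, w, 0.0)
    w /= w.sum()                                  # renormalize survivors
    score = 0.0
    for keep, weight, expert in zip(active, w, experts):
        if keep:                                  # inactive experts never run
            score += weight * expert(features)
    return score

experts = [lambda f: 0.9,   # e.g., geometric model for flat ground
           lambda f: 0.4,   # learned model for vegetation
           lambda f: 0.1]   # learned model for mud
print(lazy_moe_traversability(None, experts, np.array([2.0, 0.5, -1.0])))
```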
|
| |
| 09:00-10:30, Paper ThI1I.316 | Add to My Program |
| FeatX: Controlled Robot Configuration with Flexible Feature Binding |
|
| Gyimah, Jude | Ruhr University Bochum |
| Knopp, Henriette | Ruhr University Bochum |
| Berger, Thorsten | Ruhr University Bochum |
| Pelliccione, Patrizio | Gran Sasso Science Institute (GSSI) |
Keywords: Industrial Robots
Abstract: Robotic systems often need to be configured for different dynamic execution environments, hardware, or non-functional properties, such as energy consumption. Configuration options, a.k.a. features, can be used to enable, disable, and calibrate different parts of the system, ranging from whole subsystems to individual components and lines of code. While configuration mechanisms are abundant in robotics, they are limited in expressiveness, and the configuration files are often distributed over the codebase in different artifacts, challenging the consistent declaration and enforcement of dependencies. In addition, robotic systems require flexibility, since features need to be activated or changed at different times in the lifecycle of the system, which can cause intricate dependencies, especially when they depend on other static or dynamic features. To prevent misconfiguration and undefined behavior, such configuration spaces need to be properly declared and managed. We present FeatX, a model-based configuration technique. It uses and extends feature models, but accounts for the specific needs in robotics. Specifically, it allows declaring features, their dependencies, as well as the allowed binding times and binding modes, while the configurator enforces correct configuration and reconfiguration, considering intricate semantics of such models. We designed the syntax and semantics of the FeatX language and implemented them in the configurator. Our prototype is implemented for ROS2 with a command-line interface (ros2cli). We evaluated it on realistic (re-)configuration scenarios.
|
| |
| 09:00-10:30, Paper ThI1I.317 | Add to My Program |
| Tactile Recognition of Both Shapes and Materials with Automatic Feature Optimization-Enabled Meta Learning |
|
| Zhao, Hongliang | Southeast University |
| Yang, Wenhui | Southeast University, Nanjing, China |
| Chen, Yang | Southeast University |
| Wang, Zhuorui | Southeast University |
| Liu, Baiheng | Southeast University |
| Qin, Longhui | Southeast University |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Representation Learning
Abstract: Tactile perception is indispensable for robots to implement various manipulations dexterously, especially in contact-rich scenarios. However, alongside the development of deep learning techniques, tactile perception suffers from training data scarcity and time-consuming learning in practical applications, since collecting a large amount of tactile data is costly and sometimes even impossible. Hence, we propose an automatic feature optimization-enabled prototypical network to realize meta learning, i.e., the AFOP-ML framework. As a "learning to learn" network, it not only adapts rapidly to new unseen classes with few shots, but also learns how to determine the optimal feature space automatically. Based on the four-channel signals acquired from a tactile finger, both shapes and materials are recognized. On a 36-category benchmark, it outperforms several existing approaches, attaining an accuracy of 96.08% in the 5-way-1-shot scenario, where only one example is available for training. Accuracy remains 88.7% in the extreme 36-way-1-shot case. The generalization ability is further validated through three groups of experiments involving unseen shapes, materials, and force/speed perturbations. This work additionally provides insights for interpreting recognition tasks and for the improved design of tactile sensors.
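The prototypical-network core of such a framework is compact enough to sketch: class prototypes are support-set means in the embedding space, and a query joins the nearest prototype. The feature-optimization stage is omitted here, and the random embeddings are purely illustrative.

```python
# Sketch of few-shot classification with class prototypes (5-way-1-shot style).
import numpy as np

def classify_by_prototype(support_embeddings, support_labels, query_embedding):
    """Assign the query to the class with the nearest prototype."""
    classes = np.unique(support_labels)
    prototypes = np.stack([support_embeddings[support_labels == c].mean(axis=0)
                           for c in classes])
    dists = np.linalg.norm(prototypes - query_embedding, axis=1)
    return classes[np.argmin(dists)]

rng = np.random.default_rng(1)
support = rng.normal(size=(5, 8))                # one embedding per class
labels = np.arange(5)
query = support[3] + 0.05 * rng.normal(size=8)   # a query near class 3
print(classify_by_prototype(support, labels, query))   # -> 3
```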
|
| |
| 09:00-10:30, Paper ThI1I.318 | Add to My Program |
| Dynamics-Aware Critical Neighbor Selection for Distributed Connectivity Maintenance in Multi-Agent Systems |
|
| Tan, Wei | China University of Geosciences |
| Chen, Xin | China University of Geosciences |
Keywords: Multi-Robot Systems, Distributed Robot Systems
Abstract: Maintaining connectivity in multi-agent systems often compromises task performance. Current strategies are frequently hampered by heavy communication loads and overly restrictive motion constraints. Furthermore, their local decision-making relies on static geometric information, neglecting agent dynamics. To address these shortcomings, this paper proposes a scalable, distributed framework centered on a novel dynamics-aware connection cost metric. This metric enables agents to prospectively select dynamically stable, task-compatible links, which are then enforced using control barrier functions (CBFs) within a cooperative optimization scheme. In multi-agent target-reaching tasks, simulations show our dynamics-aware metric reduces the final average goal distance by up to 26.1% compared to a static distance-based selection heuristic. Furthermore, our framework maintains persistent connectivity in highly dynamic scenarios, whereas a state-of-the-art algebraic connectivity-based method fails under limited communication bandwidth.
|
| |
| 09:00-10:30, Paper ThI1I.319 | Add to My Program |
| StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes |
|
| Wu, Zhengri | University of Sydney |
| Wang, Yiran | University of Sydney |
| Wen, Yu | The University of Sydney |
| Zhang, Zeyu | The Australian National University |
| Wu, Biao | University of Technology Sydney |
| Tang, Hao | Peking University |
Keywords: Deep Learning for Visual Perception, Transfer Learning, Deep Learning Methods
Abstract: Underwater stereo depth estimation provides accurate 3D geometry for robotics tasks such as navigation, inspection, and mapping, offering metric depth from low-cost passive cameras while avoiding the scale ambiguity of monocular methods. However, existing approaches face two critical challenges: (i) parameter-efficiently adapting large vision foundation encoders to the underwater domain without extensive labeled data, and (ii) tightly fusing globally coherent but scale-ambiguous monocular priors with locally metric yet photometrically fragile stereo correspondences. To address these challenges, we propose StereoAdapter, a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with a recurrent stereo refinement module. We further introduce dynamic LoRA adaptation for efficient rank selection and pre-training on the synthetic UW-StereoDepth-40K dataset to enhance robustness under diverse underwater conditions. Comprehensive evaluations on both simulated and real-world benchmarks show improvements of 6.11% on TartanAir and 5.12% on SQUID compared to state-of-the-art methods, while real-world deployment with the BlueROV2 robot further demonstrates the consistent robustness of our approach.
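The LoRA mechanism the framework builds on can be sketched as a frozen linear layer plus a trainable low-rank update. The fixed rank and scaling below are simplifications of ours; the paper additionally selects ranks dynamically.

```python
# Sketch: LoRA-adapted linear layer. The foundation weights stay frozen;
# only the low-rank factors A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze foundation weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus low-rank trainable update: W x + scale * (B A) x.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)   # 12288 trainable params vs ~590k frozen
```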
|
| |
| 09:00-10:30, Paper ThI1I.320 | Add to My Program |
| MachaGrasp: Morphology-Aware Cross-Embodiment Dexterous Hand Articulation Generation for Grasping |
|
| Zhang, Heng | Nanyang Technological University |
| Ma, Kevin Yuchen | National University of Singapore |
| Shou, Zheng | National University of Singapore |
| Lin, Weisi | Nanyang Technological University |
| Wu, Yan | A*STAR Institute for Infocomm Research |
Keywords: Grasping, Manipulation Planning, Multifingered Hands
Abstract: Dexterous grasping with multi-fingered hands remains challenging due to high-dimensional articulations and the cost of optimization-based pipelines. Existing end-to-end methods require training on large-scale datasets for specific hands, limiting their ability to generalize across different embodiments. We propose MachaGrasp, an eigengrasp-based, end-to-end framework for cross-embodiment grasp generation. From a hand’s morphology description, we derive a morphology embedding and an eigengrasp set. Conditioned on these, together with the object point cloud and wrist pose, an amplitude predictor regresses articulation coefficients in a low-dimensional space, which are decoded into full joint articulations. Articulation learning is supervised with a Kinematic-Aware Articulation Loss (KAL) that emphasizes fingertip-relevant motions and injects morphology-specific structure. In simulation on unseen objects across three dexterous hands, MachaGrasp attains a 91.9% average grasp success rate with <0.4s inference per grasp. With few-shot adaptation to an unseen hand, it achieves 85.6% success on unseen objects in simulation, and real-world experiments on this few-shot-generalized hand achieve an 87% success rate. The code and additional materials are available on our project website https://connor-zh.github.io/MachaGrasp/.
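Eigengrasp decoding itself is a one-line linear map: full joint articulations are reconstructed as a mean pose plus a weighted sum of basis vectors, with the amplitude predictor supplying the weights. The dimensions below (a 16-DoF hand, 4 eigengrasps) and the random basis are illustrative assumptions.

```python
# Sketch: decode low-dimensional eigengrasp amplitudes into joint angles.
import numpy as np

def decode_articulation(mean_pose, eigengrasps, amplitudes):
    """joint_angles = mean + E^T a, with E of shape (n_eigen, n_joints)."""
    return mean_pose + eigengrasps.T @ amplitudes

n_joints, n_eigen = 16, 4
rng = np.random.default_rng(0)
mean_pose = np.zeros(n_joints)
E = rng.normal(size=(n_eigen, n_joints))      # per-hand basis from morphology
amplitudes = np.array([0.8, -0.2, 0.1, 0.0])  # amplitude-predictor output
print(decode_articulation(mean_pose, E, amplitudes).shape)   # (16,)
```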
|
| |
| 09:00-10:30, Paper ThI1I.321 | Add to My Program |
| OmniDexGrasp: Generalizable Dexterous Grasping Via Foundation Model and Force Feedback |
|
| Wei, Yi-Lin | Sun Yat-Sen University |
| Luo, Zhexi | Sun Yat-Sen University |
| Lin, Yuhao | Sun Yat-Sen University |
| Lin, Mu | Sun Yat-Sen University |
| Liang, Zhizhao | Sun Yat-Sen University |
| Chen, Shuoyu | Sun Yat-Sen University |
| Zheng, Wei-Shi | Sun Yat-Sen University |
Keywords: Grasping, Dexterous Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Enabling robots to dexterously grasp and manipulate objects based on human commands is a promising direction in robotics. However, existing approaches are challenging to generalize across diverse objects or tasks due to the limited scale of semantic dexterous grasp datasets. Foundation models offer a new way to enhance generalization, yet directly leveraging them to generate feasible robotic actions remains challenging due to the gap between abstract model knowledge and physical robot execution. To address these challenges, we propose OmniDexGrasp, a generalizable framework that achieves omni-capabilities in user prompting, dexterous embodiment, and grasping tasks by combining foundation models with transfer and control strategies. OmniDexGrasp integrates three key modules: (i) foundation models are used to enhance generalization by generating human grasp images supporting omni-capability of user prompt and task; (ii) a human-image-to-robot-action transfer strategy converts human demonstrations into executable robot actions, enabling omni dexterous embodiment; (iii) a force-aware adaptive grasp strategy ensures robust and stable grasp execution. Experiments in simulation and on real robots validate the effectiveness of OmniDexGrasp on diverse user prompts, grasp tasks, and dexterous hands, and further results show its extensibility to dexterous manipulation tasks.
|
| |
| 09:00-10:30, Paper ThI1I.322 | Add to My Program |
| VCC: Efficient Voxel-Based Collision Checking Framework for Real-Time Robotic Motion Planning |
|
| Chen, Ching | National Yang Ming Chiao Tung University |
| Yeh, Tsung Tai | National Yang Ming Chiao Tung University |
Keywords: Software Architecture for Robotic and Automation, Embedded Systems for Robotic and Automation, Software-Hardware Integration for Robot Systems
Abstract: To navigate the environment with dynamic obstacles, a robot must continuously scan for them and find collision-free paths to reach a goal position. This process starts with receiving obstacle information in the form of a point cloud, followed by a pre-planning stage that involves preprocessing to remove unnecessary points and constructing an environment data structure. However, the pre-planning stage can consume more than 16× the runtime of the planning stage, slowing the robot’s reaction speed. Thus, in this work, we propose VCC, an efficient collision checking framework that primarily targets the pre-planning bottleneck. VCC first cleans the point cloud using Center-selective Voxel Filtering. It then divides the environment into voxels using Adaptive Workspace Voxelization and organizes them in a Multilevel Voxel Table (MVT). In addition, VCC manages the MVT in two memory pools to ensure high data locality and a SIMD-aligned data layout. During motion planning, the planner can perform low-latency SIMD-accelerated collision checking using the MVT. Compared to the state-of-the-art method, the experimental results show a 3.63× speedup in filtering. In terms of the environment data structure, MVT achieves a 220.48× speedup during construction and reduces memory usage by 97.73%. Additionally, VCC accelerates sampling-based planning by 1.94×. Altogether, VCC achieves an end-to-end speedup of 7.71× on the desktop CPU platform and 4.23× on the embedded computer platform, making real-time motion planning practical for resource-constrained edge devices.
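The payoff of a voxel-based structure is that each collision query reduces to hash lookups. The sketch below shows that principle in plain Python; VCC's multilevel voxel table, memory pools, and SIMD layout are not modeled here, and the voxel size and robot radius are assumed values.

```python
# Sketch: bin obstacle points into voxel keys once, then answer each
# collision query with set lookups over the voxels around the query point.
import numpy as np

class VoxelGrid:
    def __init__(self, points, voxel=0.05):
        self.voxel = voxel
        self.occupied = set(map(tuple, np.floor(points / voxel).astype(int)))

    def in_collision(self, p, robot_radius=0.10):
        """Check the voxels within the robot radius around query point p."""
        r = int(np.ceil(robot_radius / self.voxel))
        key = np.floor(np.asarray(p) / self.voxel).astype(int)
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dz in range(-r, r + 1):
                    if tuple(key + (dx, dy, dz)) in self.occupied:
                        return True
        return False

cloud = np.random.default_rng(0).uniform(0, 2, (5000, 3))
grid = VoxelGrid(cloud)
print(grid.in_collision([1.0, 1.0, 1.0]), grid.in_collision([5.0, 5.0, 5.0]))
```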
|
| |
| 09:00-10:30, Paper ThI1I.323 | Add to My Program |
| AutoFocus-IL: VLM-Based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations |
|
| Gong, Litian | University of California Riverside |
| Bahrani, Fatemeh | University of Southern California |
| Zhou, Yutai | University of Southern California |
| Banayeeanzade, Amin | University of Southern California |
| Li, Jiachen | University of California, Riverside |
| Bıyık, Erdem | University of Southern California |
Keywords: Imitation Learning, Learning from Demonstration, Representation Learning
Abstract: We present AutoFocus-IL, a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Saliency regularization has emerged as a promising way to achieve this, but existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Our findings highlight that VLM-driven saliency provides a scalable, annotation-free path toward robust imitation learning in robotics. Particularly, our experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. The supplementary materials, including code, datasets, and trained policy videos, are publicly available at https://AutoFocus-IL.github.io/.
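Saliency regularization of this kind can be sketched as an auxiliary loss that pulls the policy's attention map toward the VLM-generated saliency map. The KL form, the loss weight, and all shapes below are our assumptions rather than the paper's exact objective.

```python
# Sketch: behavior-cloning loss plus a KL term that aligns the policy's
# attention map with a VLM-derived saliency map.
import torch
import torch.nn.functional as F

def autofocus_loss(pred_action, gt_action, policy_attn, vlm_saliency, lam=0.1):
    """policy_attn / vlm_saliency: (B, H, W) maps that each sum to 1."""
    bc_loss = F.mse_loss(pred_action, gt_action)
    attn = policy_attn.flatten(1).clamp_min(1e-8)
    sal = vlm_saliency.flatten(1).clamp_min(1e-8)
    # KL(saliency || attention) pulls attention toward task-relevant regions.
    saliency_loss = (sal * (sal.log() - attn.log())).sum(dim=1).mean()
    return bc_loss + lam * saliency_loss

loss = autofocus_loss(torch.randn(4, 7), torch.randn(4, 7),
                      torch.softmax(torch.randn(4, 196), -1).view(4, 14, 14),
                      torch.softmax(torch.randn(4, 196), -1).view(4, 14, 14))
print(loss)
```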
|
| |
| 09:00-10:30, Paper ThI1I.324 | Add to My Program |
| Closed-Loop Cross-Scale Motion of Decoupled Light and Tendon Driven Miniature Continuum Robots |
|
| Zhou, Cheng | Shanghai Jiao Tong University |
| Qin, Xiaotong | Shanghai Jiao Tong University |
| Yu, Haoyang | Shanghai Jiao Tong University |
| Xia, Jingyuan | Shanghai Jiao Tong University |
| Xu, Zheng | Shanghai Jiao Tong University |
| Lin, Zecai | Shanghai Jiao Tong University |
| Gao, Anzhu | Shanghai Jiao Tong University |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: Small-scale robots are rapidly advancing in diverse fields such as industry and medicine. To be effective, they must be capable of accessing narrow, tortuous, or otherwise hard-to-reach environments and performing precise manipulation. This paper presents a vision-based closed-loop motion control scheme for a developed fiber-driven continuum robot for cross-scale motion. Function-multiplexed optical fibers are employed to achieve macro motion through fiber actuation and micro motion through light transmission within the fibers. An external eye-to-hand camera system observes a fiducial tag to estimate its 3D pose relative to the camera frame. The coordinate transformation between the tag and the end-effector is calibrated, along with the mapping between input laser power and light-induced joint contractions. A two-stage image-based visual servoing strategy is then implemented to guide the tag toward the target image position, thereby realizing closed-loop hybrid macro–micro motion through the developed kinematics and visual feedback. Point-tracking experiments demonstrate that the small-scale continuum robot, with an outer diameter of approximately 1.2 mm, can achieve precise cross-scale motion across workspaces ranging from tens of microns to the millimeter scale under the proposed control scheme. This work highlights the potential of hybrid macro–micro motion with visual servoing for deep access and high-precision operation in endoluminal interventions.
|
| |
| 09:00-10:30, Paper ThI1I.325 | Add to My Program |
| Can a Robot Walk the Robotic Dog: Triple-Zero Collaborative Navigation for Heterogeneous Multi-Agent Systems |
|
| Wang, Yaxuan | Peking University |
| Xiang, Yifan | Peking University |
| Li, Ke | Beijing University of Posts and Telecommunications |
| Zhang, Xun | Peking University |
| Ye, BoWen | Peking University |
| Fan, Zhuochen | Pengcheng Laboratory |
| Wei, Fei | Beijing Jinruyi Large Model Technology Co., LTD |
| Yang, Tong | Peking University |
Keywords: Cooperating Robots, Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: We present Triple Zero Path Planning (TZPP), a collaborative framework for heterogeneous multi-robot systems that requires zero training, zero prior knowledge, and zero simulation. TZPP employs a coordinator–explorer architecture: a humanoid robot handles task coordination, while a quadruped robot explores and identifies feasible paths using guidance from a multimodal large language model. We implement TZPP on Unitree G1 and Go2 robots and evaluate it across diverse indoor and outdoor environments, including obstacle-rich and landmark-sparse settings. Experiments show that TZPP achieves robust, human-comparable efficiency and strong adaptability to unseen scenarios. By eliminating reliance on training and simulation, TZPP offers a practical path toward real-world deployment of heterogeneous robot cooperation. Our code and video are provided at: https://github.com/triple-zeropp/Triple-zero-robot-agent
|
| |
| 09:00-10:30, Paper ThI1I.326 | Add to My Program |
| Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion |
|
| He, Honglin | UCLA |
| Ma, Yukai | Zhejiang University |
| Squicciarini, Brad | Coco Robotics |
| Wu, Wayne | University of California, Los Angeles |
| Zhou, Bolei | University of California, Los Angeles |
Keywords: Vision-Based Navigation, Imitation Learning, Wheeled Robots
Abstract: Sidewalk micromobility is a promising solution for last-mile transportation, but current learning-based control methods struggle in complex urban environments. Imitation learning (IL) learns policies from human demonstrations, yet its reliance on fixed offline data often leads to compounding errors, limited robustness, and poor generalization. To address these challenges, we propose a framework that advances IL through corrective behavior expansion and multi-scale imitation learning. On the data side, we augment teleoperation datasets with diverse corrective behaviors and sensor augmentations to enable the policy to learn to recover from its own mistakes. On the model side, we introduce a multi-scale IL architecture that captures both short-horizon interactive behaviors and long-horizon goal-directed intentions via horizon-based trajectory clustering and hierarchical supervision. Real-world experiments show that our approach significantly improves robustness and generalization in diverse sidewalk scenarios. Demo video and additional information are available on the project page.
|
| |
| 09:00-10:30, Paper ThI1I.327 | Add to My Program |
| A Pin-Array Structured Climbing Robot for Stable Locomotion on Steep Rocky Terrain |
|
| Nagaoka, Keita | Tohoku University |
| Uno, Kentaro | Tohoku University |
| Yoshida, Kazuya | Tohoku University |
Keywords: Field Robots, Grippers and Other End-Effectors, Mechanism Design
Abstract: Climbing robots face significant challenges when navigating unstructured environments, where reliable attachment to irregular surfaces is critical. We present a novel mobile climbing robot equipped with compliant pin-array structured grippers that passively conform to surface irregularities, ensuring stable ground gripping without the need for complicated sensing or control. Each pin features a vertically split design, combining an elastic element with a metal spine to enable mechanical interlocking with microscale surface features. Statistical modeling and experimental validation indicate that variability in individual pin forces and contact numbers are the primary sources of grasping uncertainty. The robot demonstrated robust and stable locomotion in indoor tests on inclined walls (10–30 degrees) and in outdoor tests on natural rocky terrain. This work highlights that a design emphasizing passive compliance and mechanical redundancy provides a practical and robust solution for real-world climbing robots while minimizing control complexity.
|
| |
| 09:00-10:30, Paper ThI1I.328 | Add to My Program |
| Adap-RPF: Adaptive Trajectory Sampling for Robot Person Following in Dynamic Crowded Environments |
|
| Situ, Weixi | Southern University of Science and Technology |
| Ye, Hanjing | Southern University of Science and Technology |
| Peng, Jianwei | Southern University of Science and Technology |
| Zhan, Yu | Southern University of Science and Technology |
| Zhang, Hong | SUSTech |
Keywords: Robot Companions, Human-Centered Automation, Surveillance Robotic Systems
Abstract: Robot person following (RPF) is a core capability in human–robot interaction, enabling robots to assist users in daily activities, collaborative work, and other service scenarios. However, achieving practical RPF remains challenging due to frequent occlusions, particularly in dynamic and crowded environments. Existing approaches often rely on fixed-point following or sparse candidate-point selection with oversimplified heuristics, which cannot adequately handle complex occlusions caused by moving obstacles such as pedestrians. To address these limitations, we propose an adaptive trajectory sampling method that generates dense candidate points within socially aware zones and evaluates them using a multi-objective cost function. Based on the optimal point, a person-following trajectory is estimated relative to the predicted motion of the target. We further design a prediction-aware model predictive path integral (MPPI) controller that simultaneously tracks this trajectory and proactively avoids collisions using predicted pedestrian motions. Extensive experiments show that our method outperforms state-of-the-art baselines in smoothness, safety, robustness, and human comfort, with its effectiveness further demonstrated on a mobile robot in real-world scenarios.
|
| |
| 09:00-10:30, Paper ThI1I.329 | Add to My Program |
| GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning |
|
| Li, Mingleyang | Peking University |
| Wang, Yuran | Peking University |
| Chen, Yue | Peking University |
| Chen, Tianxing | The University of Hong Kong |
| Liang, Jiaqi | Peking University |
| Shen, Zishun | Peking University |
| Lu, Haoran | Northwestern University |
| Wu, Ruihai | Peking University |
| Dong, Hao | Peking University |
Keywords: Manipulation Planning, Grasping, Representation Learning
Abstract: Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instructions to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ the visual segmentation model SAM2 to perform object segmentation on the garment pile, aiding VLM-based reasoning with sufficient visual cues. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: https://garmentpile2.github.io/.
|
| |
| 09:00-10:30, Paper ThI1I.330 | Add to My Program |
| UniDoorManip: Learning Universal Door Manipulation Policy Over Large-Scale and Diverse Door Manipulation Environments |
|
| Li, Yu | Beijing University of Posts and Telecommunications |
| Zhang, Xiaojie | Beijing University of Posts and Telecommunications |
| Wu, Ruihai | Peking University |
| Zhang, Zilong | Beijing University of Posts and Telecommunications |
| Geng, Yiran | Peking University |
| Dong, Hao | Peking University |
| He, Zhaofeng | Beijing University of Posts and Telecommunications |
Keywords: Perception for Grasping and Manipulation
Abstract: Learning a universal manipulation policy that encompasses doors with diverse categories, geometries, and mechanisms is crucial for future embodied agents to work effectively in complex and broad real-world scenarios. Due to limited datasets and unrealistic simulation environments, previous studies fail to achieve good performance across various doors. In this work, we build a novel door manipulation environment reflecting different realistic door manipulation mechanisms, and further equip this environment with a large-scale door dataset covering 6 door categories with hundreds of door bodies and handles, making up thousands of different door instances. To learn a universal policy over diverse doors, we propose a novel framework that disentangles the whole manipulation process into three stages and integrates them through conditional training. Extensive experiments validate the effectiveness of our designs and demonstrate our framework's strong performance in both simulation and the real world.
|
| |
| 09:00-10:30, Paper ThI1I.331 | Add to My Program |
| Informed Federated Learning to Train a Robotic Arm Inverse Dynamic Model |
|
| Jimenez-Perera, Gabriel | University of Granada |
| Valencia-Vidal, Brayan | University of Granada |
| Luque, Niceto R. | University of Granada |
| Ros, Eduardo | University of Granada |
| Barranco, Francisco | University of Granada |
Keywords: Deep Learning Methods, Data Sets for Robot Learning
Abstract: Access to real-world data in robotics domains is often challenging due to restrictions on data sharing and limited availability. Although privacy and intellectual property concerns are the main barriers, ensuring data access is crucial for advancing data-driven models. Specifically, machine-learning-based inverse dynamic models show promising results for nonrigid robot identification, but the data used to train them are often kept private due to intellectual property protections. Federated learning proposes a methodology to access such data without centralizing them in a single repository, thus avoiding intellectual property limitations. We propose a solution that uses federated learning over distributed data to train a robust robotic arm inverse dynamic model. Our approach demonstrates the feasibility of a machine learning method in which local robots train on their own data while collaborating without sharing raw information. Furthermore, we propose a novel custom aggregation method that integrates locally learned solutions from different workspaces into a single global model without requiring raw data sharing. This method improves the accuracy of the learned inverse dynamic model in our federated solution by approximately 20%.
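The abstract does not detail the custom aggregation rule, so as background the sketch below shows the standard federated-averaging (FedAvg) baseline such methods build on: clients train locally and a server averages their parameters, weighted by local dataset size. All names and numbers are illustrative, and the paper's workspace-aware aggregation is not reproduced here.

```python
# FedAvg-style aggregation sketch (baseline only; the paper's custom
# workspace-aware aggregation method is not shown here).
import numpy as np

def fedavg(local_params, n_samples):
    """Average client parameter vectors, weighted by local dataset size."""
    w = np.asarray(n_samples, dtype=float)
    w /= w.sum()
    return sum(wi * p for wi, p in zip(w, local_params))

# three robots, each contributing a (toy) flattened parameter vector
clients = [np.random.randn(10) for _ in range(3)]
global_model = fedavg(clients, n_samples=[1200, 800, 500])
print(global_model.shape)  # raw data never leaves the robots
```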
|
| |
| 09:00-10:30, Paper ThI1I.332 | Add to My Program |
| ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo |
|
| Pu, Guo | Peking University |
| Han, Yixuan | Peking University |
| Lian, Zhouhui | Peking University |
Keywords: View Planning for SLAM, Mapping, Vision-Based Navigation
Abstract: Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robot/UAV navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on the Replica dataset demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.
|
| |
| 09:00-10:30, Paper ThI1I.333 | Add to My Program |
| Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects |
|
| Wang, Jiawei | University of California, San Diego |
| Wang, Dingyou | ShanghaiTech University |
| Hu, Jiaming | University of California, San Diego |
| Zhang, Qixuan | ShanghaiTech University |
| Xu, Lan | ShanghaiTech University |
| Yu, Jingyi | ShanghaiTech University |
Keywords: Semantic Scene Understanding, Perception for Grasping and Manipulation, Mechanism Design
Abstract: A deep understanding of kinematic structures is essential for robot motion and interaction with the environment. Such understanding is captured through articulated objects, which are essential for physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static 3D geometry. To achieve this, we combine Monte Carlo tree search (MCTS) for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid models. We evaluate Kinematify on diverse inputs from both synthetic environments and the real world, demonstrating improvements in registration and kinematic topology accuracy over prior work.
|
| |
| 09:00-10:30, Paper ThI1I.334 | Add to My Program |
| SoftMimic: Learning Compliant Whole-Body Control from Examples |
|
| Margolis, Gabriel | Massachusetts Institute of Technology |
| Wang, Michelle | Massachusetts Institute of Technology |
| Fey, Nolan | Massachusetts Institute of Technology |
| Agrawal, Pulkit | MIT |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Whole-Body Motion Planning and Control, Reinforcement Learning
Abstract: We introduce SoftMimic, a framework for learning compliant whole-body control policies for humanoid robots from example motions. Imitating human motions with reinforcement learning allows humanoids to quickly learn new skills, but existing methods incentivize stiff control that aggressively corrects deviations from a reference motion, leading to brittle and unsafe behavior when the robot encounters unexpected contacts. In contrast, SoftMimic enables robots to respond compliantly to external forces while maintaining balance and posture. Our approach leverages an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which we use to train a reinforcement learning policy. By rewarding the policy for matching compliant responses rather than rigidly tracking the reference motion, SoftMimic learns to absorb disturbances and generalize to varied tasks from a single motion clip. We validate our method through simulations and real-world experiments, demonstrating safe and effective interaction with the environment.
|
| |
| 09:00-10:30, Paper ThI1I.335 | Add to My Program |
| Graphite: A GPU-Accelerated Mixed-Precision Graph Optimization Framework |
|
| Gopinath, Shishir | Simon Fraser University |
| Dantu, Karthik | University at Buffalo |
| Ko, Steven | Simon Fraser University |
Keywords: Visual-Inertial SLAM, Mapping, Performance Evaluation and Benchmarking
Abstract: We present Graphite, a GPU-accelerated nonlinear least squares graph optimization framework. It provides a CUDA C++ interface to enable the sharing of code between a real-time application, such as a SLAM system, and its optimization tasks. The framework supports techniques to reduce memory usage, including in-place optimization, support for multiple floating point types and mixed-precision modes, and dynamically computed Jacobians. We evaluate Graphite on well-known bundle adjustment problems and find that it achieves similar performance to MegBA, a solver specialized for bundle adjustment, while maintaining generality and using less memory. We also apply Graphite to global visual-inertial bundle adjustment on maps generated from stereo-inertial SLAM datasets, and observe speed-ups of up to 59× compared to a CPU baseline. Our results indicate that our framework enables faster large-scale optimization on both desktop and resource-constrained devices.
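To make the mixed-precision trade-off concrete, the toy sketch below runs Gauss-Newton on a small nonlinear least-squares fit in float32 and float64. It is not Graphite's CUDA C++ API — only an illustration, under assumed values, of the precision/memory tension such a solver manages.

```python
# Toy illustration (not Graphite's API): Gauss-Newton on a small nonlinear
# least-squares problem, run in two floating-point precisions to show the
# kind of accuracy/memory trade-off a mixed-precision solver negotiates.
import numpy as np

def gauss_newton(x, y, theta, dtype, iters=20):
    theta = theta.astype(dtype)
    x, y = x.astype(dtype), y.astype(dtype)
    for _ in range(iters):
        a, b = theta
        r = a * np.exp(b * x) - y                                 # residuals
        J = np.stack([np.exp(b * x), a * x * np.exp(b * x)], axis=1)
        theta = theta - np.linalg.solve(J.T @ J, J.T @ r)         # GN step
    return theta, np.linalg.norm(r)

xs = np.linspace(0, 1, 50)
ys = 2.0 * np.exp(1.5 * xs)            # data from a = 2, b = 1.5
for dt in (np.float32, np.float64):
    sol, res = gauss_newton(xs, ys, np.array([1.0, 1.0]), dt)
    print(dt.__name__, sol, res)       # float32 halves memory, loses digits
```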
|
| |
| 09:00-10:30, Paper ThI1I.336 | Add to My Program |
| Error Fields: Personalized Robotic Training to Enhance Movement Accuracy across Speed |
|
| Borghi, Bruno | Shirley Ryan AbilityLab |
| Aghamohammadi, Naveed Reza | University of Illinois at Chicago |
| Cancrini, Adriana | Shirley Ryan AbilityLab, Chicago, IL |
| Ramirez, Arturo | University of Illinois Chicago |
| Celian, Courtney | Shirley Ryan AbilityLab (formerly the Rehabilitation Institute of Chicago) |
| Patton, James | University of Illinois Chicago / Shirley Ryan AbilityLab |
|
|
| |
| 09:00-10:30, Paper ThI1I.337 | Add to My Program |
| Scalar-Measurement Attitude Estimation on SO(3) with Bias Compensation |
|
| Melis, Alessandro | Laboratoire I3S UCA-CNRS |
| Bouazza, Tarek | Laboratoire I3S UCA-CNRS |
| Alnahhal, Hassan | Islamic University of Gaza |
| Benahmed, Sifeddine | Capgemini Engineering |
| Berkane, Soulaimane | University of Quebec in Outaouais |
| Hamel, Tarek | UNSA-CNRS |
Keywords: Sensor Fusion
Abstract: Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on SO(3) that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.
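For context, the classical full-vector complementary observer on SO(3) with gyro-bias compensation (in the style of Mahony et al.) takes the form below; this is background material, not the paper's observer, whose contribution is to retain such stability guarantees when the vector measurements are replaced by scalar ones.

```latex
% Classical vector-measurement attitude observer with bias compensation
% (background form; the paper's scalar-measurement observer differs).
\begin{aligned}
\dot{\hat{R}} &= \hat{R}\,\big(\omega_{\mathrm{gyro}} - \hat{b} + k_P\,\sigma\big)_{\times},
\qquad \dot{\hat{b}} = -k_I\,\sigma, \\
\sigma &= \sum_i k_i \,\big(\hat{R}^{\top} r_i\big) \times b_i ,
\end{aligned}
```

where the r_i are known reference directions, the b_i their measured body-frame counterparts, and (·)_× maps a vector to its skew-symmetric matrix.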
|
| |
| 09:00-10:30, Paper ThI1I.338 | Add to My Program |
| VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping |
|
| Zhu, Yuhan | University of California, Riverside |
| Zhang, Yanyu | University of California, Riverside |
| Xu, Jie | University of California, Riverside |
| Ren, Wei | University of California, Riverside |
Keywords: Probabilistic Inference, Localization, Mapping
Abstract: 3D Gaussian Splatting (3DGS) has shown promising results for 3D scene modeling using mixtures of Gaussians, yet its existing simultaneous localization and mapping (SLAM) variants typically rely on direct, deterministic pose optimization against the splat map, making them sensitive to initialization and susceptible to catastrophic forgetting as the map evolves. We propose Variational Bayesian Gaussian Splatting SLAM (VBGS-SLAM), a novel framework that couples splat-map refinement and camera-pose tracking in a generative probabilistic form. By leveraging conjugate properties of multivariate Gaussians and variational inference, our method admits efficient closed-form updates and explicitly maintains posterior uncertainty over both poses and scene parameters. This uncertainty-aware method mitigates drift and enhances robustness in challenging conditions, while preserving the efficiency and rendering quality of existing 3DGS. Our experiments demonstrate superior tracking performance and robustness in long-sequence prediction, alongside efficient, high-quality novel view synthesis across diverse synthetic and real-world scenes.
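The closed-form updates mentioned above rest on Gaussian conjugacy. The sketch below shows the basic building block — a closed-form posterior for a Gaussian state under a linear-Gaussian observation — with illustrative values; the VBGS-SLAM updates over poses and splat parameters are more elaborate than this.

```python
# Minimal conjugate-Gaussian update sketch (building block only, not the
# paper's full variational scheme): posterior of x ~ N(mu0, P0) given an
# observation z = H x + noise, noise ~ N(0, R).
import numpy as np

def gaussian_posterior(mu0, P0, H, R, z):
    S = H @ P0 @ H.T + R                      # innovation covariance
    K = P0 @ H.T @ np.linalg.inv(S)           # gain
    mu = mu0 + K @ (z - H @ mu0)              # posterior mean
    P = (np.eye(len(mu0)) - K @ H) @ P0       # posterior covariance
    return mu, P

mu, P = gaussian_posterior(np.zeros(2), np.eye(2),
                           np.array([[1.0, 0.5]]),
                           np.array([[0.1]]), np.array([0.7]))
print(mu, np.diag(P))   # posterior mean and per-axis uncertainty
```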
|
| |
| 09:00-10:30, Paper ThI1I.339 | Add to My Program |
| GSUC-VLM: Geometrically-Guided Spatial Understanding Chain of Vision Language Model for Autonomous Driving |
|
| Zhao, Yifan | Shanghai Jiao Tong University |
| Zheng, Ziyang | Shanghai Jiao Tong University |
| Chen, Congjia | Beihang University |
| Zhang, Shizhuo | Shanghai Jiao Tong University |
| Zhang, Huixin | Shanghai Jiao Tong University |
| Dai, Wenrui | Shanghai Jiao Tong University |
| He, Fan | Huawei |
| Xiong, Hongkai | Shanghai Jiao Tong University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: Robust spatial understanding is crucial for Visual Question Answering (VQA) in autonomous driving, which aims to enhance decision-making, reduce positional risks, and ensure road safety by providing answers based on the perception, prediction, and planning of driving scenarios. Despite remarkable success in semantic understanding of images and videos, existing Vision-Language Models (VLMs), as the prevailing paradigms for VQA, are limited in spatial understanding for multi-view scenes due to the lack of latent unified 3D reconstruction capability. They usually resort to additional spatial modalities such as point clouds or prior detection frameworks to enhance spatial understanding, but are still challenged by modality misalignment and degraded scalability. To overcome these limitations, in this paper, we propose a Geometrically-Guided Spatial Understanding Chain Framework (GSUC-VLM) for autonomous driving that leverages pretrained VLMs to jointly exploit semantic and spatial information in multi-view images. Specifically, we first design a dual-encoder architecture to fuse the semantic and spatial features separately extracted from multi-view images with a lightweight connector rather than introducing external spatial modalities. Subsequently, we align semantic and spatial features via a distillation loss to generate semantic tokens enriched with spatial information at the latent layer. Furthermore, we develop a projective feature conditioning method that incorporates camera intrinsic and extrinsic parameters to embed projection-matrix encoding into the input vectors, and introduce 3D position embeddings into the fusion layer to capture complex spatial relationships across multiple views in autonomous driving. Experimental results show that the proposed GSUC-VLM achieves state-of-the-art performance in VQA tasks while providing Chain-of-Thought (CoT) understanding. Remarkably, GSUC-VLM demonstrates strong generalization on zero-shot VQA tasks.
|
| |
| 09:00-10:30, Paper ThI1I.340 | Add to My Program |
| SAGrid: Scaling Robot Simulation through Automatic Affordance Annotation on In-The-Wild 3D Assets |
|
| Gokmen, Cem | Stanford University |
| Tur, Yalcin | Stanford University |
| Kumar, Aditesh | Stanford University |
| Nag, Auddithio | Stanford University |
| Fei-Fei, Li | Stanford University |
Keywords: Simulation and Animation, Deep Learning in Grasping and Manipulation, Data Sets for Robot Learning
Abstract: Robot simulation is a highly efficient approach for scaling data collection for robot learning, but scaling for most household tasks remains bottlenecked by a shortage of simulation-ready 3D assets. While modern robot simulators can model complex phenomena like temperature and fluids, most in-the-wild 3D models lack "simulation affordances" (specialized annotations such as fluid source and heat emitter positions) that are required for these features. As a result, costly manual annotation is required, severely limiting asset scale and variety. We introduce Simulation Affordance Grids (SAGrid), a method that automates the annotation of simulation affordances on in-the-wild 3D meshes. SAGrid leverages pretrained representations (DINOv2, TRELLIS) to predict a dense 3D distance field to the nearest affordance. Our approach operates effectively in a low-data regime, requiring as few as 10 training objects per affordance type to accurately locate these features. We validate our method by processing Objaverse-XL models and integrating them into the BEHAVIOR-1K simulator. Training robot policies on this automatically expanded asset suite significantly improves generalization to unseen objects in complex tasks, demonstrating that automated affordance annotation is crucial for scaling robot learning.
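The sketch below constructs the kind of regression target the abstract describes — a dense 3D field of distances to the nearest annotated affordance point — which a network conditioned on pretrained features would then be trained to predict. The grid resolution, extent, and example annotation are assumptions.

```python
# Sketch of a dense distance-field training target (assumed form): for each
# voxel, the distance to the nearest annotated affordance point.
import numpy as np
from scipy.spatial import cKDTree

def affordance_distance_field(affordance_pts, res=32, lo=-1.0, hi=1.0):
    axis = np.linspace(lo, hi, res)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    tree = cKDTree(affordance_pts)             # fast nearest-point queries
    dists, _ = tree.query(grid.reshape(-1, 3))
    return dists.reshape(res, res, res)

# e.g. a hypothetical "fluid source" annotated at a faucet tip
field = affordance_distance_field(np.array([[0.1, 0.0, 0.4]]))
print(field.shape, round(float(field.min()), 3))
```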
|
| |
| 09:00-10:30, Paper ThI1I.341 | Add to My Program |
| RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation |
|
| Wang, Yi Ru | University of Washington |
| Ung, Carter | University of Washington |
| Tan, Christopher | University of Washington |
| Tannert, Grant | University of Washington |
| Duan, Jiafei | University of Washington |
| Li, Josephine | University of Washington |
| Le, Anh | University of Washington |
| Oswal, Rishabh | University of Washington, Seattle |
| Grotz, Markus | University of Washington (UW) |
| Pumacay, Wilbert | Allen Institute for AI |
| Deng, Yuquan | University of Washington |
| Krishna, Ranjay | University of Washington |
| Fox, Dieter | University of Washington |
| Srinivasa, Siddhartha | University of Washington |
Keywords: Performance Evaluation and Benchmarking, Bimanual Manipulation, Imitation Learning
Abstract: We introduce RoboEval, a structured evaluation framework and benchmark for robotic manipulation that augments binary success with principled behavioral and outcome metrics. Existing evaluations often collapse performance into outcome counts, masking differences in execution quality and obscuring failure structure. RoboEval provides eight bimanual tasks with systematically controlled variations, more than three thousand expert demonstrations, and a modular simulation platform for reproducible experimentation. All tasks are instrumented with standardized metrics that quantify fluency, precision, and coordination, as well as outcome measures that trace stagewise progress and localize failure modes. Through extensive experiments with state-of-the-art visuomotor policies, we validate these metrics by analyzing their stability under variation, discriminative power across policies with similar success rates, and correlation with task success. Project Page: https://robo-eval.github.io
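As one concrete example of a fluency metric of the kind the benchmark standardizes, the sketch below computes a dimensionless squared-jerk score for an end-effector trajectory. The exact metric definitions RoboEval uses may differ, so treat this as an assumed illustration.

```python
# Dimensionless squared-jerk smoothness sketch (assumed definition, not
# necessarily RoboEval's exact formula): lower values mean smoother motion.
import numpy as np

def normalized_jerk(positions, dt):
    """Scale-invariant via duration^5 / path_length^2."""
    jerk = np.diff(positions, n=3, axis=0) / dt**3     # third derivative
    duration = dt * (len(positions) - 1)
    path_len = np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1))
    sq_jerk = np.sum(np.linalg.norm(jerk, axis=1) ** 2) * dt
    return 0.5 * duration**5 / path_len**2 * sq_jerk

t = np.linspace(0, 2, 200)
traj = np.stack([t, np.sin(t), np.zeros_like(t)], axis=1)
print(normalized_jerk(traj, dt=t[1] - t[0]))
```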
|
| |
| 09:00-10:30, Paper ThI1I.342 | Add to My Program |
| GSAT: Geometric Traversability Estimation Using Self-Supervised Learning with Anomaly Detection for Diverse Terrains |
|
| Cho, Dongjin | Inha University |
| Park, Miryeong | Inha University |
| Lee, Juhui | Inha University |
| Yang, Geonmo | Inha University |
| Cho, Younggun | Inha University |
Keywords: Semantic Scene Understanding, Field Robots, Mapping
Abstract: Safe autonomous navigation requires reliable estimation of environmental traversability. Traditional methods have relied on semantic or geometry-based approaches with human-defined thresholds, but these methods often yield unreliable predictions due to the inherent subjectivity of human supervision. While self-supervised approaches enable robots to learn from their own experience, they still face a fundamental challenge: the positive-only learning problem. To address these limitations, recent studies have employed Positive-Unlabeled (PU) learning, where the core challenge is identifying positive samples without explicit negative supervision. In this work, we propose GSAT, which addresses these limitations by constructing a positive hypersphere in latent space to classify traversable regions through anomaly detection without requiring additional prototypes (e.g., unlabeled or negative). Furthermore, our approach employs joint learning of anomaly classification and traversability prediction to more efficiently utilize robot experience. We comprehensively evaluate the proposed framework through ablation studies, validation on heterogeneous real-world robotic platforms, and autonomous navigation demonstrations in simulation environments. Our method is available at https://sparolab.github.io/research/gsat/.
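The positive-hypersphere idea can be illustrated compactly: embed experienced (positive) terrain, fit a center, and score new terrain by squared distance to that center, flagging far-away samples as anomalous, i.e. likely non-traversable. The sketch below uses a fixed embedding and a mean center for brevity; GSAT learns the embedding and traversability prediction jointly.

```python
# Deep-SVDD-style positive-hypersphere sketch (details assumed): anomaly
# score is squared distance to the center of traversed-terrain embeddings.
import numpy as np

def fit_center(pos_embeddings):
    return pos_embeddings.mean(axis=0)

def anomaly_score(embedding, center):
    return np.sum((embedding - center) ** 2)

pos = np.random.randn(500, 16) * 0.1      # embeddings of traversed cells
c = fit_center(pos)
print(anomaly_score(np.zeros(16), c))     # near center -> traversable
print(anomaly_score(np.ones(16), c))      # far from center -> anomalous
```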
|
| |
| 09:00-10:30, Paper ThI1I.343 | Add to My Program |
| SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction |
|
| Fu, Haoxiang | National University of Singapore |
| Zhang, Lingfeng | Tsinghua University |
| Li, Hao | Independent Researcher |
| Hu, Ruibing | The Chinese University of Hong Kong |
| Li, Zhengrong | The University of Manchester |
| Liu, Guanjing | Renmin University of China |
| Tan, Zimu | Independent Researcher |
| Chen, Long | Xiaomi EV |
| Ye, Hangjun | Xiaomi EV |
| Hao, Xiaoshuai | Xiaomi EV |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, Mapping
Abstract: High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEF-MAP, a Subspace-Expert Fusion framework for robust multi-modal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEF-MAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAP provides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
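A minimal sketch of the uncertainty-aware gating described above, under assumed shapes: each expert emits a mean and a predictive variance per BEV cell, and experts are down-weighted where their variance is high. The temperature and tensor layout are illustrative.

```python
# Uncertainty-aware expert gating sketch (assumed form): per-cell softmax
# over negative predictive variance, so unreliable experts lose weight.
import numpy as np

def gated_fusion(expert_means, expert_vars, tau=1.0):
    """expert_means / expert_vars: (n_experts, H, W, C)."""
    logits = -expert_vars.mean(axis=-1, keepdims=True) / tau
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)          # per-cell gating weights
    return (w * expert_means).sum(axis=0)

means = np.random.randn(4, 8, 8, 32)           # 4 experts on an 8x8 BEV grid
variances = np.abs(np.random.randn(4, 8, 8, 32))
fused = gated_fusion(means, variances)
print(fused.shape)                             # (8, 8, 32)
```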
|
| |
| 09:00-10:30, Paper ThI1I.344 | Add to My Program |
| Real-Time Decoding of Movement Onset and Offset for Brain-Controlled Rehabilitation Exoskeleton |
|
| Mitra, Kanishka | Massachusetts Institute of Technology |
| Kumar, Satyam | The University of Texas at Austin/Texas Instruments |
| Racz, Frigyes Samuel | The University of Texas at Austin |
| Liu, Deland Hu | The University of Texas at Austin |
| Deshpande, Ashish | The University of Texas at Austin/Meta Reality Labs |
| Millán, José del R. | The University of Texas at Austin |
Keywords: Brain-Machine Interfaces, Rehabilitation Robotics, Prosthetics and Exoskeletons
Abstract: Robot-assisted therapy can deliver high-dose, task-specific training after neurologic injury, but most systems act primarily at the limb level - engaging the impaired neural circuits only indirectly - which remains a key barrier to truly contingent, neuroplasticity-targeted rehabilitation. We address this gap by implementing online, dual-state motor imagery control of an upper-limb exoskeleton, enabling goal-directed reaches to be both initiated and terminated directly from noninvasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid-trajectory. Across two online sessions, group-mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start–stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class-driven bias induced by common task-based recentering using an asymmetric margin diagnostic, and we introduce a class-agnostic fixation-based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold-free separability (AUC gains: onset +56%, p=0.0117; offset +34%, p=0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention-driven start–stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.
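A minimal sketch of class-agnostic, fixation-based recentering as the abstract describes it (mechanics assumed): the drifting feature baseline is tracked by an exponential moving average updated only on fixation samples, so neither command class is sampled and the class geometry is left intact.

```python
# Fixation-based recentering sketch (assumed mechanics): an EMA baseline
# updated only during fixation/rest samples, subtracted from every feature.
import numpy as np

class FixationRecenter:
    def __init__(self, dim, alpha=0.01):
        self.mu = np.zeros(dim)
        self.alpha = alpha

    def update(self, feat, is_fixation):
        if is_fixation:                    # track drift on rest samples only
            self.mu = (1 - self.alpha) * self.mu + self.alpha * feat
        return feat - self.mu              # recentered feature

rc = FixationRecenter(dim=8)
for t in range(100):
    drift = 0.05 * t * np.ones(8)          # slow within-session drift
    y = rc.update(np.random.randn(8) + drift, is_fixation=(t % 3 == 0))
print(np.round(rc.mu[:3], 2))              # baseline follows the drift
```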
|
| |
| 09:00-10:30, Paper ThI1I.345 | Add to My Program |
| Design and Evaluation of a Variable Stiffness Module for an Open-End Tendon Antagonistic Actuator |
|
| Laohaphand, Lattawat | King Mongkut's University of Technology Thonburi |
| Pengwang, Eakkachai Ton | Institute of FIeld roBOtics (FIBO), King Mongkut’s University of Technology Thonburi |
|
|
| |
| 09:00-10:30, Paper ThI1I.346 | Add to My Program |
| Multi-Keypoint Affordance Representation for Functional Dexterous Grasping |
|
| Yang, Fan | Hunan University |
| Luo, Dongsheng | Hunan University |
| Chen, Wenrui | Hunan University |
| Lin, Jiacheng | Hunan University |
| Cai, Junjie | Hunan University |
| Yang, Kailun | Hunan University |
| Li, Zhiyong | Hunan University |
| Wang, Yaonan | College of Electric and Information Technology Engineering, Hunan University |
|
|
| |
| 09:00-10:30, Paper ThI1I.347 | Add to My Program |
| Learning Optimal Strategies for Needle Handover in Surgical Suturing |
|
| Kim, Cholin | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
| Yoon, Jeonghyeon | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
| Lee, Hakyun | MedInTech |
| Park, Sihyeoung | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
| Park, Hyojae | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
| Lee, Seungjun | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
| Yip, Michael | University of California San Diego (UCSD) |
| Hwang, Minho | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Surgical Robotics: Planning, Reinforcement Learning, Task Planning
Abstract: Automation of suturing subtasks, such as needle handover, has the potential to reduce surgeons' fatigue and improve surgical efficiency. Needle handover is particularly challenging due to the combinatorial nature of grasping and handover strategies, uncertainties in needle pose estimation, and inaccuracies inherent in cable-driven surgical robots such as the da Vinci system. In this work, we present a reinforcement learning framework for needle handover, spanning the process from initial pickup to a desired grasping state. We formulate the task as a goal-oriented planning problem and design a state–action representation that captures grasping and handover configurations. A DQN-based policy is trained with disturbances that reflect real-world kinematic errors to ensure robustness. The learned policy was validated on the da Vinci Research Kit (dVRK) and quantitatively compared with human teleoperation. Results demonstrate that our approach achieves human-level efficiency in terms of handover attempts (1.65 ± 0.50 vs. 1.62 ± 0.55), while improving consistency and joint-limit avoidance. The proposed framework demonstrates the potential of reinforcement learning for safe and reliable automation of surgical handover and points to opportunities for extending autonomy to more complex handover scenarios.
|
| |
| 09:00-10:30, Paper ThI1I.348 | Add to My Program |
| InvariantCloud: A Globally Invariant, Uniquely Indexed Point Cloud Framework for Robust 6-DoF Tactile Pose Tracking |
|
| Ye, Pengfei | HKUST |
| Ma, Yuxiang | Massachusetts Institute of Technology |
| Zhou, Yi | The Hong Kong University of Science and Technology |
| Chen, Wei | The Chinese University of Hong Kong |
| Dong, Wenzhen | The Chinese University of Hong Kong |
| Duan, Molong | Hong Kong University of Science and Technology |
Keywords: Force and Tactile Sensing, Contact Modeling, Perception for Grasping and Manipulation
Abstract: Recent advances in imitation learning and vision–language models highlight the need for high-fidelity tactile perception, with 6-DoF tactile object pose estimation providing a crucial foundation for precise robotic manipulation. We introduce InvariantCloud, a 6-DoF pose estimation framework that leverages the global invariance of surface marker constellations on vision-based tactile sensors. In contrast to recent approaches, our one-shot globally invariant point cloud registration suppresses cumulative drift and overcomes long-standing limitations in accurately estimating yaw (Z-axis) rotation. Experimental verifications show that InvariantCloud achieves superior yaw tracking accuracy and re-localization repeatability compared to existing benchmarks, demonstrating its precision and robustness in long-sequence manipulation tasks.
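Because the marker constellation is uniquely indexed, correspondences are known and a one-shot rigid registration has the classical closed-form SVD (Kabsch) solution sketched below. The framework's invariant-constellation construction itself is not reproduced here; names and values are illustrative.

```python
# Kabsch/SVD rigid registration with known correspondences (classical
# building block; not the paper's full pipeline).
import numpy as np

def kabsch(P, Q):
    """Best-fit R, t such that Q ~= R @ P + t, rows are matched points."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                  # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

P = np.random.randn(20, 3)                     # indexed marker constellation
R_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
Q = P @ R_true.T + np.array([0.1, 0.2, 0.0])   # displaced constellation
R, t = kabsch(P, Q)
print(np.allclose(R, R_true), np.round(t, 3))  # one-shot, drift-free estimate
```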
|
| |
| 09:00-10:30, Paper ThI1I.349 | Add to My Program |
| First, Learn What You Don't Know: Active Information Gathering for Driving at the Limits of Handling |
|
| Davydov, Alexander | Rice University |
| Djeumou, Franck | Rensselaer Polytechnic Institute |
| Greiff, Marcus | Toyota Research Institute |
| Suminaka, Makoto | Toyota Research Institute |
| Thompson, Michael | Toyota Research Institute |
| Subosits, John | Toyota Research Institute |
| Lew, Thomas | Toyota Research Institute |
Keywords: Model Learning for Control, Intelligent Transportation Systems
Abstract: Combining data-driven models that adapt online and model predictive control (MPC) has enabled effective control of nonlinear systems. However, when deployed on unstable systems, online adaptation may not be fast enough to ensure reliable simultaneous learning and control. For example, a controller on a vehicle executing highly dynamic maneuvers—such as drifting to avoid an obstacle—may push the vehicle’s tires to their friction limits, destabilizing the vehicle and allowing modeling errors to quickly compound and cause a loss of control. To address this challenge, we present an active information gathering framework for identifying vehicle dynamics as quickly as possible. We propose an expressive vehicle dynamics model that leverages Bayesian last-layer meta-learning to enable rapid online adaptation. The model’s uncertainty estimates are used to guide informative data collection and quickly improve the model prior to deployment. Dynamic drifting experiments on a Toyota Supra show that (i) the framework enables reliable control of a vehicle at the edge of stability, (ii) online adaptation alone may not suffice for zero-shot control and can lead to undesirable transient errors or spin-outs, and (iii) active data collection helps achieve reliable performance.
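A simplified sketch of the Bayesian last-layer mechanism: with fixed features, the posterior over last-layer weights is Gaussian in closed form, and its predictive variance can rank candidate excitation inputs for informative data collection. Feature dimensions, priors, and the selection rule are assumptions.

```python
# Bayesian last-layer sketch (assumed simplification of the paper's model):
# closed-form Gaussian posterior over last-layer weights; pick the candidate
# input whose features carry the largest predictive variance.
import numpy as np

def posterior(Phi, y, alpha=1.0, beta=25.0):
    """Bayesian linear regression over features Phi of shape (N, D)."""
    A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    Sigma = np.linalg.inv(A)
    mu = beta * Sigma @ Phi.T @ y
    return mu, Sigma

def predictive_var(phi, Sigma, beta=25.0):
    return 1.0 / beta + phi @ Sigma @ phi

Phi = np.random.randn(40, 8)                   # features of past maneuvers
y = Phi @ np.ones(8) + 0.2 * np.random.randn(40)
mu, Sigma = posterior(Phi, y)
candidates = np.random.randn(100, 8)           # candidate excitation inputs
best = max(candidates, key=lambda p: predictive_var(p, Sigma))
print(predictive_var(best, Sigma))             # most informative candidate
```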
|
| |
| 09:00-10:30, Paper ThI1I.350 | Add to My Program |
| ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-Centric Semantic Fusion |
|
| Zhang, Mingjie | The Hong Kong University of Science and Technology (Guangzhou) |
| Du, Yuheng | The Hong Kong University of Science and Technology (Guangzhou) |
| Wu, Chengkai | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhou, Jinni | Hong Kong University of Science and Technology (Guangzhou) |
| Qi, Zhenchao | University of Edinburgh |
| Ma, Jun | The Hong Kong University of Science and Technology |
| Zhou, Boyu | Southern University of Science and Technology |
Keywords: Search and Rescue Robots, Vision-Based Navigation, Autonomous Agents
Abstract: Navigating unknown environments to find a target object is a significant challenge. While semantic information is crucial for navigation, relying solely on it for decision-making may not always be efficient, especially in environments with weak semantic cues. Additionally, many methods are susceptible to misdetections, especially in environments with visually similar objects. To address these limitations, we propose ApexNav, a zero-shot object navigation framework that is both more efficient and reliable. For efficiency, ApexNav adaptively utilizes semantic information by analyzing its distribution in the environment, guiding exploration through semantic reasoning when cues are strong, and switching to geometry-based exploration when they are weak. For reliability, we propose a target-centric semantic fusion method that preserves long-term memory of the target object and similar objects, reducing false detections and minimizing task failures. We evaluate ApexNav on the HM3Dv1, HM3Dv2, and MP3D datasets, where it outperforms state-of-the-art methods in both SR and SPL metrics. Comprehensive ablation studies further demonstrate the effectiveness of each module. Furthermore, real-world experiments validate the practicality of ApexNav in physical environments.
|
| |
| 09:00-10:30, Paper ThI1I.351 | Add to My Program |
| Semantics-Aware Predictive Inspection Path Planning (I) |
|
| Dharmadhikari, Mihir Rahul | NTNU - Norwegian University of Science and Technology |
| Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Aerial Systems: Perception and Autonomy, Semantic Scene Understanding, Motion and Path Planning
Abstract: This paper presents a novel semantics-aware inspection path planning paradigm called "Semantics-aware Predictive Planning" (SPP). Industrial environments that require the inspection of specific objects or structures (called "semantics"), such as ballast water tanks inside ships, often present structured and repetitive spatial arrangements of the semantics of interest. Motivated by this, we first contribute an algorithm that identifies spatially repeating patterns of semantics - exact or inexact - in a semantic scene graph representation and makes predictions about the evolution of the graph in the unseen parts of the environment using these patterns. Furthermore, two inspection path planning strategies, tailored to ballast water tank inspection, that exploit these predictions are proposed. To assess the performance of the novel predictive planning paradigm, both simulation and experimental evaluations are performed. First, we conduct a simulation study comparing the method against relevant state-of-the-art techniques and further present tests showing its ability to handle imperfect patterns. Second, we deploy our method onboard a collision-tolerant aerial robot operating inside the ballast tanks of two real ships. The results, both in simulation and field experiments, demonstrate significant improvement over the state-of-the-art in terms of inspection time while maintaining equal or better semantic surface coverage.
|
| |
| 09:00-10:30, Paper ThI1I.352 | Add to My Program |
| Hysteresis-Aware Neural Network Modeling and Whole-Body Reinforcement Learning Control of Soft Robots |
|
| Chen, Zongyuan | Tsinghua University |
| Xia, Yan | Tsinghua University |
| Liu, Jiayuan | Tsinghua University |
| Liu, Jijia | Tsinghua University |
| Tang, Wenhao | Tsinghua University |
| Chen, Jiayu | Tsinghua University |
| Gao, Feng | Tsinghua University |
| Ma, Longfei | Tsinghua University |
| Liao, Hongen | Tsinghua University |
| Wang, Yu | Tsinghua University |
| Yu, Chao | Tsinghua University |
| Zhang, Boyu | Shanghai Jiao Tong University |
| Xing, Fei | Tsinghua University |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Medical Robots and Systems
Abstract: Soft robots are inherently compliant and safe, making them suitable for human-interactive applications such as surgery. However, their nonlinear and hysteretic behavior poses significant challenges for accurate modeling and control. We present a soft robotic system and propose a hysteresis-aware whole-body neural network model that accurately captures and predicts the soft robot’s whole-body motion, including hysteresis effects. Based on this model, we construct a highly parallel simulation environment for soft robot control and apply an on-policy reinforcement learning algorithm to efficiently train whole-body motion control policies. The trained policy is deployed on a real soft robot to evaluate its control performance, and it exhibits high precision in trajectory tracking tasks. Furthermore, we develop a soft robotic system for surgical applications and validate it through phantom-based laser ablation experiments. The results demonstrate that the proposed model significantly reduces prediction error compared to conventional methods. The overall framework shows strong performance in phantom-based surgical experiments, and demonstrates its potential for complex scenarios, including future real-world clinical applications.
|
| |
| 09:00-10:30, Paper ThI1I.353 | Add to My Program |
| Learned IMU Bias Prediction for Invariant Visual Inertial Odometry |
|
| Altawaitan, Abdullah | University of California San Diego |
| Stanley, Jason | University of California, San Diego |
| Ghosal, Sambaran | University of California San Diego |
| Duong, Thai | Rice University |
| Atanasov, Nikolay | University of California, San Diego |
Keywords: Localization, Aerial Systems: Applications, Deep Learning Methods
Abstract: Autonomous mobile robots operating in novel environments depend critically on accurate state estimation, often utilizing visual and inertial measurements. Recent work has shown that an invariant formulation of the extended Kalman filter improves the convergence and robustness of visual-inertial odometry by utilizing the Lie group structure of a robot's position, velocity, and orientation states. However, inertial sensors also require measurement bias estimation, yet introducing the bias in the filter state breaks the Lie group symmetry. In this paper, we design a neural network to predict the bias of an inertial measurement unit (IMU) from a sequence of previous IMU measurements. This allows us to use an invariant filter for visual inertial odometry, relying on the learned bias prediction rather than introducing the bias in the filter state. We demonstrate that an invariant multi-state constraint Kalman filter (MSCKF) with learned bias predictions achieves robust visual-inertial odometry in real experiments, even when visual information is unavailable for extended periods and the system needs to rely solely on IMU measurements.
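A minimal sketch of the learned bias predictor's interface (architecture assumed; the paper's network may differ): a window of raw IMU samples maps to a 6-D gyro/accel bias estimate, which is subtracted from measurements so the filter state can remain on the bias-free Lie group.

```python
# Assumed-architecture sketch of an IMU-window-to-bias regressor; only the
# input/output interface reflects the abstract, the layers are illustrative.
import torch
import torch.nn as nn

class ImuBiasNet(nn.Module):
    def __init__(self, window=50, imu_dim=6, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, window * imu_dim)
            nn.Linear(window * imu_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),                # [gyro bias; accel bias]
        )

    def forward(self, imu_window):               # (B, window, imu_dim)
        return self.net(imu_window)

net = ImuBiasNet()
bias = net(torch.randn(4, 50, 6))                # batch of IMU windows
print(bias.shape)                                # torch.Size([4, 6])
```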
|
| |
| 09:00-10:30, Paper ThI1I.354 | Add to My Program |
| TCB-VIO: Tightly-Coupled Focal-Plane Binary-Enhanced Visual Inertial Odometry |
|
| Lisondra, Matthew | University of Toronto |
| Kim, Junseo | Delft University of Technology |
| Shimoda, Takashi Glenn | University of Toronto |
| Zareinia, Kourosh | Toronto Metropolitan University |
| Saeedi, Sajad | University College London |
Keywords: Sensor Fusion, Visual-Inertial SLAM
Abstract: Vision algorithms can be executed directly on the image sensor when implemented on the next-generation sensors known as focal-plane sensor-processor arrays (FPSPs), where every pixel has a processor. FPSPs greatly improve latency, reducing the problems associated with the bottleneck of data transfer from a vision sensor to a processor. FPSPs accelerate vision-based algorithms such as visual-inertial odometry (VIO). However, VIO frameworks suffer from spatial drift due to the vision-based pose estimation, whilst temporal drift arises from the inertial measurements. FPSPs circumvent the spatial drift by operating at a high frame rate to match the high-frequency output of the inertial measurements. In this paper, we present TCB-VIO, a tightly-coupled 6-degrees-of-freedom VIO based on a Multi-State Constraint Kalman Filter (MSCKF), operating at a high frame rate of 250 FPS with IMU measurements obtained at 400 Hz. TCB-VIO outperforms state-of-the-art methods: ROVIO, VINS-Mono, and ORB-SLAM3.
|
| |
| 09:00-10:30, Paper ThI1I.355 | Add to My Program |
| Continual Learning for Traversability Prediction with Uncertainty-Aware Adaptation |
|
| Lee, Hojin | Ulsan National Institute of Science and Technology |
| Lee, Yunho | Department of Mechanical Engineering, Ulsan National Institute of Science and Technology (UNIST) |
| Duecker, Daniel Andre | Technical University of Munich (TUM) |
| Kwon, Cheolhyeon | Ulsan National Institute of Science and Technology |
|
|
| |
| 09:00-10:30, Paper ThI1I.356 | Add to My Program |
| S-Graphs 2.0 – a Hierarchical-Semantic Optimization and Loop Closure for SLAM |
|
| Bavle, Hriday | University of Luxembourg |
| Sanchez-Lopez, Jose Luis | University of Luxembourg |
| Shaheer, Muhammad | University of Luxembourg |
| Civera, Javier | Universidad De Zaragoza |
| Voos, Holger | University of Luxembourg |
Keywords: SLAM, Mapping, Localization
Abstract: The hierarchical nature of 3D scene graphs aligns well with the structure of man-made environments, making them highly suitable for representation purposes. Beyond this, however, their embedded semantics and geometry could also be leveraged to improve the efficiency of map and pose optimization, an opportunity that has been largely overlooked by existing methods. We introduce Situational Graphs 2.0 (S-Graphs 2.0), which effectively uses the hierarchical structure of indoor scenes for efficient data management and optimization. Our approach builds a four-layer situational graph comprising Keyframes, Walls, Rooms, and Floors. Our first contribution lies in the front-end, which includes a floor detection module capable of identifying stairways and assigning floor-level semantic relations to the underlying layers (Keyframes, Walls, and Rooms). Floor-level semantics allows us to propose a floor-based loop closure strategy, which effectively rejects false positive closures that typically appear due to aliasing between different floors of a building. Our second novelty lies in leveraging our representation hierarchy in the optimization. Our proposal consists of: (1) local optimization over a window of recent keyframes and their connected components across the four representation layers, (2) floor-level global optimization, which focuses only on keyframes and their connections within the current floor during loop closures, and (3) room-level local optimization, marginalizing redundant keyframes that share observations within the room, which reduces the computational footprint. We validate our algorithm extensively in different real multi-floor environments. Our approach shows state-of-the-art accuracy metrics in large-scale multi-floor environments, estimating hierarchical representations up to 10x faster, on average, than competing baselines.
|
| |
| 09:00-10:30, Paper ThI1I.357 | Add to My Program |
| Long-Term Human Motion Prediction Using Spatio-Temporal Maps of Dynamics |
|
| Zhu, Yufei | Örebro University |
| Rudenko, Andrey | TU Munich |
| Kucner, Tomasz Piotr | Aalto University |
| Lilienthal, Achim J. | TU Munich |
| Magnusson, Martin | Örebro University |
Keywords: Human Detection and Tracking, Autonomous Agents
Abstract: Long-term human motion prediction (LHMP) is important for the safe and efficient operation of autonomous robots and vehicles in environments shared with humans. Accurate predictions are important for applications including motion planning, tracking, human-robot interaction, and safety monitoring. In this paper, we exploit Maps of Dynamics (MoDs), which encode spatial or spatio-temporal motion patterns as environment features, to achieve LHMP for horizons of up to 60 seconds. We propose an MoD-informed LHMP framework that supports various types of MoDs and includes a ranking method to output the most likely predicted trajectory, improving practical utility in robotics. Further, a time-conditioned MoD is introduced to capture motion patterns that vary across different times of day. We evaluate MoD-LHMP instantiated with three types of MoDs. Experiments on two real-world datasets show that the MoD-informed method outperforms learning-based ones, with up to 50% improvement in average displacement error, and the time-conditioned variant achieves the highest accuracy overall.
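A toy sketch of how an MoD supports long-horizon prediction: represent the map as a grid whose cells store a locally preferred mean velocity, and roll a predicted trajectory forward through it. Real MoDs encode full (possibly time-conditioned) velocity distributions and the framework ranks multiple sampled trajectories; all names and numbers here are illustrative.

```python
# MoD-informed rollout sketch (assumed representation): follow per-cell
# preferred velocities to produce a long-horizon trajectory prediction.
import numpy as np

def rollout_mod(start, mod_grid, cell=1.0, dt=0.5, horizon=120):
    """mod_grid[i, j] = preferred (vx, vy) in that cell."""
    traj, p = [start.copy()], start.copy()
    for _ in range(horizon):
        idx = (p // cell).astype(int).clip(0, np.array(mod_grid.shape[:2]) - 1)
        p = p + dt * mod_grid[idx[0], idx[1]]
        traj.append(p.copy())
    return np.array(traj)

grid = np.zeros((20, 20, 2))
grid[..., 0] = 0.8                             # learned flow pointing +x
traj = rollout_mod(np.array([1.0, 10.0]), grid)
print(traj[-1])                                # 120 steps of 0.5 s ~= 60 s horizon
```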
|
| |
| 09:00-10:30, Paper ThI1I.358 | Add to My Program |
| A Closed-Chain Approach to Generating Affordance Joint Trajectories for Robotic Manipulators |
|
| Panthi, Janak | The University of Texas at Austin |
| Alambeigi, Farshid | The University of Texas at Austin |
| Pryor, Mitchell | The University of Texas at Austin |
Keywords: Manipulation Planning, Mechanism Design, Kinematics, Affordance
Abstract: Robots operating in unpredictable environments require versatile, hardware-agnostic frameworks capable of adapting to various tasks. While a recent screw-based affordance approach shows promise, it faces challenges in avoiding undesirable configurations, singularity navigation, and task success prediction. To address these limitations, we propose a novel framework that incorporates gripper orientation control and generates complete joint trajectories in real time for screw-based task-affordance execution. Our method models the affordance and manipulator as a closed-chain mechanism, introducing an innovative approach to solving closed-chain inverse kinematics. It encapsulates task constraints and simplifies task definitions, while remaining hardware and robot agnostic, robust to errors, and invariant to the initial grasp. We validate our framework with simulations on a UR5 robot and real-world implementation on a Boston Dynamics Spot robot. Our experiments demonstrate rapid joint trajectory generation (0.0077–0.098 s) for various tasks, including a 420-degree valve turn with consideration of the gripper orientation. Comparison with state-of-the-art methods shows a 4x improvement in planning time, reduced joint movement, and completion of more task goals. Video demonstrations and the open-source code for this project are available online.
|
| |
| 09:00-10:30, Paper ThI1I.359 | Add to My Program |
| Towards Indirect Data-Driven Predictive Control for Heating Phase of Thermoforming Process (I) |
|
| Hosseinionari, Hadi | The University of British Columbia |
| Bajelani, Mohammad | The University of British Columbia |
| van Heusden, Klaske | University of British Columbia |
| S. Milani, Abbas | The University of British Columbia |
| Seethaler, Rudolf | The University of British Columbia |
Keywords: Manufacturing, Maintenance and Supply Chains, Additive Manufacturing, Factory Automation
Abstract: Shaping thermoplastic sheets into three-dimensional products is challenging since overheating results in failed manufactured parts and wasted material. To this end, we propose an indirect data-driven predictive control approach using Model Predictive Control capable of handling temperature constraints and heating-power saturation while delivering enhanced precision and overshoot control compared to state-of-the-art methods. We employ a Non-linear Auto-Regressive with eXogenous inputs (NARX) model, which is linearized to define a linear control-oriented model at each operating point. Using a high-fidelity simulator, several simulation studies have been conducted to evaluate the proposed method's robustness and performance under parametric uncertainty, indicating overshoot and average steady-state error of less than 2 °C and 0.9 °C (7 °C and 2 °C) for the nominal (worst-case) scenario. Finally, we applied the proposed method to a lab-scale thermoforming platform, yielding a response close to the simulation analysis, with overshoot and average steady-state error of 5.3 °C and 1 °C, respectively.
|
| |
| 09:00-10:30, Paper ThI1I.360 | Add to My Program |
| LMC-VIO: Lane Model-Constrained Monocular Inertial Visual SLAM for High-Precision Localization in Highway Scenes |
|
| Luo, Man | Dongfeng USharing Technology Co., Ltd |
| Yan, Maosheng | Wuhan University |
| Guo, Yuan | Wuhan University of Technology |
| Li, Bijun | Wuhan University |
| Zhou, Jian | Wuhan University |
Keywords: Localization, Visual-Inertial SLAM, Autonomous Vehicle Navigation
Abstract: Continuous, stable localization is a core module of autonomous driving systems and is particularly important to their performance. However, as vehicle speed increases, localization errors may be amplified, introducing deviations in the positioning consistency of the system. The high speeds and motion constraints inherent to highway environments pose new challenges for feature matching, particularly in vision-based vehicle localization, where initialization and scale-estimation biases are further amplified. Lane markings, characterized by simple, uniform structures that are highly distinct from the surrounding environment, serve as effective features for matching-based localization in autonomous driving. This paper introduces a high-precision and robust vehicle localization method based on lane model constraints. First, leveraging lane model parameters from prior maps, we track and model lane line detections across consecutive frames to enhance the completeness of the lane representation. The tracking results, combined with prior map data on lane widths, are employed to optimize scale parameters. Subsequently, real-time detected lanes are matched with prior maps through point-map association to constrain the vehicle's heading angle. Finally, map matching results are integrated into existing visual local odometry methods to perform real-time localization optimization, thereby improving localization performance. Experimental evaluations on a self-collected highway dataset demonstrate that incorporating lane models significantly enhances system localization accuracy.
|
| |
| 09:00-10:30, Paper ThI1I.361 | Add to My Program |
| Generating and Optimizing Topologically Distinct Guesses for Mobile Manipulator Path Planning with Path Constraints |
|
| Wong, Rufus Cheuk Yin | KTH Royal Institute of Technology |
| Sewlia, Mayank | KTH Royal Institute of Technology |
| Wiltz, Adrian | KTH |
| Dimarogonas, Dimos V. | KTH Royal Institute of Technology |
Keywords: Motion and Path Planning, Constrained Motion Planning, Mobile Manipulation
Abstract: Optimal path planning is prone to convergence to local, rather than global, optima. This is often the case for mobile manipulators due to nonconvexities induced by obstacles, robot kinematics and constraints. This paper focuses on planning under end effector path constraints and attempts to circumvent the issue of converging to a local optimum. We propose a pipeline that first discovers multiple homotopically distinct paths, and then optimizes them to obtain multiple distinct local optima. The best out of these distinct local optima is likely to be close to the global optimum. We demonstrate that our pipeline is able to circumvent the problem of local optima and produces a final local optimum that is close to the global optimum.
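A toy sketch of the optimize-distinct-guesses stage: seed a local optimizer with one initial path per homotopy class (here, above and below a disk obstacle in 2D) and keep the best resulting local optimum. The cost, obstacle, and parametrization are illustrative; the actual pipeline handles mobile-manipulator kinematics and end-effector path constraints.

```python
# Multiple topologically distinct guesses -> multiple local optima sketch
# (illustrative 2-D stand-in for the paper's mobile-manipulator setting).
import numpy as np
from scipy.optimize import minimize

START, GOAL = np.array([0.0, 0.0]), np.array([1.0, 0.0])
OBS, RADIUS = np.array([0.5, 0.0]), 0.3

def path_cost(interior_flat):
    pts = np.vstack([START, interior_flat.reshape(-1, 2), GOAL])
    length = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    clearance = np.linalg.norm(pts - OBS, axis=1)
    penalty = np.sum(np.maximum(0.0, RADIUS - clearance) ** 2)
    return length + 100.0 * penalty            # short paths, stay clear

def guess_through(via, n=14):
    """Interior waypoints of a two-segment path through a via point."""
    s = np.linspace(0, 1, n // 2)[:, None]
    pts = np.vstack([(1 - s) * START + s * via, (1 - s) * via + s * GOAL])
    return pts.ravel()

guesses = [guess_through(np.array([0.5, 0.6])),    # homotopy: above obstacle
           guess_through(np.array([0.5, -0.6]))]   # homotopy: below obstacle
results = [minimize(path_cost, g) for g in guesses]
print([round(r.fun, 3) for r in results])          # keep the best local optimum
```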
|
| |
| 09:00-10:30, Paper ThI1I.362 | Add to My Program |
| DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes |
|
| Jiang, Jiajun | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhu, Yiming | The Hong Kong University of Science and Technology (Guangzhou) |
| Wu, Zirui | The Hong Kong University of Science and Technology (Guangzhou) |
| Song, Jie | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Mapping, Semantic Scene Understanding, Vision-Based Navigation
Abstract: We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation. Project page: https://eku127.github.io/DualMap/
|
| |
| 09:00-10:30, Paper ThI1I.363 | Add to My Program |
| MLLM-Fabric: Multimodal Large Language Model-Driven Robotic Framework for Fabric Sorting and Selection |
|
| Wang, Liman | University of York |
| Zhong, Hanyang | University of York |
| Wang, Tianyuan | University of York |
| Luo, Shan | King's College London |
| Zhu, Jihong | University of York |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, Deep Learning in Grasping and Manipulation
Abstract: Choosing appropriate fabrics is critical for meeting functional and quality demands in robotic textile manufacturing, apparel production, and smart retail. We propose MLLM-Fabric, a robotic framework leveraging multimodal large language models (MLLMs) for fabric sorting and selection. Built on a multimodal robotic platform, the system is trained through supervised fine-tuning and explanation-guided distillation to rank fabric properties. We also release a dataset of 220 diverse fabrics, each with RGB images and synchronized visuotactile and pressure data. Experiments show that our Fabric-Llama-90B consistently outperforms pretrained vision-language baselines in both attribute ranking and selection reliability.
|
| |
| 09:00-10:30, Paper ThI1I.364 | Add to My Program |
| Multi-View Stereo with Geometric Encoding for Large-Scale Dense Scene Reconstruction (I) |
|
| Yang, Guidong | The Chinese University of Hong Kong |
| Cao, Rui | The Chinese University of Hong Kong |
| Wen, Junjie | The Chinese University of Hong Kong |
| Zhao, Benyun | The Chinese University of Hong Kong |
| Li, Qingxiang | The Chineses University of Hong Kong |
| Chen, Xi | The Chinese University of Hong Kong |
| Liu, Yunhui | Chinese University of Hong Kong |
| Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Computer Vision for Automation
Abstract: Multi-view stereo (MVS) implicitly encodes photometric and geometric cues into the cost volume for multi-view correspondence matching, but conveys insufficient geometric cues essential to depth estimation and reconstruction. This paper proposes GE-MVS, a novel multi-view stereo network with geometric encoding for more accurate and complete depth estimation and point cloud reconstruction. First, the cross-view adaptive cost volume aggregation module is proposed to strengthen the encoding of multi-view geometric cues during cost volume construction. Then, depth consistency optimization is performed in 3D point space during learning by invoking ground-truth depth cues from adjacent views. Finally, the surface normal geometries are explicitly encoded to refine the sampled depth hypotheses to be consistent in local neighboring regions. Extensive experiments on standard MVS benchmarks including DTU, Tanks and Temples, and BlendedMVS demonstrate the state-of-the-art depth estimation and point cloud reconstruction performance of GE-MVS. GE-MVS is further deployed in real-world experiments for UAV-based large-scale reconstruction, where our method outperforms prevalent industrial reconstruction solutions in terms of reconstruction efficiency and effectiveness.
|
| |
| 09:00-10:30, Paper ThI1I.365 | Add to My Program |
| PPF: Pre-Training and Preservative Fine-Tuning of Humanoid Locomotion Via Model-Assumption-Based Regularization |
|
| Jung, Hyunyoung | Georgia Institute of Technology |
| Gu, Zhaoyuan | Georgia Institute of Technology |
| Zhao, Ye | Georgia Institute of Technology |
| Park, Hae-Won | Korea Advanced Institute of Science and Technology |
| Ha, Sehoon | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Continual Learning
Abstract: Humanoid locomotion is a challenging task due to its inherent complexity and high-dimensional dynamics, as well as the need to adapt to diverse and unpredictable environments. In this work, we introduce a novel learning framework for effectively training a humanoid locomotion policy that imitates the behavior of a model-based controller while extending its capabilities to handle more complex locomotion tasks, such as more challenging terrain and higher velocity commands. Our framework consists of three key components: pre-training through imitation of the model-based controller, fine-tuning via reinforcement learning, and model-assumption-based regularization (MAR) during fine-tuning. In particular, MAR aligns the policy with actions from the model-based controller only in states where the model assumption holds to prevent catastrophic forgetting. We evaluate the proposed framework through comprehensive simulation tests and hardware experiments on a full-size humanoid robot, Digit, demonstrating a forward speed of 1.5 m/s and robust locomotion across diverse terrains, including slippery, sloped, uneven, and sandy terrains.
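The MAR term lends itself to a compact illustration. The sketch below is one plausible reading of the abstract, assuming a per-state boolean mask indicating where the model assumption holds; it is not the authors' implementation:

```python
import torch

def mar_loss(policy_actions, controller_actions, assumption_holds, weight=0.5):
    """Model-assumption-based regularization, sketched from the abstract:
    pull the policy toward the model-based controller's actions only in
    states where the model assumption holds (assumption_holds == 1)."""
    per_state = ((policy_actions - controller_actions) ** 2).sum(dim=-1)
    return weight * (assumption_holds * per_state).mean()

# Hypothetical fine-tuning step: total = rl_loss + mar_loss(pi_a, mb_a, mask)
loss = mar_loss(torch.randn(32, 12), torch.randn(32, 12),
                (torch.rand(32) > 0.5).float())
```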
|
| |
| 09:00-10:30, Paper ThI1I.366 | Add to My Program |
| EES: A Data-Driven End-To-End Escorting System Via Spatiotemporal Feature Fusion |
|
| Yu, Youjin | National University of Defense Technology |
| Li, Junxiang | National University of Defense Technology |
| Li, Bowen | National University of Defense Technology |
| Wu, Tao | National University of Defense Technology |
| Zhao, Huijing | Peking University |
Keywords: Human-Robot Collaboration, Autonomous Vehicle Navigation, Motion and Path Planning
Abstract: This letter presents a technique that allows unmanned vehicles to escort a human to their destinations. Current human-centered following methods depend solely on human movement, which presents significant limitations. The complexity of human movement during tactical maneuvers can lead to erratic vehicle motion. Additionally, the static relative positioning between the human and vehicle creates a rigid following pattern, thereby constraining the vehicle’s ability to dynamically adjust its position for optimal coverage. To address these limitations, we propose a data-driven end-to-end escorting system (EES) that takes into account both environmental information and human movement to achieve adaptive escorting. We propose a soft-coding paradigm to replace the traditional hard-coding intent modeling to address the inconsistency of human intention and vehicle motion, and establish human-scene following through a cross-modal attention gating network. We conducted experiments in the CARLA simulation and the real world. The results demonstrate that the proposed EES reduces prediction errors by 41.2% during overall processes and by 54.5% during cornering. Additionally, EES can adapt to various positions and dynamically adjust the relative positions between humans and unmanned systems to adapt to complex scenarios.
|
| |
| 09:00-10:30, Paper ThI1I.367 | Add to My Program |
| S^2-Diffusion: Generalizing from Instance-Level to Category-Level Skills in Robot Manipulation |
|
| Yang, Quantao | KTH Royal Institute of Technology |
| Welle, Michael C. | KTH Royal Institute of Technology |
| Kragic, Danica | KTH |
| Andersson, Olov | KTH Royal Institute of Technology |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Recent advances in skill learning have propelled robot manipulation to new heights by enabling robots to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment instances that are shown in the training data, and have trouble transferring to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S^2-Diffusion) which enables generalization from instance-level training data to the category level, making skills transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks to allow the use of only a single RGB camera. Our approach is evaluated and compared on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S^2-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfying performance on other instances within the same category, even if it was not trained on that specific instance. Project website: https://s2-diffusion.github.io.
|
| |
| 09:00-10:30, Paper ThI1I.368 | Add to My Program |
| Real-Time Communication Relay Planning with a Low-Complexity Network Quality Prediction Model in Dynamic Indoor Missions |
|
| Seo, Jaemin | UNIST |
| Kim, Jongyun | Cranfield University |
| Kim, Seunghwan | UNIST |
| Kim, Changseung | Ulsan National Institute of Science and Technology |
| Shin, Woojae | Korea Advanced Institute of Science and Technology |
| Oh, Hyondong | KAIST |
Keywords: Multi-Robot Systems, Networked Robots, Path Planning for Multiple Mobile Robots or Agents
Abstract: Relay robots are crucial for extending communication when a client robot performs long-range missions. However, existing network quality prediction models and relay planning methods often struggle with real-time operation due to their high computational cost and poor adaptability to frequently changing missions. To address this, we propose a real-time communication relay system featuring two key contributions. First, a low-complexity network quality prediction model using Kalman filter-based Gaussian process regression achieves efficient online inference with constant-time updates (~0.02s). Second, a hierarchical relay planning strategy, employing a Monte Carlo tree search-based sequential planner, generates communication-aware trajectories satisfying network constraints at discrete steps. Real-world experiments validate our system's effectiveness, demonstrating near-continuous network availability (99.1% channel reliability) and boosting the packet delivery ratio from a baseline of 44.7% to 73.7%. Our integrated approach offers a practical and robust solution for dynamic indoor missions.
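The constant-time recursive prediction the abstract highlights can be illustrated with a scalar Kalman filter over a link-quality signal; this is a simplified stand-in for the paper's Kalman-filter-based Gaussian process regression, with illustrative noise parameters:

```python
import numpy as np

class LinkQualityKF:
    """Constant-time recursive estimator for a scalar network-quality signal
    (e.g., RSSI). A simplified sketch of the idea, not the paper's model;
    all parameters here are illustrative."""

    def __init__(self, x0=-60.0, p0=25.0, q=0.5, r=4.0):
        self.x, self.p = x0, p0   # state estimate and variance
        self.q, self.r = q, r     # process and measurement noise

    def update(self, z):
        self.p += self.q                 # predict (random-walk model)
        k = self.p / (self.p + self.r)   # Kalman gain
        self.x += k * (z - self.x)       # correct with measurement z
        self.p *= (1.0 - k)
        return self.x, self.p

kf = LinkQualityKF()
for z in np.random.normal(-62.0, 2.0, size=10):
    mean, var = kf.update(z)
```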
|
| |
| 09:00-10:30, Paper ThI1I.369 | Add to My Program |
| PGP-DOR: A Point-Grid-Point Scheme for Efficient Dynamic Object Removal |
|
| Wang, Shuo | National University of Defense Technology |
| Sun, Zhenping | National University of Defense Technology |
| Xue, Hanzhang | National University of Defense Technology |
| Liu, Bokai | National University of Defense Technology |
| Fu, Hao | National University of Defense Technology |
| Luo, YinFu | National University of Defense Technology |
Keywords: Object Detection, Segmentation and Categorization, Mapping, Probabilistic Inference
Abstract: In the field of autonomous driving, constructing high-precision maps, typically represented as 3D point cloud maps or bird's-eye view (BEV) grid maps, is essential for both offline and online applications. However, the presence of dynamic objects within a scene can introduce artifacts and noise that significantly degrade the quality of these maps. To address this challenge, we propose a method in this paper that can accurately identify those dynamic objects in both online and offline settings. Our approach fully exploits the spatio-temporal attributes of BEV grid maps and utilizes a point-grid-point (PGP) scheme to identify moving objects at both the 3D point cloud level and the 2D BEV grid level. Experimental results from public datasets, as well as a self-collected dataset, demonstrate that our method consistently outperforms state-of-the-art approaches in dynamic object removal in both online and offline contexts. The code and the newly introduced dataset will be made publicly available at: https://anonymous.4open.science/r/PGP-DOR-0686.
|
| |
| 09:00-10:30, Paper ThI1I.370 | Add to My Program |
| Real-Time Linear MPC for Quadrotors on SE(3): An Analytical Koopman-Based Realization |
|
| Rajkumar, Santosh Mohan | The Ohio State University |
| Yang, Chengyu | University of Illinois Urbana-Champaign |
| Gu, Yuliang | UIUC |
| Cheng, Sheng | University of Illinois Urbana-Champaign |
| Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
| Goswami, Debdipta | The Ohio State University |
Keywords: Aerial Systems: Mechanics and Control, Underactuated Robots, Optimization and Optimal Control
Abstract: This letter presents an analytical linear parameter-varying (LPV) representation of quadrotor dynamics utilizing Koopman theory, facilitating computationally efficient linear model predictive control (LMPC) for real-time trajectory tracking. By leveraging carefully designed Koopman observables, the proposed approach enables a compact lifted-space evolution that mitigates the curse of dimensionality while preserving the nonlinear characteristics of the system. Although model predictive control (MPC) is a powerful strategy for quadrotor control, it faces a trade-off between the high computational cost of nonlinear MPC (NMPC) and the reduced accuracy of LMPC. To address this gap, we introduce KQ-LMPC (Koopman Quasilinear LPV MPC), which leverages the Koopman-lifted LPV formulation to enforce constraints, ensure lower computational burden and real-time feasibility, and deliver tracking performance comparable to NMPC. Experimental validation confirms the effectiveness of the framework in reasonably agile flight. To the best of our knowledge, this is the first experimentally validated LMPC for quadrotors that employs analytically derived Koopman observables without requiring training data.
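The lifting idea behind Koopman-based LMPC can be sketched on a toy system; the observables and matrices below are placeholders for illustration only, not the analytically derived SE(3) observables of the paper:

```python
import numpy as np

def lift(x):
    """Toy Koopman observables for a 1D pendulum-like state [angle, rate];
    the paper instead derives analytical observables for SE(3) dynamics."""
    th, w = x
    return np.array([th, w, np.sin(th), w * np.cos(th)])

# In the lifted coordinates the model is (quasi)linear, z+ = A z + B u,
# so constraints and tracking can be handled by a standard linear MPC / QP.
A = np.eye(4)                 # placeholder lifted dynamics matrix
B = 0.01 * np.ones((4, 1))    # placeholder input matrix
z = lift(np.array([0.1, 0.0]))
z_next = A @ z + B @ np.array([2.0])
```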
|
| |
| 09:00-10:30, Paper ThI1I.371 | Add to My Program |
| Distributed Pose Graph Optimization Via Contractive Belief Sharing |
|
| Liu, Xiangyu | University of Cyprus |
| Chli, Margarita | ETH Zurich & University of Cyprus |
Keywords: Multi-Robot SLAM, Localization, SLAM
Abstract: Following the relative maturity of single-robot Simultaneous Localization And Mapping (SLAM) techniques, works addressing collaborative SLAM have started emerging lately. Driven by the need for robust and scalable multi-robot systems, the community has been targeting Distributed Pose Graph Optimization (DPGO), with current DPGO methods falling into two categories: optimization-based methods providing favorable convergence properties at the expense of excessive communication rounds among participants, and belief-propagation methods that exhibit better scalability and faster computation, albeit risking divergence on loopy and noisy graphs. Inspired by the need for more effective DPGO techniques, this work introduces Contractive Belief Sharing (CBS), a two-stage message-passing algorithm that combines Maximum-A-Posteriori (MAP) optimization and belief propagation with a Hellinger-distance-based damping rule. In this way, CBS ensures fast and reliable convergence while maintaining fully distributed computation and communication with neighbors only. Experiments on benchmarks show that CBS converges substantially faster and is more efficient and scalable than state-of-the-art methods while maintaining high trajectory accuracy, opening up new capabilities for collaborative SLAM.
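A Hellinger-distance-based damping rule of the kind the abstract mentions can be sketched for 1D Gaussian beliefs; the blending rule and threshold below are illustrative assumptions, not the CBS algorithm itself:

```python
import numpy as np

def hellinger_gauss(m1, s1, m2, s2):
    """Hellinger distance between two 1D Gaussians N(m1, s1^2), N(m2, s2^2)."""
    bc = np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * \
         np.exp(-(m1 - m2)**2 / (4 * (s1**2 + s2**2)))
    return np.sqrt(1.0 - bc)

def damped_update(old, new, h_max=0.3):
    """Sketch of a damping rule: when an incoming belief differs too much
    from the current one (large Hellinger distance), blend conservatively
    instead of overwriting, which helps avoid divergence on loopy graphs."""
    h = hellinger_gauss(*old, *new)
    alpha = min(1.0, h_max / max(h, 1e-9))  # shrink the step for large jumps
    m = (1 - alpha) * old[0] + alpha * new[0]
    s = (1 - alpha) * old[1] + alpha * new[1]
    return (m, s)

belief = damped_update((0.0, 1.0), (2.5, 0.5))
```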
|
| |
| 09:00-10:30, Paper ThI1I.372 | Add to My Program |
| TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within a Joint Learning Framework (I) |
|
| Tang, Guanfeng | Tongji University of China |
| Wu, Zhiyuan | King's College London |
| Li, Jiahang | Tongji University |
| Zhong, Ping | Central South University |
| Chen, Xieyuanli | The Chinese University of Hong Kong |
| Lu, Huimin | National University of Defense Technology |
| Fan, Rui | Tongji University |
Keywords: Deep Learning for Visual Perception, RGB-D Perception
Abstract: Semantic segmentation and stereo matching, respectively analogous to the ventral and dorsal streams in our human brain, are two key components of autonomous driving perception systems. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, especially emphasizing feature sharing between the two tasks. The major contributions of this study lie in comprehensively tightening the coupling between semantic segmentation and stereo matching. Specifically, this study makes three key contributions: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI, vKITTI2, and Cityscapes datasets, along with both qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function. Our approach demonstrates superior performance compared to prior arts, with a notable increase in mean intersection over union by over 9%.
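A generic gated feature-fusion block of the kind contribution (1) describes can be sketched in a few lines; channel sizes and layout are illustrative assumptions, not the TiCoSS architecture:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of a gated fusion block: a learned sigmoid gate mixes
    segmentation and stereo feature maps. Illustrative only."""

    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_seg, f_stereo):
        g = self.gate(torch.cat([f_seg, f_stereo], dim=1))
        return g * f_seg + (1.0 - g) * f_stereo

fuse = GatedFusion(64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```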
|
| |
| 09:00-10:30, Paper ThI1I.373 | Add to My Program |
| Tactile-Driven Dexterous In-Hand Writing Via Extrinsic Contact Sensing |
|
| Zhao, Can | Shanghai Jiao Tong University |
| Xie, Lingzi | South China University of Technology |
| Huang, Bidan | Tencent |
| Wang, Shuai | Tencent |
| Ma, Daolin | Shanghai Jiao Tong University |
Keywords: In-Hand Manipulation, Force and Tactile Sensing, Reinforcement Learning
Abstract: Dexterous in-hand manipulation, especially involving interactions between grasped objects and external environments, remains a formidable challenge in robotics. This study tackles the complexities of in-hand manipulation under extrinsic contact through a representative three-finger handwriting task. We propose a hybrid arm-hand coordination framework that combines reinforcement learning with compliance control, offering both flexibility and robustness. Leveraging tactile sensors embedded in each finger, our tactile-driven estimation model dynamically predicts in-hand object pose and external contact, eliminating the need for fixed contact states. The proposed framework is first validated in simulation, where it successfully executes diverse writing tasks with accurate contact sensing. Sim-to-Real transfer is achieved through systematic calibration of finger joints and tactile sensors, supported by domain randomization. Real-world experiments further demonstrate the system's adaptability to writing tools with varying physical properties—such as radius, length, mass, and friction—while maintaining stability across different trajectories. This work advances robotic manipulation capabilities in unstructured environments.
|
| |
| 09:00-10:30, Paper ThI1I.374 | Add to My Program |
| GraspControl: Text-Sketch Instruction As an Interface for Controllable Grasp Synthesis |
|
| Wen, XiaoPeng | Dalian University of Technology |
| Tian, Songtao | Dalian University of Technology |
| Sun, Yi | Dalian University of Technology |
Keywords: Deep Learning in Grasping and Manipulation, Grasping, Deep Learning for Visual Perception
Abstract: Large vision-language models have been shown to perform complex tasks. However, aligning language instructions with object visual information to enable general inference for robotic grasping poses a significant challenge. To tackle this issue, we introduce GraspControl, a method that leverages grasp language instructions and sketches of objects to control the generation of grasps. Initially, we construct a dataset that augments language instructions with position and orientation information of grasps, and visual information with sketches of the gripper and target objects. Subsequently, we develop a model capable of generating 2D grasp sketches given grasp language and 2D object sketches as input prompts, thereby bridging the gap between the linguistic and visual representations of the object to be grasped. These generated 2D grasp sketches serve as an innovative input modality for grasp synthesis, directing the creation of 3D object models and corresponding 3D grasp poses through a 3D reconstruction module. Furthermore, we incorporate a multi-modal attention loss to ensure the consistency between high-level semantic grasp features and intricate low-level visual features, with a particular emphasis on the grasping area of the object. We evaluate the capabilities of our grasp approach through extensive experiments in both simulated and real-world robotic scenarios. The experimental results confirm that our method can execute grasps in complex environments.
|
| |
| 09:00-10:30, Paper ThI1I.375 | Add to My Program |
| A Policy Model Based Efficient and Accurate Scene Recognition Method for Service Robot |
|
| Liu, Shaopeng | The Hong Kong Polytechnic University |
| Zhou, Guanzhong | The Hong Kong Polytechnic University |
| Huang, Chao | Adelaide University |
Keywords: Recognition, Cognitive Modeling, Reinforcement Learning
Abstract: In domestic environments, assigning scene semantic labels (scene recognition) to each node of a topological semantic map is an important task. Given the limitations of current scene recognition methods in efficiency and accuracy for service robots, this paper proposes a scene recognition method based on a policy model. Considering the similarity of images captured from adjacent nodes and the low-quality images caused by the uncertain node position and observation direction of the robot, we develop a policy model using a deep Q-learning network (DQN). This model enhances accuracy and efficiency by deciding whether to (1) inherit the scene type from the preceding node without re-recognition or (2) adjust the robot’s observation angle to capture a more informative image. A rule-based reward function integrated with a scene score model enables simultaneous learning of similarity assessment and viewpoint adjustment policies. Furthermore, a training strategy based on generated paths is proposed to provide sufficient data for training the policy model. Extensive comparative experiments in simulated environments demonstrate that our method surpasses state-of-the-art approaches in both recognition accuracy and efficiency. Deployment on a mobile robot confirms its practical efficacy, achieving precise and efficient scene recognition across diverse real-world environments.
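The per-node decision the abstract describes maps naturally onto a small discrete Q-network; the 128-D input feature standing in for image and pose information is an assumption, and the network below is a toy sketch rather than the paper's model:

```python
import torch
import torch.nn as nn

class NodePolicyDQN(nn.Module):
    """Toy DQN head for the per-node decision in the abstract:
    action 0 = inherit the previous node's scene label,
    action 1 = rotate to capture a more informative view."""

    def __init__(self, feat_dim=128, n_actions=2):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def act(self, features):
        with torch.no_grad():
            return int(self.q(features).argmax(dim=-1))

policy = NodePolicyDQN()
action = policy.act(torch.randn(1, 128))
```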
|
| |
| 09:00-10:30, Paper ThI1I.376 | Add to My Program |
| GAIA: Generating Task Instruction Aware Simulation Grounded in Real Contexts Using Vision-Language Models |
|
| Ko, Dogyu | Kyung Hee University |
| Yeo, Chanyoung | Kyung Hee University |
| Kim, Daeho | Kyung Hee University |
| Kim, Jaeho | KyungHee Universtiy |
| Hwang, Hyoseok | Kyung Hee University |
Keywords: Simulation and Animation, Task and Motion Planning, Deep Learning for Visual Perception
Abstract: Enabling robots to interact effectively with the real world requires extensive learning from physical interaction data, making simulation crucial for generating such data safely and cost-effectively. Despite the advantages of simulation, manual environment creation remains a laborious process, motivating the development of automated generation approaches. However, the limitations of current automatic virtual scene generation approaches in bridging the sim-to-real gap and achieving task readiness necessitate the creation of automatically generated, realistic, and task-ready virtual scenes. In this paper, we propose GAIA, a novel methodology to automatically generate interactive, task-ready simulation environments grounded in real contexts from only a single RGB image and a task instruction. GAIA utilizes a pre-trained Vision-Language Model (VLM) without requiring explicit training, and jointly understands the visual context and the user’s instruction. Based on this understanding, it infers and places necessary task-aware objects, including unseen ones to construct an interactive virtual environment that maintains real-scene fidelity while reflecting task requirements without additional manual setup. We show qualitative experiments that GAIA generates spaces consistent with user instructions, and quantitative results that policies learned within these GAIA-generated environments successfully transfer to target environments. Source code and supplementary materials are available at our project page https://sites.google.com/view/gaia-project-page.
|
| |
| 09:00-10:30, Paper ThI1I.377 | Add to My Program |
| EGS-SLAM: RGB-D Gaussian Splatting SLAM with Events |
|
| Chen, Siyu | Nanyang Technological University |
| Yuan, Shenghai | Nanyang Technological University |
| Nguyen, Thien-Minh | The University of Queensland |
| Huang, Zhuyu | Beihang University |
| Shi, Chenyang | Beihang University |
| Jin, Jing | Beihang University |
| Xie, Lihua | Nanyang Technological University |
| |
| 09:00-10:30, Paper ThI1I.378 | Add to My Program |
| CLOVER: Context-Aware Long-Term Object Viewpoint and Environment Invariant Representation Learning |
|
| Adkins, Amanda | University of Texas at Austin |
| Lee, Dongmyeong | University of Texas at Austin |
| Biswas, Joydeep | The University of Texas at Austin |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision, Semantic Scene Understanding
Abstract: Mobile service robots can benefit from object-level understanding of their environments, including the ability to distinguish object instances and re-identify previously seen instances. Object re-identification is challenging across different viewpoints and in scenes with significant appearance variation arising from weather or lighting changes. Existing works on object re-identification either focus on specific classes or require foreground segmentation. Further, these methods, along with object re-identification datasets, have limited consideration of challenges such as occlusions, outdoor scenes, and illumination changes. To address this problem, we introduce CODa Re-ID: an in-the-wild object re-identification dataset containing 1,037,814 observations of 557 objects across 8 classes under diverse lighting conditions and viewpoints. Further, we propose CLOVER, a representation learning method for object observations that can distinguish between static object instances without requiring foreground segmentation. We also introduce MapCLOVER, a method for scalably summarizing CLOVER descriptors for use in object maps and matching new observations to summarized descriptors. Our results show that CLOVER achieves superior performance in static object re-identification under varying lighting conditions and viewpoint changes and can generalize to unseen instances and classes.
|
| |
| 09:00-10:30, Paper ThI1I.379 | Add to My Program |
| Mini Diffuser: Fast Multi-Task Diffusion Policy Training Using Two-Level Mini-Batches |
|
| Hu, Yutong | KU Leuven |
| Song, Pinhao | KU Leuven |
| Wen, Kehan | ETH Zurich |
| Detry, Renaud | KU Leuven |
Keywords: Deep Learning in Grasping and Manipulation, Learning from Demonstration, Imitation Learning
Abstract: We present a method that reduces, by an order of magnitude, the time and memory needed to train multi-task vision-language robotic diffusion policies. This improvement arises from a previously underexplored distinction between action diffusion and the image diffusion techniques that inspired it: In image generation, the target is high-dimensional. By contrast, in action generation, the dimensionality of the target is comparatively small, and only the image condition is high-dimensional. Our approach, Mini Diffuser, exploits this asymmetry by introducing two-level minibatching, which pairs multiple noised action samples with each vision-language condition, instead of the conventional one-to-one sampling strategy. To support this batching scheme, we introduce architectural adaptations to the diffusion transformer that prevent information leakage across samples while maintaining full conditioning access. In RLBench simulations, Mini-Diffuser achieves 95% of the performance of state-of-the-art multi-task diffusion policies, while using only 5% of the training time and 7% of the memory. Real-world experiments further validate that Mini-Diffuser preserves the key strengths of diffusion-based policies, including the ability to model multimodal action distributions and produce behavior conditioned on diverse perceptual inputs.
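The two-level mini-batching idea can be sketched directly from the abstract: encode each expensive vision-language condition once and pair it with several independently noised action samples. Shapes and the simplified noising step below are illustrative assumptions:

```python
import torch

def two_level_batch(cond_emb, actions, k_noise=8):
    """Sketch of two-level mini-batching: reuse each condition embedding
    across k_noise noised copies of its action target, instead of the
    conventional one noised sample per condition."""
    b, d = cond_emb.shape                     # (conditions, embed dim)
    t = torch.randint(0, 1000, (b, k_noise))  # per-sample diffusion steps
    noise = torch.randn(b, k_noise, actions.shape[-1])
    cond = cond_emb[:, None, :].expand(b, k_noise, d)  # reuse, no recompute
    noised = actions[:, None, :] + noise      # stand-in for the true noising op
    return cond.reshape(b * k_noise, d), noised.reshape(b * k_noise, -1), t

c, a, t = two_level_batch(torch.randn(4, 512), torch.randn(4, 7))
```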
|
| |
| 09:00-10:30, Paper ThI1I.380 | Add to My Program |
| Robust Monocular Visual Odometry Via Dual-Paradigm Curriculum Learning |
|
| Lahiany, Assaf | University of Haifa |
| Gal, Oren | University of Haifa |
Keywords: Vision-Based Navigation, SLAM, Deep Learning Methods
Abstract: Monocular visual odometry (VO) is accurate in controlled settings yet drifts sharply under aggressive motion and sensor noise. We offer a fundamental rethinking of VO robustness as a training-schedule problem rather than an architectural challenge, introducing a novel dual-paradigm curriculum learning framework that operates at both trajectory and loss-component levels. (i) A motion-based curriculum orders trajectories by measured motion complexity. (ii) A hierarchical component curriculum adaptively re-weights optical-flow, pose, and rotation losses via Self-Paced and in-training Reinforcement Learning (RL) schedulers. Integrated into an unmodified DPVO baseline, these strategies cut TartanAir ATE by 33% with only 31% extra training wall-time, and reach baseline accuracy 47% faster (Self-Paced). Without fine-tuning, the same models improve zero-shot performance on EuRoC (-13%), TUM-RGBD (-9%), and ICL-NUIM (-32%). We show that explicit difficulty progression or adaptive loss weighting provides a practical, zero-inference-overhead path to robust monocular VO and could extend to other geometric vision tasks.
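One plausible reading of the loss-component scheduler is the classic self-paced learning rule, sketched below; the threshold schedule and values are illustrative, not the paper's configuration:

```python
import numpy as np

def self_paced_weights(losses, threshold):
    """Classic self-paced rule: samples (or loss components) whose current
    loss exceeds a growing threshold are down-weighted to zero, and the
    threshold is raised as training progresses."""
    return (np.asarray(losses) <= threshold).astype(float)

losses = [0.2, 1.5, 0.7, 3.0]
for epoch, thr in enumerate([0.5, 1.0, 2.0, 4.0]):  # illustrative schedule
    w = self_paced_weights(losses, thr)
    weighted_loss = float(np.dot(w, losses) / max(w.sum(), 1.0))
```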
|
| |
| 09:00-10:30, Paper ThI1I.381 | Add to My Program |
| Robot Deformable Object Manipulation Via NMPC-Generated Demonstrations in Deep Reinforcement Learning (I) |
|
| Wang, Haoyuan | Huazhong University of Science and Technology |
| Dong, Zihao | Huazhong University of Science and Technology |
| Zhu, Tong | Huazhong University of Science and Technology |
| Lei, HongLiang | Huazhong University of Science and Technology |
| Shi, Weizhuang | Huazhong University of Science and Technology |
| Zhang, Zejia | Huazhong University of Science and Technology |
| Luo, Wei | China Ship Development &design Center |
| Wan, Weiwei | Osaka University |
| Chen, Xinxing | Huazhong University of Science and Technology |
| Huang, Jian | Huazhong University of Science and Technology |
Keywords: Deep Learning in Grasping and Manipulation, Reinforcement Learning, Dexterous Manipulation
Abstract: In this work, we conducted research on deformable object manipulation by robots based on demonstration-enhanced reinforcement learning (RL). We present FADERL (Fuzzy-Augmented Demonstration-Embedded Reinforcement Learning), a novel framework for robotic manipulation of deformable objects that significantly improves reinforcement learning efficiency through synergistic unification of High-Dimensional Takagi-Sugeno-Kang (HTSK) fuzzy systems, Generative Adversarial Behavior Cloning (GABC), and Conditional Policy Learning (CPL). Compared to the Rainbow-DDPG baseline, FADERL achieves 2.01× higher global average reward and reduces standard deviation to 45% while requiring fewer computational resources. To address the high cost of human demonstration collection, we introduce a Nonlinear Model Predictive Control (NMPC)-based data augmentation method that generates high-quality demonstrations at minimal cost. Simulation results demonstrate that NMPC-generated demonstrations enable FADERL to achieve performance comparable to human demonstrations. Physical experiments on fabric manipulation tasks—diagonal folding, central-axis folding, and flattening—achieve success rates of 83.3%, 80.0%, and 96.7%, respectively, validating our approach’s effectiveness in real-world scenarios. Unlike computationally intensive large-model approaches, FADERL provides a lightweight, task-specific solution with efficient adaptability, making it suitable for practical robotic applications in manufacturing.
|
| |
| 09:00-10:30, Paper ThI1I.382 | Add to My Program |
| Accelerating Residual Reinforcement Learning with Uncertainty Estimation |
|
| Dodeja, Lakshita | Brown University |
| Schmeckpeper, Karl | Robotics and AI Institute |
| Vats, Shivam | Brown University |
| Weng, Thomas | Boston Dynamics AI Institute |
| Jia, Mingxi | Brown University |
| Konidaris, George | Brown University |
| Tellex, Stefanie | Brown University |
Keywords: Reinforcement Learning, Deep Learning Methods, Machine Learning for Robot Control
Abstract: Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other Residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned policies in the real world to demonstrate their robustness with zero-shot sim-to-real transfer.
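The uncertainty-aware residual idea can be sketched with a simple sampling-based proxy: draw several actions from the stochastic base policy, use their spread as an uncertainty estimate, and let the residual act more strongly where the base policy is uncertain. The gating function and knobs below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def residual_action(base_samples, residual, beta=1.0, clip=0.2):
    """Sketch of uncertainty-gated residual control. base_samples holds
    several draws from the stochastic base policy; their spread serves as
    a crude uncertainty proxy that scales the residual correction."""
    base = base_samples.mean(axis=0)
    uncertainty = base_samples.std(axis=0).mean()  # scalar proxy
    scale = np.tanh(beta * uncertainty)            # in [0, 1)
    return base + np.clip(scale * residual, -clip, clip)

samples = np.random.normal(0.0, 0.1, size=(10, 7))  # 10 base-policy draws
a = residual_action(samples, residual=np.full(7, 0.3))
```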
|
| |
| 09:00-10:30, Paper ThI1I.383 | Add to My Program |
| SPILL: Size, Pose, and Internal Liquid Level Estimation of Transparent Glassware for Robotic Bartending |
|
| Adriaens, Louis | Ghent University |
| Lips, Thomas | Ghent University |
| De Coster, Mathieu | Ghent University |
| Verleysen, Andreas | Ghent University |
| Wyffels, Francis | Ghent University |
Keywords: Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization, Service Robotics
Abstract: Robotic perception of transparent objects presents unique challenges due to their refractive properties, lack of texture, and limitations of conventional RGB-D sensors in capturing reliable depth information. These challenges significantly hinder robotic manipulation capabilities in real-world settings such as household assistance, hospitality, and healthcare. To address these issues, we propose SPILL: A lightweight perception pipeline for Size, Pose, and Internal Liquid Level estimation of unknown transparent glassware using a single view. SPILL combines object detection with semantic keypoint detection, and operates without requiring object-specific 3D models or depth completion. We demonstrate its effectiveness in autonomous robotic pouring tasks. Additionally, to enhance the robustness and generalization of keypoint detection to diverse real-world scenarios, we introduce Glasses-in-the-Wild, a new dataset that captures a wide variety of glass types in realistic environments. Evaluated on a robot manipulator, SPILL achieves a 93.6% success rate across 500 autonomous pours with 20 unseen glasses in three diverse real-world scenes. We further demonstrate robustness through multiple live public events in real-world, human-centered environments. In one recorded session, the robot autonomously served 62 drinks with a 98.3% success rate. These results demonstrate that task-relevant keypoint detection enables scalable, real-world transparent object interaction, paving the way for practical applications in service and assistive robotics - without spilling a drop. Dataset and code will be released upon acceptance.
|
| |
| 09:00-10:30, Paper ThI1I.384 | Add to My Program |
| PokeFlex: A Real-World Dataset of Volumetric Deformable Objects for Robotics |
|
| Obrist, Jan | ETH Zurich |
| Zamora Mora, Miguel Angel | ETH Zurich |
| Zheng, Hehui | ETH Zurich |
| Hinchet, Ronan | ETH Zurich |
| Ozdemir, Firat | ETH Zurich and EPFL |
| Zarate, Juan Jose | ETH Zurich |
| Katzschmann, Robert Kevin | ETH Zurich |
| Coros, Stelian | ETH Zurich |
Keywords: Data Sets for Robotic Vision, Deep Learning for Visual Perception, Data Sets for Robot Learning
Abstract: Data-driven methods have shown great potential in solving challenging manipulation tasks; however, their application in the domain of deformable objects has been constrained, in part, by the lack of data. To address this lack, we propose PokeFlex, a dataset featuring real-world multimodal data that is paired and annotated. The modalities include 3D textured meshes, point clouds, RGB images, and depth maps. Such data can be leveraged for several downstream tasks, such as online 3D mesh reconstruction, and it can potentially enable underexplored applications such as the real-world deployment of traditional control methods based on mesh simulations. To deal with the challenges posed by real-world 3D mesh reconstruction, we leverage a professional volumetric capture system that allows complete 360° reconstruction. PokeFlex consists of 18 deformable objects with varying stiffness and shapes. Deformations are generated by dropping objects onto a flat surface or by poking the objects with a robot arm. Interaction wrenches and contact locations are also reported for the latter case. Using different data modalities, we demonstrate a use case for our dataset by training models that, given the novelty of the multimodal nature of PokeFlex, constitute, to the best of our knowledge, the state of the art in multi-object online template-based mesh reconstruction from multimodal data. We refer the reader to our website or the supplementary material for further demos and examples.
|
| |
| 09:00-10:30, Paper ThI1I.385 | Add to My Program |
| Hybrid Contact Dynamics and Residual-RL Framework for Multi-Point Object Pushing |
|
| Chen, Chen | McGill University |
| Dai, Xu | McGill University |
| Kovecses, Jozsef | McGill University |
Keywords: Contact Modeling, Manipulation Planning, Reinforcement Learning
Abstract: Robotic contact manipulation involves applying controlled forces at contact points to guide an object along a desired trajectory while respecting the underlying physical interactions. This paper presents a novel framework that integrates dynamic modeling and Reinforcement Learning (RL) to achieve robust object pushing with a redundant robotic manipulator. First, a comprehensive dynamic contact model is formulated, incorporating unilateral constraints and a box friction model to capture the nonlinearities present in real-world contact dynamics. Second, the model is extended to handle multiple simultaneous point contacts, enabling effective trajectory planning and tracking for redundant robotic manipulators in multi-contact pushing tasks. Third, an RL strategy is introduced as a residual module that augments a model-based controller to improve pushing performance. Simulation and real-world experiments with a Kinova Gen2 arm demonstrate that the proposed method achieves accurate trajectory following and stable contact interactions, significantly outperforming traditional PD control strategies in dynamic pushing scenarios.
|
| |
| 09:00-10:30, Paper ThI1I.386 | Add to My Program |
| The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey |
|
| Li, Gaofeng | Zhejiang University |
| Wang, Ruize | Zhejiang University |
| Xu, Peisen | Zhejiang University |
| Ye, Qi | Zhejiang University |
| Chen, Jiming | Zhejiang University |
Keywords: Dexterous Manipulation, Multifingered Hands, AI-Enabled Robotics
Abstract: Achieving human-like dexterous robotic manipulation remains a central goal and a pivotal challenge in robotics. The development of Artificial Intelligence (AI) has allowed rapid progress in robotic manipulation. This survey summarizes the evolution of robotic manipulation from mechanical programming to embodied intelligence, alongside the transition from simple grippers to multi-fingered dexterous hands, outlining key characteristics and main challenges. Focusing on the current stage of embodied dexterous manipulation, we highlight recent advances in two critical areas: dexterous manipulation data collection (via simulation, human demonstrations, and teleoperation) and skill-learning frameworks (imitation and reinforcement learning). Then, based on the overview of the existing data collection paradigm and learning framework, three key challenges restricting the development of dexterous robotic manipulation are summarized and discussed.
|
| |
| 09:00-10:30, Paper ThI1I.387 | Add to My Program |
| Deployment of an Aerial Multiagent System for Automated Task Execution in Large-Scale Underground Mining Environments (I) |
|
| Dahlquist, Niklas | Luleå University of Technology |
| Nordström, Samuel | Luleå University of Technology |
| Stathoulopoulos, Nikolaos | Luleå University of Technology |
| Lindqvist, Björn | Luleå University of Technology |
| Saradagi, Akshit | Luleå University of Technology, Luleå, Sweden |
| Nikolakopoulos, George | Luleå University of Technology |
Keywords: Field Robots, Mining Robotics
Abstract: In this article, we present a framework for deploying aerial multiagent systems in large-scale subterranean environments with minimal supporting infrastructure. The objective is to optimally and reactively execute routine inspection tasks, selected by a mine operator on-the-fly. The assignment of currently available tasks to the agents is accomplished through an auction-based system, where the agents bid for available tasks, which are used by a central auctioneer to optimally assign the tasks. A mobile Wi-Fi mesh supports interagent communication and bi-directional communication between agents and the task allocator, while the task execution is performed completely infrastructure-free. Given a task to be accomplished, reliable and modular agent behavior is synthesized by generating behavior trees from a pool of agent capabilities, using a back-chaining approach. The auction system is reactive and supports the addition of new tasks on-the-go, at any point through a user-friendly operator interface. The framework has been validated in a real underground mining environment using three aerial agents, with several inspection locations spread in an environment of almost 200 m as a proof-of-concept. The scalability, fault tolerance, and the influence of agent initializations on the multiagent architecture have been tested through complementary Gazebo simulations in a cave environment. The proposed framework can be utilized in a subterranean environment for missions involving rapid i
|
| |
| 09:00-10:30, Paper ThI1I.388 | Add to My Program |
| HR2-KILO: A High-Rate, Robust, Kinematic-Inertial-LiDAR Odometry for Humanoid Robots |
|
| Gao, Jixin | Harbin Institute of Technology |
| Zha, Fusheng | Harbin Institute of Technology |
| Zhang, Lianzhao | Harbin Institute of Technology |
| Guo, Wei | Harbin Institute of Technology |
| Wang, Pengfei | Harbin Institute of Technology, State Key Laboratory of Robotics and System |
| Sun, Lining | Harbin Institute of Technology |
| Li, Mantian | Institute of Intelligent Manufacturing Technology |
|
|
| |
| 09:00-10:30, Paper ThI1I.389 | Add to My Program |
| A Framework for Adaptive Load Redistribution in Human-Exoskeleton-Cobot Systems |
|
| Mobedi, Emir | Izmir Institute of Technology |
| Solak, Gokhan | Istituto Italiano Di Tecnologia |
| Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Human-Robot Collaboration, Wearable Robotics, Physically Assistive Devices
Abstract: Wearable devices like exoskeletons are designed to reduce excessive loads on specific joints of the body. Specifically, single- or two-degrees-of-freedom (DOF) upper-body industrial exoskeletons typically focus on compensating for the strain on the elbow and shoulder joints. However, during daily activities, there is no assurance that external loads are correctly aligned with the supported joints. Optimizing work processes to ensure that external loads are primarily (to the extent that they can be compensated by the exoskeleton) directed onto the supported joints can significantly enhance the overall usability of these devices and the ergonomics of their users. Collaborative robots (cobots) can play a role in this optimization, complementing the collaborative aspects of human work. In this study, we propose an adaptive and coordinated control system for the human-cobot-exoskeleton interaction. This system adjusts the task coordinates to maximize the utilization of the supported joints. When the torque limits of the exoskeleton are exceeded, the framework continuously adapts the task frame, redistributing excessive loads to non-supported body joints to prevent overloading the supported ones. We validated our approach in an equivalent industrial painting task involving a single-DOF elbow exoskeleton, a cobot, and four subjects, each tested in four different initial arm configurations with five distinct optimisation weight matrices and two different payloads.
|
| |
| 09:00-10:30, Paper ThI1I.390 | Add to My Program |
| Time-Optimal Anti-Sloshing Trajectory Planning for Multiple Liquid-Filled Containers Subject to SCARA Motion |
|
| Ferrari, Andrea | University of Bologna |
| Di Leva, Roberto | University of Bologna |
| Soprani, Simone | University of Bologna |
| Biagiotti, Luigi | University of Modena and Reggio Emilia |
| Palli, Gianluca | University of Bologna |
| Carricato, Marco | University of Bologna |
Keywords: Motion and Path Planning, Industrial Robots, Optimization and Optimal Control
Abstract: This paper develops algorithms for planning time-optimal pick-and-place trajectories for multiple cylindrical containers filled with liquid and simultaneously transported by a robot. The considered trajectories comprise 3D translations combined with a 1D rotation about the vertical direction, i.e. SCARA motions. The presented approach minimizes the execution time, while ensuring that the liquid surface within each container remains below an imposed threshold throughout the motion. Two types of optimal trajectories are studied: one optimizes the motion law along a given path, the other optimizes both the path and the motion law. Extensive simulations identify the most efficient optimization setup, whereas experiments validate the approach. The data sets of all simulated and experimental motions are distributed through an external repository.
|
| |
| 09:00-10:30, Paper ThI1I.391 | Add to My Program |
| Curvature-Constrained Vector Field for Motion Planning of Nonholonomic Robots |
|
| Qiao, Yike | Peking University |
| He, Xiaodong | University of Science and Technology Beijing |
| Zhuo, An | Peking University |
| Sun, Zhiyong | Peking University (PKU) |
| Bao, Weimin | China Aerospace Science and Technology Corporation |
| Li, Zhongkui | Peking University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Nonholonomic Motion Planning, Curvature-Constrained Vector Field
Abstract: Vector fields handle nonholonomic motion planning as they provide reference orientation for robots. However, additionally incorporating curvature constraints becomes challenging, due to the interconnection between the design of the curvature-bounded vector field and the tracking controller under underactuation. In this paper, we present a novel framework to co-develop the vector field and the control laws, guiding the nonholonomic robot to the target configuration with curvature-bounded trajectory. First, we construct a curvature-constrained vector field (CVF) via blending and distributing basic flow fields to provide curvature-bounded reference trajectory. Next, we propose the saturated control laws with a dynamic gain. Under the control laws, kinematically constrained nonholonomic robots are guaranteed to track the reference CVF and converge to the target positive limit set with bounded trajectory curvature. Numerical simulations show that the proposed CVF method outperforms other vector-field-based algorithms. Experiments on Ackermann UGVs and semi-physical fixed-wing UAVs demonstrate that the method can be effectively implemented in real-world scenarios.
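The blending of basic flow fields the abstract mentions can be sketched in 2D; the fields and blend weight below are toy assumptions, and the sketch omits the curvature bounding that is the paper's main contribution:

```python
import numpy as np

def blended_field(p, target, k_blend=1.0):
    """Toy composite vector field: far from the target a uniform approach
    flow dominates; near it a converging flow takes over. The paper's CVF
    additionally bounds the resulting trajectory curvature."""
    to_goal = target - p
    d = np.linalg.norm(to_goal)
    approach = np.array([1.0, 0.0])            # basic uniform flow
    converge = to_goal / max(d, 1e-9)          # basic converging flow
    sigma = np.exp(-k_blend * d)               # blend weight near the target
    v = (1 - sigma) * approach + sigma * converge
    return v / max(np.linalg.norm(v), 1e-9)    # unit reference direction

v = blended_field(np.array([2.0, 1.0]), np.array([0.0, 0.0]))
```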
|
| |
| 09:00-10:30, Paper ThI1I.392 | Add to My Program |
| Design, Control and Evaluation of a Novel Soft Everting Robot for Colonoscopy |
|
| Shi, Jialei | Imperial College London |
| Borvorntanajanya, Korn | Imperial College London |
| Chen, Kaiwen | Imperial College London |
| Franco, Enrico | Imperial College London |
| Rodriguez y Baena, Ferdinando | Imperial College, London, UK |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Medical Robots and Systems, Modeling, Control, and Learning for Soft Robots
Abstract: Colonoscopy is a medical procedure used to examine the inside of the colon for abnormalities, such as polyps or cancer. Traditionally, this is done by manually inserting a long, flexible tube called a colonoscope into the colon. However, this method can cause pain, discomfort, and even the risk of perforation. To address these shortcomings, advancements in technology are needed to develop safer, more intelligent colonoscopes. This paper presents the design, control and evaluation of a self-growing soft robotic colonoscope, leveraging the eversion principle. The device features a tube with an 18 mm diameter, constructed from stretchable fabric, which grows 1.6 m at the tip under pressurization. A pneumatically driven, elastomer-based manipulator enables omni-directional steering over 180 degrees at the tip. An airtight base houses motors and spools that control the material and regulate growth speed. The robot operates in two modes: teleoperation via joysticks and autonomous navigation using sensor inputs, such as a tip-mounted camera. Thorough in-vitro experiments are conducted to assess the system's functionality and performance. Results illustrate that the robot can achieve locom
|
| |
| 09:00-10:30, Paper ThI1I.393 | Add to My Program |
| Semantic Hierarchy-Guided Adversarial Attack for Autonomous Driving |
|
| Kim, Gwangbin | Gwangju Institute of Science and Technology |
| Kim, SeungJun | Gwangju Institute of Science and Technology |
Keywords: Computer Vision for Transportation, Semantic Scene Understanding, Robot Safety
Abstract: Autonomous vehicles employ semantic segmentation as a foundational component for perception and scene understanding, upon which driving decisions can be informed. Despite their performance, these deep learning models remain susceptible to subtle input perturbations that can cause severe deviation in model output. To enhance algorithmic robustness by examining such vulnerabilities, researchers have investigated adversarial examples, which are visually imperceptible yet can severely degrade model performance. However, traditional attacks produce arbitrary misclassifications that ignore semantic relationships, making the attack less effective. This paper introduces a semantic hierarchy-guided adversarial attack (SHAA), a white-box adversarial attack against semantic segmentation for autonomous driving. By combining semantic hierarchy and adaptive momentum-based updates across the image, SHAA produces semantically nontrivial yet highly effective perturbations. The SHAA method exposes deeper vulnerabilities with a higher attack success rate in semantic segmentation than existing methods, aiding the design of a more resilient perception system for autonomous vehicles.
|
| |
| 09:00-10:30, Paper ThI1I.394 | Add to My Program |
| TD-CD-MPPI: Temporal-Difference Constraint-Discounted Model Predictive Path Integral Control |
|
| Crestaz, Pietro Noah | University of Trento, LAAS-CNRS |
| De Matteïs, Ludovic | LAAS-CNRS |
| Chane-Sane, Elliot | LAAS, CNRS |
| Mansard, Nicolas | CNRS |
| Del Prete, Andrea | University of Trento |
Keywords: Optimization and Optimal Control, Legged Robots, Whole-Body Motion Planning and Control
Abstract: Path Integral methods have demonstrated remarkable capabilities for solving non-linear stochastic optimal control problems through sampling-based optimization. However, their computational complexity grows linearly with the prediction horizon, limiting long-term reasoning, while constraints are merely enforced through handcrafted penalties. In this work, we propose a unified and efficient framework for enabling long-horizon reasoning and constraint enforcement within Model Predictive Path Integral (MPPI) control. First, we introduce a practical method to incorporate a terminal value function, learned offline via temporal-difference learning, to approximate the long-term cost-to-go. This allows for significantly shorter roll-outs while enabling infinite-horizon reasoning, thereby improving computational efficiency and motion performance. Second, we propose a discount modulation strategy that adjusts the return of sampled trajectories based on constraint violations. This provides a more interpretable and effective mechanism for enforcing constraints compared to traditional cost shaping. Our formulation retains the flexibility and sampling efficiency of MPPI while supporting structured integration of long-term objectives and constraint handling. We validate our approach on both simulated and real-world robotic locomotion tasks, demonstrating improved performance, constraint-awareness, and generalization under reduced computational budgets.
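The two mechanisms in this abstract combine naturally in the MPPI weighting step: add a learned terminal value as the cost-to-go of each short rollout, and discount the weight of rollouts in proportion to their constraint violations instead of adding handcrafted penalties. The sketch below is an illustrative reading with made-up scalars, not the authors' implementation:

```python
import numpy as np

def mppi_weights(costs, violations, terminal_values, lam=1.0, gamma_c=0.5):
    """Sketch of TD-CD-MPPI-style weighting: terminal_values approximate the
    long-term cost-to-go of each rollout, and gamma_c**violations discounts
    rollouts that violate constraints."""
    total = costs + terminal_values              # short rollout + cost-to-go
    w = np.exp(-(total - total.min()) / lam)     # standard MPPI weighting
    w = w * (gamma_c ** violations)              # discount violating rollouts
    return w / w.sum()

w = mppi_weights(costs=np.array([3.0, 2.0, 5.0]),
                 violations=np.array([0, 2, 0]),
                 terminal_values=np.array([1.0, 1.5, 0.5]))
```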
|
| |
| 09:00-10:30, Paper ThI1I.395 | Add to My Program |
| CCRobot-S: A Robotic Cable-Climbing Squad Collaborating for Fast Inspection and Heavy-Duty Maintenance |
|
| Zheng, Zhenliang | The Chinese University of Hong Kong, Shenzhen |
| Ding, Ning | The Chinese University of Hong Kong, Shenzhen |
| Werner, Herbert | Hamburg University of Technology |
| Ren, Feng | Shenzhen Institute of Artificial Intelligence and Robotics for Society |
| Xu, Yongyuan | The Chinese University of Hong Kong CUHK (Shenzhen) |
| Zhang, Wenchao | AIRS |
| Hu, Xiaoli | Shenzhen Institute of Artificial Intelligence and Robotics for Society |
| Zhang, Jianguo | Shenzhen Institute of Artificial Intelligence and Robotics for Society |
| Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
|
|
| |
| 09:00-10:30, Paper ThI1I.396 | Add to My Program |
| A Neuro-Inspired Control Architecture to Enhance Robot Self-Preservation and Adaptation in Autonomous Navigation Tasks |
|
| Usai, Andrea | Politecnico Di Torino |
| Rizzo, Alessandro | Politecnico Di Torino |
Keywords: Bioinspired Robot Learning, Neurorobotics, Cognitive Control Architectures
Abstract: Ensuring survival and self-preservation is essential to design intelligent robots that adapt to dynamic and unfamiliar environments. Inspired by the dual-pathway model from neuroscience, we introduce a control architecture designed to ensure the adaptability of robotic behavior during navigation. This approach parallels the neuroscientific "Low Road" paradigm by incorporating constructs resembling the thalamus, implemented as a nonlinear filter; the amygdala, modeled as a Soft Actor-Critic (SAC) reinforcement learning agent; and the brainstem-cerebellum connection, represented by a Nonlinear Model Predictive Controller (NMPC). Our findings indicate superior adaptiveness, generalizability, and computational efficiency compared to standard NMPCs and Artificial Potential Fields in both static and dynamic environments with obstacles of varying risk levels.
|
| |
| 09:00-10:30, Paper ThI1I.397 | Add to My Program |
| Whole-Body Integrated Motion Planning for Aerial Manipulators |
|
| Deng, Weiliang | Sun Yat-Sen University |
| Chen, Hongming | Sun Yat-Sen University |
| Ye, Biyu | Sun Yat-Sen University |
| Chen, Haoran | Sun Yat-Sen University |
| Li, Ziliang | Sun Yat-Sen University |
| Lyu, Ximin | Sun Yat-Sen University |
Keywords: Motion and Path Planning, Aerial Systems: Applications, Collision Avoidance, Aerial Manipulation
Abstract: Expressive motion planning for Aerial Manipulators (AMs) is essential for tackling complex manipulation tasks, yet achieving coupled trajectory planning adaptive to various tasks remains challenging, especially for those requiring aggressive maneuvers. In this work, we propose a novel whole-body integrated motion planning framework for quadrotor-based AMs that leverages flexible waypoint constraints to achieve versatile manipulation capabilities. These waypoint constraints enable the specification of individual position requirements for either the quadrotor or end-effector, while also accommodating higher-order velocity and orientation constraints for complex manipulation tasks. To implement our framework, we exploit spatio-temporal trajectory characteristics and formulate an optimization problem to generate feasible trajectories for both the quadrotor and manipulator while ensuring collision avoidance considering varying robot configurations, dynamic feasibility, and kinematic feasibility. Furthermore, to enhance the maneuverability for specific tasks, we employ Imitation Learning (IL) to facilitate the optimization process to avoid poor local optima. The effectiveness of our framework is validated through comprehensive simulations and real-world experiments, where we successfully demonstrate nine fundamental manipulation skills across various environments.
|
| |
| 09:00-10:30, Paper ThI1I.398 | Add to My Program |
| Impedance Control Design Framework Using Commutative Map between SE(3) and se(3) |
|
| Kim, Jonghyeok | POSTECH |
| Sung, Minchang | Hanyang University |
| Choi, Youngjin | Hanyang University |
| Park, Jonghoon | Neuromeka |
| Chung, Wan Kyun | POSTECH |
Keywords: Compliance and Impedance Control, Motion Control of Manipulators, Force Control, Lie group and Lie algebra
Abstract: Impedance control is a widely adopted approach that ensures the compliant behavior of robot manipulators as they interact with their environment according to specifically designed dynamics. For tasks involving six degrees of freedom (DoF), it is crucial to appropriately manage the position and orientation of the end effector by controlling dynamic behavior. However, describing orientational displacement and designing the corresponding rotational impedance can be challenging, especially when we use a minimal representation. The well-known minimal representation for orientation, the Euler angle, suffers from representation singularity. As a remedy, the quaternion or dual quaternion can be an alternative but with non-minimal representations. This lack of a minimal representation free from representation singularity often leads to handling the impedance design by directly defining the potential energy function on the matrix Lie group. This paper proposes a framework for the six-DoF impedance control design that takes advantage of Lie group theory with a minimal representation, known as the exponential coordinate. Since the exponential coordinate can be treated as the Eu
|
| |
| 09:00-10:30, Paper ThI1I.399 | Add to My Program |
| MULE – Multi-Terrain and Unknown Load Adaptation for Effective Quadrupedal Locomotion |
|
| Kurva, Vamshi Kumar | IISc |
| Kolathaya, Shishir | Indian Institute of Science |
Keywords: Legged Robots, Reinforcement Learning, Robust/Adaptive Control
Abstract: Quadrupedal robots deployed for load-carrying applications must maintain stable locomotion across diverse terrains and varying payloads. Traditional approaches like Model Predictive Control (MPC) can handle such variations but often rely on predefined gait schedules and manually tuned trajectory planners, limiting adaptability in unstructured environments. To address this, we propose an adaptive reinforcement learning (RL) framework that enables quadrupedal robots to respond dynamically to terrain and payload changes without relying on contact force measurements or gait designs. The controller consists of a nominal policy that learns general locomotion across terrains and an adaptive policy that outputs corrective actions for handling dynamic variations due to payloads. We validate our approach through extensive simulations in Isaac Gym across payloads (2–10 kg) and terrains including flat ground, slopes, and stairs. Our method achieves higher success rates and lower height-tracking errors while maintaining the Cost of Transport (CoT) comparable to the best-performing baselines and to no-load (NL) operation. Real-world deployment on a Unitree Go1 confirms the approach’s effectiveness under both static and dynamic payload changes, including freely moving masses. The policy also performs well on outdoor terrains such as grass, soil, and staircases. The adaptive policy modulates corrections based on payload changes, improving body stability and tracking without post-deployment fine-tuning.
|
| |
| 09:00-10:30, Paper ThI1I.400 | Add to My Program |
| Multi-Task Visual Perception with Temporal Feature Fusion for Autonomous Driving |
|
| Lin, Huei-Yung | National Taipei University of Technology |
| Wei, Shih-Han | National Taipei University of Technology |
Keywords: Autonomous Vehicle Navigation, Intelligent Transportation Systems, Computer Vision for Transportation
Abstract: With the rapid developments of autonomous driving technologies, accurate scene perception has become essential for safe and efficient navigation. The key perception tasks such as lane detection, semantic segmentation of road markings and road area, and object detection directly impact vehicle decision-making and obstacle avoidance. However, most existing methods are trained on a single-task dataset, limiting data diversity and reducing performance in complex scenarios or under occlusion and illumination variation. In this work, we propose a multi-task perception network based on image sequence input, integrating lane detection, road marking and road area segmentation, and object detection into a unified framework. The network model employs multi-task learning to share features and improve the computational efficiency, and adopts the cross-dataset training paradigm to enhance generalization across tasks. Furthermore, the temporal information from adjacent frames is leveraged to compensate for visual degradation in current frames. Experimental results obtained on multiple datasets demonstrate that the proposed technique achieves competitive performance compared to state-of-the-art approaches. Code is available at https://github.com/(removed_for_review).
|
| |
| 09:00-10:30, Paper ThI1I.401 | Add to My Program |
| Design and Actuation of a Multipole Ring Magnet for Navigating Endovascular Magnetic Instruments |
|
| Raub, Julian | EPFL |
| Pancaldi, Lucio | EPFL |
| Von Deschwanden, Loïc Jonathan | EPFL |
| Sakar, Mahmut Selman | EPFL |
Keywords: Micro/Nano Robots, Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems
Abstract: This letter presents the design and analysis of a compact magnetic field generator for robotic navigation of magnetic endovascular instruments in clinical settings. The system features eight symmetrically arranged permanent magnets in a ring configuration, maximizing magnetic field uniformity and aperture size for X-ray transmission while minimizing spatial footprint and magnet motion for uninterrupted fluoroscopy imaging. The rationale behind the design of the system is explained through analytical considerations of magnetic field distribution. In vitro demonstrations inside perfused biomimetic phantoms confirm the system’s capability for 3D steering of flow-driven magnetic microcatheters, opening a path for pre-clinical testing.
|
| |
| 09:00-10:30, Paper ThI1I.402 | Add to My Program |
| A General Approach to Path Planning within C-Space Safe Corridors |
|
| Abbati, Riccardo | University of Parma |
| Riboli, Marco | University of Applied Sciences of Southern Switzerland, SUPSI |
| Rosselli, Michel | University of Applied Sciences of Southern Switzerland, SUPSI |
| Tasora, Alessandro | Universita Di Parma |
| Silvestri, Marco | University of Parma, Italy |
Keywords: Motion and Path Planning, Collision Avoidance, Optimization and Optimal Control
Abstract: This letter presents a novel strategy for collision-free path planning in robotic manipulators. The method operates in two stages: first, a sampling-based exploration of the configuration space is performed to construct a safe corridor composed of axis-aligned bounding boxes. Within this corridor, an optimisation-based trajectory generation phase addresses the channel problem by computing smooth joint trajectories as non-rational splines. A multi-objective cost function is minimised to reduce geometric acceleration along the path, while also maximising the distance from obstacles to improve safety margins. The proposed algorithm is general and applicable to a wide range of kinematic structures, and supports user-defined path degree and geometric continuity. Simulation results demonstrate superior performance compared to existing methods, and experimental validation further confirms its practical effectiveness. Our implementation is open-sourced and available on GitHub.
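A corridor of axis-aligned bounding boxes admits a very simple membership test, which is the basic feasibility check behind such a two-stage method; the sketch below is an assumption-level illustration, not the authors' implementation.

```python
# Sketch: check that sampled configurations stay inside a C-space safe corridor
# represented as a union of axis-aligned boxes (lo, hi).
import numpy as np

def in_corridor(q, boxes):
    return any(np.all(q >= lo) and np.all(q <= hi) for lo, hi in boxes)

def path_in_corridor(q_samples, boxes):
    # A discretized spline is accepted if every sample lies in some box.
    return all(in_corridor(q, boxes) for q in q_samples)
```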
|
| |
| 09:00-10:30, Paper ThI1I.403 | Add to My Program |
| OverlapMamba: A Shift State Space Model for LiDAR-Based Place Recognition |
|
| Luo, Jiehao | South China Normal University |
| Cheng, Jintao | South China Normal University |
| Xiang, Qiuchi | South China Normal University |
| Wu, Jin | University of Science and Technology Beijing |
| Fan, Rui | Tongji University |
| Chen, Xieyuanli | National University of Defense Technology |
| Tang, Xiaoyu | South China Normal University |
Keywords: Localization, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: Place recognition is the foundation for autonomous systems to achieve independent decision-making and secure operation. It is also crucial in tasks such as loop closure detection and global localization in Simultaneous Localization and Mapping (SLAM) technology. Existing LiDAR-based place recognition (LPR) methods use raw point cloud representations or multifarious point cloud representations as inputs, and employ convolutional neural networks or transformer architectures. However, the recently proposed Mamba deep learning model combined with State Space Models (SSMs) has enormous potential in long sequence modeling. Therefore, we have developed a novel place recognition network, OverlapMamba, which represents input range views (RVs) as sequences. In a novel way, we use a stochastic reconstruction method to establish shifted state space models to compress the visual representation. Extensive experiments on three public datasets demonstrate that OverlapMamba achieves competitive performance with real-time inference speed, effectively detecting loop closures even when traversing previously visited locations from different directions, indicating its strong place recognition ability and real-time efficiency.
|
| |
| 09:00-10:30, Paper ThI1I.404 | Add to My Program |
| Gesture-Controlled Aerial Robot Formation for Human-Swarm Interaction in Safety Monitoring Applications |
|
| Kratky, Vit | Czech Technical University in Prague |
| Silano, Giuseppe | Ceske vysoke uceni technicke v Praze, FEL |
| Vrba, Matous | Faculty of Electrical Engineering, Czech Technical University in Prague |
| Papaioannidis, Christos | Department of Informatics, Aristotle University of Thessaloniki |
| Mademlis, Ioannis | Aristotle University of Thessaloniki |
| Penicka, Robert | Czech Technical University in Prague |
| Pitas, Ioannis | Aristotle University of Thessaloniki |
| Saska, Martin | Czech Technical University in Prague |
|
|
| |
| 09:00-10:30, Paper ThI1I.405 | Add to My Program |
| Predictive Admittance Control for Aerial Physical Interaction |
|
| Alharbat, Ayham | Saxion University of Applied Sciences/University of Twente |
| Gabellieri, Chiara | University of Twente |
| Mersha, Abeje Yenehun | Saxion University of Applied Science |
| Franchi, Antonio | University of Twente / Sapienza University of Rome |
Keywords: Aerial Systems: Mechanics and Control, Compliance and Impedance Control, Optimization and Optimal Control
Abstract: This paper introduces a novel approach for controlling aerial robots during physical interaction by integrating Admittance Control with Nonlinear Model Predictive Control (NMPC). Unlike existing methods, our technique incorporates the desired impedance dynamics directly into the NMPC prediction model, alongside the robot’s dynamics. This allows for the explicit prediction of how the robot’s impedance will respond to interaction forces within the prediction horizon. Consequently, our controller effectively tracks the desired impedance behavior during physical interaction while seamlessly transitioning to trajectory tracking in free motion, all while consistently respecting actuator constraints. The efficacy of this method is validated through real-time simulations and experiments involving physical interaction tasks with an aerial robot. Our findings demonstrate that, across most scenarios, our method significantly outperforms the state-of-the-art (which does not predict future impedance state), achieving a reduction in tracking error of up to 90%. Furthermore, the results indicate that our approach enables smoother and safer physical interaction, characterized by reduced oscillations and the absence of the unstable behavior observed with the state-of-the-art method in certain situations.
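Embedding the desired impedance in the prediction model amounts to rolling the admittance dynamics M x_dd + D x_d + K x = F forward over the horizon; the sketch below shows a 1-DoF explicit-Euler version with illustrative gains, not the authors' NMPC formulation.

```python
# Sketch: propagate desired 1-DoF admittance dynamics over a prediction horizon.
import numpy as np

def predict_admittance(x, xd, f_ext_seq, M, D, K, dt):
    traj = []
    for f in f_ext_seq:                    # assumed external-force sequence
        xdd = (f - D * xd - K * x) / M     # M x_dd + D x_d + K x = f
        xd += dt * xdd
        x += dt * xd
        traj.append(x)
    return np.array(traj)                  # predicted impedance state trajectory
```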
|
| |
| 09:00-10:30, Paper ThI1I.406 | Add to My Program |
| Continuum Robot Segments with High Output Stiffness Via Diagonal Backbones |
|
| Eisenhauer, Ethan | University of Tennessee - Knoxville |
| Gaston, Joshua | The University of Tennessee, Knoxville |
| Milam, Eli | University of Tennessee |
| Rucker, Caleb | University of Tennessee |
Keywords: Actuation and Joint Mechanisms, Soft Sensors and Actuators, Modeling, Control, and Learning for Soft Robots
Abstract: Continuum robots offer unique advantages for applications such as minimally invasive surgery, navigation through confined environments, and safe human-robot interaction. However, while most continuum robot segments are designed to exhibit constant curvature over their length, they passively deform into a non-constant curvature s-shape when holding payloads at the tip, and their dynamic movement is often subject to unwanted vibration of the passive non-constant curvature modes. In this paper, we propose a simple solution that dramatically mitigates these issues: a continuum robot segment design that utilizes a diagonal backbone and flexible push-pull actuation rods. This simple modification to common continuum-robot construction enables us to eliminate the passive s-shaped mode, creating a bending segment that can handle large loads without significant deformation or vibration while requiring no more actuation force than conventional designs. We show that a modified version of 1-DOF constant-curvature kinematics accurately describes the structure when actuator translations are equal and opposite. We also develop and validate a 2-DOF model that predicts tip position and orientation resulting from more general actuation inputs. The models and increased output stiffness were verified experimentally and the concept was demonstrated on a multi-segment robot following a 3D trajectory with minimal disturbance from added loads.
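For reference, the 1-DOF constant-curvature mapping mentioned above has the usual closed form; this is the textbook planar model, with the diagonal-backbone specifics omitted.

```python
# Sketch: planar constant-curvature kinematics for a segment of arc length L.
import numpy as np

def cc_tip_pose(kappa, L):
    if abs(kappa) < 1e-9:
        return np.array([L, 0.0]), 0.0                 # straight configuration
    x = np.sin(kappa * L) / kappa
    y = (1.0 - np.cos(kappa * L)) / kappa
    return np.array([x, y]), kappa * L                 # tip position, tangent angle
```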
|
| |
| 09:00-10:30, Paper ThI1I.407 | Add to My Program |
| VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation |
|
| Chen, Yixiang | Institute of Automation, Chinese Academy of Science |
| Huang, Yan | Institute of Automation, Chinese Academy of Sciences |
| He, Keji | Shandong University |
| Li, Peiyan | Institute of Automation, Chinese Academy of Sciences |
| Wang, Liang | Institute of Automation, Chinese Academy of Sciences |
Keywords: Deep Learning for Visual Perception, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Efficient Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual optimal view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both the simulation benchmark RLBench and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving a 1.89x speedup in training and a 1.54x speedup in inference.
|
| |
| 09:00-10:30, Paper ThI1I.408 | Add to My Program |
| From Composable Models to Correct-By-Construction Software for Contact-Rich Robotic Mobile-Manipulation Tasks |
|
| Schneider, Sven | Hochschule Bonn-Rhein-Sieg |
| Kalagaturu, Vamsi Krishna | University of Bremen |
| Bruyninckx, Herman | KU Leuven |
| Hochgeschwender, Nico | University of Bremen |
Keywords: Software Tools for Robot Programming, Methods and Tools for Robot System Design, Mobile Manipulation
Abstract: Software frameworks like the Stack of Tasks (SoT), the Stanford Whole-Body Control (WBC) library, or the instantaneous Task Specification using Constraints (iTaSC) have enabled robots to perform advanced, contact-oriented manipulation tasks. jgeom constr and eTaSL are among the few formal, computer-interpretable languages that allow users to specify such tasks independent of these frameworks. We analyse these languages for their limitations with respect to composability, the design for extensibility without having to change existing models, and compositionality, meaning that the semantics of compositions unambiguously follows from the semantics of the components and of the composition relations. To overcome these limitations we design a graph-structured and well-defined interchange format for such tasks. The associated tooling enables us to generate correct-by-construction code that adheres to predefined rules and constraints. We showcase our models and toolchain by incrementally constructing a workspace-alignment application for a highly-redundant mobile platform that is equipped with two 7-DoF, torque-controlled manipulators.
|
| |
| 09:00-10:30, Paper ThI1I.409 | Add to My Program |
| Robot Arm Self-Calibration Using RGB-D Camera |
|
| Lee, Jiyong | Korea Institute of Science and Technology |
| Kim, KangGeon | Korea Institute of Science and Technology |
| You, Bum-Jae | Korea Institute of Science and Technology |
Keywords: Calibration and Identification, Computer Vision for Automation
Abstract: Kinematic and hand-eye calibration of robotic arms is a critical research area in robotics, essential to ensuring the accuracy of manipulation tasks. The widely adopted methods for robotic arm calibration typically rely on specialized markers or external sensors to achieve precise measurements. However, these approaches are often expensive and require additional effort, such as the installation and maintenance of auxiliary equipment. Furthermore, many downstream tasks require separate hand-eye calibration steps because of differences between the sensors used for calibration and those used for task execution. Comprehensive calibration of both the robot arm and sensors plays a vital role in optimizing system performance. However, the robot's posture could be constrained due to either the sensor's limited range or textureless scenes when a camera is used. To address these limitations, this study proposes a cost-effective self-calibration method that simultaneously calibrates the robot arm and determines the spatial relationship between the robot and an RGB-D camera, allowing for data collection at multiple locations. The proposed approach leverages recent advancements in machine learning to identify correspondences between images captured at different robot postures, facilitating automatic data selection. Furthermore, the removal of location constraints increases flexibility, enabling the collection of sufficient data as the robot's location changes. The method is evaluated using a Franka Emika Panda robotic arm, and the experimental results demonstrate its effectiveness in achieving accurate calibration without the need for external devices or markers.
|
| |
| 09:00-10:30, Paper ThI1I.410 | Add to My Program |
| Above and Below: Heterogeneous Multi-Robot SLAM across Surface and Underwater Domains |
|
| McConnell, John | United States Naval Academy |
| Shariati, Armon | University of Pennsylvania |
| Szenher, Paul | Stevens Institute of Technology |
| Li, Yaxuan | Stevens Institute of Technology |
Keywords: Marine Robotics, SLAM, Range Sensing
Abstract: Multi-robot simultaneous localization and mapping (SLAM) is a fundamental task in multi-robot operations. Robots must have a common understanding of their location and that of their team members to complete coordinated actions. However, multi-robot SLAM between Uncrewed Surface Vessels (USVs) and Autonomous Underwater Vehicles (AUVs) has primarily been achieved through acoustic pinging between robots to retrieve range measurements, a technique that requires robots to be in similar locations simultaneously and to have an uninterrupted path for signal propagation, and that may necessitate synchronized clocks. This is especially challenging in complex, cluttered maritime environments, where structures may impede signals. However, these same structures may be observable above and below the water's surface, presenting an opportunity for inter-robot SLAM loop closure between USV and AUV data streams. This work builds upon recent research on inter-robot SLAM loop closure between USV and AUV data [msm], extending it to propose a centralized multi-robot SLAM system. Each robot performs its own state estimation, and we detect loop closures between each AUV and the USV data. These inter-robot loop closures are used to merge each robot's state estimate into a centralized graph, yielding estimates for the whole time history of the USV and all AUVs in the system. Validation is performed using real-world perceptual data in three different environments. Results show reduced errors for AUVs in the multi-robot SLAM system compared to single-robot SLAM over the same trajectories. To our knowledge, this is the first instance of a multi-robot SLAM system with AUVs and USVs built on loop closures rather than acoustic distance measurements.
|
| |
| 09:00-10:30, Paper ThI1I.411 | Add to My Program |
| Reliable and Fast Humans Removed Visual Scene Representation |
|
| Iscan, Serhat | Bogazici University |
| Bozma, H. Isil | Bogazici University |
Keywords: RGB-D Perception, Recognition, Human-Centered Robotics
Abstract: This paper introduces a reliable and fast method for scene representation from a single RGB frame, even with human occlusion. Our goal is to enhance vision-based spatial reasoning in dynamic environments where human presence varies over time. Once humans are detected, the method addresses two key challenges: estimating the level of visual obstruction and generating a scene descriptor with humans removed. The first is handled via a novel visual obstruction measure that prevents descriptor generation under high occlusion. The second is addressed by adapting the previously presented bubble descriptor so that surface regions corresponding to detected humans are deformed using a modified spherical interpolation method—eliminating the need for inpainting or reconstruction and enabling rapid computation. We validate our approach through extensive comparisons across multiple datasets, including two new datasets collected using both stationary and mobile robots. Results show comparable representation quality with a 14-44× reduction in computation time.
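One plausible form of such an obstruction measure is simply the fraction of pixels covered by detected humans, gated by a threshold; the ratio and the 0.3 cutoff below are assumptions for illustration, not the paper's exact measure.

```python
# Sketch: gate descriptor generation on a human-occlusion ratio.
import numpy as np

def obstruction_ratio(human_mask):
    """human_mask: HxW boolean array from a person detector/segmenter."""
    return float(human_mask.sum()) / human_mask.size

def should_generate_descriptor(human_mask, max_obstruction=0.3):
    return obstruction_ratio(human_mask) <= max_obstruction
```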
|
| |
| 09:00-10:30, Paper ThI1I.412 | Add to My Program |
| Scalable Multi-Agent Reinforcement Learning Framework for Multi-Machine Tending |
|
| Abdalwhab, Abdalwhab Bakheet Mohamed | École De Technologie Supérieure |
| Beltrame, Giovanni | Ecole Polytechnique De Montreal |
| St-Onge, David | Ecole De Technologie Superieure |
Keywords: AI and Machine Learning in Manufacturing and Logistics Systems, Path Planning for Multiple Mobile Robots or Agents, Integrated Planning and Control
Abstract: Robotic manipulators hold significant untapped potential for manufacturing industries, particularly when deployed in multi-robot configurations that can enhance resource utilization, increase throughput, and reduce costs. However, industrial manipulators typically operate in isolated one-robot, one-machine setups, limiting both utilization and scalability. Even mobile robot implementations generally rely on centralized architectures, creating vulnerability to single points of failure and requiring robust communication infrastructure. This paper introduces SMAPPO (Scalable Multi-Agent Proximal Policy Optimization), a scalable input-size invariant multi-agent reinforcement learning model for decentralized multi-robot management in industrial environments. MAPPO (Multi-Agent Proximal Policy Optimization) represents the current state-of-the-art approach. We optimized an existing simulator to handle complex multi-agent reinforcement learning scenarios and designed a new multi-machine tending scenario for evaluation. Our novel observation encoder enables SMAPPO to handle varying numbers of agents, machines, and storage areas with minimal or no retraining. Results demonstrate SMAPPO’s superior performance compared to the state-of-the-art MAPPO across multiple conditions: full retraining (up to 61% improvement), curriculum learning (up to 45% increased productivity and up to 49% fewer collisions), zero-shot generalization to significantly different scale scenarios (up to 272% better performance without retraining), and adaptability under extremely low initial training (up to 100% increase in parts delivery).
|
| |
| 09:00-10:30, Paper ThI1I.413 | Add to My Program |
| A Hierarchical Framework for Real-Time Path Planning of Microswarm in Dynamic Environments |
|
| Li, Yamei | The Hong Kong Polytechnic University |
| Ge, Ruijian | Vanderbilt University |
| Zhu, Aoji | The Hong Kong Polytechnic University |
| Zhao, Jiachi | The HONG KONG Polytechnic University |
| Shi, Danjing | The Hong Kong Polytechnic University |
| Sun, Yinghan | The Hong Kong Polytechnic University |
| Li, Yangmin | The Hong Kong Polytechnic University |
| Yang, Lidong | The Hong Kong Polytechnic University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Motion and Path Planning
Abstract: Autonomous navigation of magnetic microswarms in dynamic and unstructured environments is essential for biomedical applications, such as targeted therapy and minimally invasive interventions. However, existing path planning methods struggle to simultaneously achieve real-time adaptability and path smoothness in dynamic obstacle environments. To address this, we propose a hierarchical Dynamic Rapidly-exploring Random Tree Star (D-RRT*) path planning framework that integrates dynamic step size adjustment, local target selection, and local planning that considers microswarms' turning capabilities and energy optimization. Comparative simulations and experiments validate the effectiveness of the proposed planning framework, and results show that it can significantly improve the planning efficiency, path smoothness, and collision avoidance in complex dynamic scenarios.
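A dynamic step-size rule of the kind this framework describes can be sketched as scaling the RRT extension step with local obstacle clearance; `clearance` and the constants are hypothetical helpers, not the authors' D-RRT* implementation.

```python
# Sketch: clearance-scaled step size for an RRT*-style extend operation.
import numpy as np

def dynamic_step(p, clearance, step_min=0.5, step_max=5.0, gain=0.8):
    # Larger steps in open space, smaller steps near obstacles.
    return float(np.clip(gain * clearance(p), step_min, step_max))

def extend(q_near, q_rand, clearance):
    d = q_rand - q_near
    n = np.linalg.norm(d)
    step = dynamic_step(q_near, clearance)
    return q_rand if n <= step else q_near + (step / n) * d
```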
|
| |
| 09:00-10:30, Paper ThI1I.414 | Add to My Program |
| A Versatile Underactuated Robotic Hand for Cloth Manipulation |
|
| Chen, Xinhao | TOHOKU University |
| Petrilli Barceló, Alberto Elías | Tohoku University |
| Shiry Ghidary, Saeed | Amirkabir University of Technology |
| Chen, Dayuan | Tohoku University |
| Salazar Luces, Jose Victorio | Tohoku University |
| Hirata, Yasuhisa | Tohoku University |
Keywords: Industrial Robots, Product Design, Development and Prototyping
Abstract: Traditional cloth manipulation often employs low-DOF grippers to simplify hardware. However, this approach presents significant challenges for algorithms due to the complex and dynamic behavior of fabrics. To address these limitations, we propose a novel approach based on a human-hand-like design that integrates an underactuated grasping mechanism with a suction system. By incorporating multiple single-layer suction principles, the robotic hand achieves greater adaptability and flexibility, enabling it to perform tasks that would typically require high-DOF hands and complex control strategies. This paper outlines the design of the robot hand with a suction system and evaluates its performance in various tasks.
|
| |
| 09:00-10:30, Paper ThI1I.415 | Add to My Program |
| CornerVINS: Accurate Localization and Layout Mapping for Structural Environments Leveraging Hierarchical Geometric Representations |
|
| Zhang, Yidi | University of Chinese Academy of Sciences |
| Tang, Fulin | Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences, China |
| Wu, Yihong | National Laboratory of Pattern Recognition, InstituteofAutomation, Chinese Academy of Sciences |
|
|
| |
| 09:00-10:30, Paper ThI1I.416 | Add to My Program |
| EiGS: Event-Informed 3D Deblur Reconstruction with Gaussian Splatting |
|
| Weng, Yuchen | China University of Mining and Technology |
| Li, Nuo | China University of Mining and Technology |
| Yu, Peng | China University of Mining and Technology |
| Wang, Qi | China University of Mining and Technology |
| Qi, Yongqiang | China University of Mining and Technology |
| You, Shaoze | China University of Mining and Technology |
| Wang, Jun | China University of Mining and Technology |
Keywords: Sensor Fusion, Mapping, Deep Learning Methods
Abstract: Neural Radiance Fields (NeRF) have significantly advanced photorealistic novel view synthesis. Recently, 3D Gaussian Splatting has emerged as a promising technique with faster training and rendering speeds. However, both methods rely heavily on clear images and precise camera poses, limiting performance under motion blur. To address this, we introduce Event-Informed 3D Deblur Reconstruction with Gaussian Splatting (EiGS), a novel approach leveraging event camera data to enhance 3D Gaussian Splatting, improving sharpness and clarity in scenes affected by motion blur. Our method employs an Adaptive Deviation Estimator to learn Gaussian center shifts as the inverse of complex camera jitter, enabling simulation of motion blur during training. A motion consistency loss ensures global coherence in Gaussian displacements, while Blurriness and Event Integration Losses guide the model toward precise 3D representations. Extensive experiments demonstrate superior sharpness and real-time rendering capabilities compared to existing methods, with ablation studies validating the effectiveness of our components in robust, high-quality reconstruction for challenging motion-blurred environments.
|
| |
| 09:00-10:30, Paper ThI1I.417 | Add to My Program |
| Accelerating High-Capacity Ridepooling in Robo-Taxi Systems |
|
| Li, Xinling | Massachusetts Institute of Technology |
| Gammelli, Daniele | Stanford |
| Wallar, Alex | Unaffiliated |
| Zhao, Jinhua | Massachusetts Institute of Technology |
| Zardini, Gioele | Massachusetts Institute of Technology |
Keywords: Intelligent Transportation Systems, Multi-Robot Systems, Networked Robots
Abstract: Rapid urbanization has increased demand for customized urban mobility, making on-demand services and robo-taxis central to future transportation. The efficiency of these systems hinges on real-time fleet coordination algorithms. This work accelerates the state-of-the-art high-capacity ridepooling framework by identifying its computational bottlenecks and introducing two complementary strategies: (i) a data-driven feasibility predictor that filters low-potential trips, and (ii) a graph-partitioning scheme that enables parallelizable trip generation. Using real-world Manhattan demand data, we show that the acceleration algorithms reduce the optimality gap by up to 27% under real-time constraints and cut empty travel time by up to 5%. These improvements translate into tangible economic and environmental benefits, advancing the scalability of high-capacity robo-taxi operations in dense urban settings.
|
| |
| 09:00-10:30, Paper ThI1I.418 | Add to My Program |
| DRO-EDL-MPC: Evidential Deep Learning-Based Distributionally Robust Model Predictive Control for Safe Autonomous Driving |
|
| Ham, Hyeongchan | KAIST |
| Ahn, Heejin | KAIST |
Keywords: Planning under Uncertainty, Robot Safety, Machine Learning for Robot Control
Abstract: Safety is a critical concern in motion planning for autonomous vehicles. Modern autonomous vehicles rely on neural network-based perception, but making control decisions based on these inference results poses significant safety risks due to inherent uncertainties. To address this challenge, we present a distributionally robust optimization (DRO) framework that accounts for both aleatoric and epistemic perception uncertainties using evidential deep learning (EDL). Our approach introduces a novel ambiguity set formulation based on evidential distributions that dynamically adjusts the conservativeness according to perception confidence levels. We integrate this uncertainty-aware constraint into model predictive control (MPC), proposing the DRO-EDL-MPC algorithm with computational tractability for autonomous driving applications. Validation in the CARLA simulator demonstrates that our approach maintains efficiency under high perception confidence while enforcing conservative constraints under low confidence.
|
| |
| 09:00-10:30, Paper ThI1I.419 | Add to My Program |
| Safe Vector Field for Robot Navigation in N-Dimensions |
|
| Dias Nunes, Arthur Henrique | Universidade Federal De Minas Gerais |
| Gonçalves, Vinicius Mariano | Federal University of Minas Gerais, UFMG, Brazil |
| Pimenta, Luciano | Universidade Federal De Minas Gerais |
Keywords: Collision Avoidance, Motion and Path Planning, Motion Control
Abstract: In this work, we propose a novel artificial vector field for robot navigation in n-dimensional path-following tasks, designed to ensure safety and convergence with a smoothed control law. Unlike previous methods based on discontinuous Euclidean distance functions, our approach uses a smooth Euclidean-like function to achieve a continuous control law formulation and a field combination to balance the objectives of avoiding obstacles and following the path. This results in a navigation method that follows a target path while preventing robots from approaching obstacles, which can be used in different applications. We provide formal proofs for safety using barrier function concepts and for path convergence via Lyapunov theory. The methodology is validated through extensive numerical simulations and real-world experiments, including extensions to more complex cases, such as quadcopters and multi-robot systems, which underline the method's advantages in achieving safe and reliable robot navigation.
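A common way to realize such a combined field is to blend a convergence term with a tangent term through a smooth function of the distance to the path; the sketch below uses a tanh blend as an assumption, not the authors' exact construction.

```python
# Sketch: smooth blend of path-convergence and path-following components.
import numpy as np

def vector_field(x, dist, grad_dist, tangent, k=1.0):
    """dist, grad_dist: smooth distance to the path and its gradient at x;
    tangent: unit tangent of the path at the closest point."""
    g = -np.tanh(k * dist) * grad_dist                             # converge
    h = np.sqrt(max(0.0, 1.0 - np.tanh(k * dist) ** 2)) * tangent  # follow
    v = g + h
    return v / (np.linalg.norm(v) + 1e-9)                          # unit field
```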
|
| |
| 09:00-10:30, Paper ThI1I.420 | Add to My Program |
| LIO-HKDT: Fast and Accurate LiDAR-Inertial Odometry with Hash K-D Tree |
|
| Mu, Yuexin | Chongqing University |
| Ren, Ao | Chongqing University |
| Liu, Duo | Chongqing University |
| Wang, Murong | Chongqing University |
| Zhang, Zihao | Chongqing University |
| Lu, Haojie | Chongqing University |
| Tan, Yujuan | Chongqing University |
| Zhong, Kan | Chongqing University |
Keywords: SLAM, Localization, Range Sensing
Abstract: LiDAR-inertial odometry (LIO) has been widely applied in intelligent robotics and autonomous driving, providing high-precision and low-latency ego-motion estimation. However, the massive point clouds generated by LiDAR introduce intensive data processing demands, making k-nearest neighbor (KNN) search and map update a critical bottleneck that limits the real-time performance of the LIO system. This paper proposes a novel data structure, the hkd-Tree, which uses hashed voxel indices as keys and local k-d trees as values. It combines the localized search advantages of voxel-based methods with the efficient search capability of k-d trees, enabling fast KNN search and point cloud insertion. To further improve the performance of the hkd-Tree, we propose a voxel distribution mechanism and a buffered update strategy, where each new point is assigned to neighboring voxels within the search radius and inserted into local k-d trees via parallel batch updates. We develop a LiDAR-inertial odometry system, LIO-HKDT, based on the proposed hkd-Tree. Extensive experiments demonstrate that the hkd-Tree enables highly efficient point cloud search and insertion. LIO-HKDT achieves comparable accuracy to state-of-the-art LIO systems while significantly improving runtime efficiency.
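The hashed-voxel-to-local-tree layout can be sketched with a dictionary and SciPy k-d trees; this toy version rebuilds the local tree per query and omits the paper's voxel distribution mechanism and buffered parallel updates.

```python
# Sketch: hash map from voxel keys to point buckets, with local k-d tree KNN.
import numpy as np
from scipy.spatial import cKDTree

class HashKDTree:
    def __init__(self, voxel_size=1.0):
        self.vs = voxel_size
        self.buckets = {}                               # voxel key -> points

    def _key(self, p):
        return tuple(np.floor(np.asarray(p) / self.vs).astype(int))

    def insert(self, points):
        for p in points:
            self.buckets.setdefault(self._key(p), []).append(np.asarray(p))

    def knn(self, q, k=5):
        kx, ky, kz = self._key(q)                       # 3D points assumed
        cand = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    cand.extend(self.buckets.get((kx + dx, ky + dy, kz + dz), []))
        if not cand:
            return np.empty((0, 3))
        tree = cKDTree(np.asarray(cand))                # local k-d tree
        _, idx = tree.query(q, k=min(k, len(cand)))
        return np.asarray(cand)[np.atleast_1d(idx)]
```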
|
| |
| 09:00-10:30, Paper ThI1I.421 | Add to My Program |
| OMCL: Open-Vocabulary Monte Carlo Localization |
|
| Kruzhkov, Evgenii | University of Bonn |
| Memmesheimer, Raphael | University of Bonn |
| Behnke, Sven | University of Bonn |
Keywords: Localization, Semantic Scene Understanding, Mapping
Abstract: Robust robot localization is an important prerequisite for navigation, but it becomes challenging when the map and robot measurements are obtained from different sensors. Prior methods are often tailored to specific environments, relying on closed-set semantics or fine-tuned features. In this work, we extend Monte Carlo Localization with vision-language features, allowing OMCL to robustly compute the likelihood of visual observations given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. These open-vocabulary features enable us to associate observations and map elements from different modalities, and to natively initialize global localization through natural language descriptions of nearby objects. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.
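The observation-likelihood step can be sketched as cosine similarity between vision-language features; `render_map_feature` (the map feature rendered at a particle pose) and the softmax temperature are hypothetical stand-ins.

```python
# Sketch: reweight MCL particles with vision-language feature similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def reweight(particles, obs_feat, render_map_feature, temp=0.1):
    sims = np.array([cosine(obs_feat, render_map_feature(p)) for p in particles])
    w = np.exp(sims / temp)                 # softmax-style likelihood
    return w / w.sum()
```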
|
| |
| 09:00-10:30, Paper ThI1I.422 | Add to My Program |
| Harnessing Robotics for EU Forest Habitats Monitoring |
|
| Tolomei, Simone | University of Pisa |
| Di Lorenzo, Giovanni | University of Pisa |
| Angelini, Franco | University of Pisa |
| de Simone, Leopoldo | Siena University |
| Fanfarillo, Emanuele | Siena University |
| Fiaschi, Tiberio | Siena University |
| Cannucci, Silvia | University of Siena |
| Maccherini, Simona | Siena University |
| Remagnino, Paolo | Kingston University, London, UK |
| Angiolini, Claudia | Università Di Siena |
| Garabini, Manolo | Università Di Pisa |
Keywords: Environment Monitoring and Management, Legged Robots, Field Robots
Abstract: This paper presents a novel approach to forest habitat monitoring using robotics and advanced data analysis techniques. We introduce a quadrupedal robot with LiDAR and onboard cameras to collect detailed data about forest structure and composition. The data is then processed using a combination of data analysis techniques and machine learning algorithms to perform a comprehensive dendrometric and floristic survey. Our approach provides an efficient and accurate method for assessing the ecological health of forest ecosystems. This work contributes to the ongoing efforts in habitat conservation and offers a promising tool for future environmental monitoring tasks.
|
| |
| 09:00-10:30, Paper ThI1I.423 | Add to My Program |
| A Sequential Approach for Accurate Parameters Identification of Heavy-Duty Hydraulic Manipulators Ensuring Physical Feasibility |
|
| Huang, Weidi | Zhejiang University |
| Chen, Zhiwei | Zhejiang University |
| Zhang, Fu | Northeastern University |
| Cheng, Min | Chongqing University |
| Ding, Ruqi | East China Jiaotong University |
| Zhang, Junhui | Zhejiang University |
| Xu, Bing | ZheJiang University |
| |
| 09:00-10:30, Paper ThI1I.424 | Add to My Program |
| ModuLoop: Low-Level Code Generation Using Modular Synthesizer and Closed-Loop Debugger for Robotic Control |
|
| Yoon, Gina | Sookmyung Women's University |
| Lee, Sumin | Sookmyung Women's University |
| Sim, Joo Yong | Sookmyung Women's University |
Keywords: AI-Enabled Robotics, Calibration and Identification, Task Planning
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various domains, including code generation and problem solving. However, their application in robotic control—particularly in low-level tasks that require precise manipulation, real-time feedback, and environment-dependent execution—remains limited. To address this challenge, we propose the Closed-Loop Modular Code Synthesizer framework. This framework leverages a pre-trained LLM without any task-specific fine-tuning to perform modular code planning and generation, and iteratively executes the generated code while inserting debugging probes to observe its behavior. This closed-loop structure facilitates systematic debugging and refinement, ultimately producing executable control programs. We apply the proposed framework to the calibration of an RGB-D camera and a robotic arm, validating its effectiveness in real-world settings. Furthermore, through a subsequent pick-and-place task, we demonstrate not only the accuracy of the calibration but also the potential extensibility of the framework. Across both tasks, the framework achieved high execution accuracy and autonomy, illustrating the practicality and scalability of LLM-based robotic control using our framework.
|
| |
| 09:00-10:30, Paper ThI1I.425 | Add to My Program |
| An Anatomy-Aware Shared Control Approach for Assisted Teleoperation of Lung Ultrasound Examinations |
|
| Nardi, Davide | University of Trento |
| Lamon, Edoardo | University of Trento |
| Fontanelli, Daniele | University of Trento |
| Saveriano, Matteo | University of Trento |
| Palopoli, Luigi | University of Trento |
Keywords: Medical Robots and Systems, Physical Human-Robot Interaction, Telerobotics and Teleoperation
Abstract: Although fully autonomous systems still face challenges due to patients' anatomical variability, teleoperated systems appear to be more practical in current healthcare settings. This paper presents an anatomy-aware control framework for teleoperated lung ultrasound. Leveraging biomechanically accurate 3D modelling, the system applies virtual constraints on the ultrasound probe pose and provides real-time visual feedback to assist in precise probe placement tasks. A twofold evaluation, one with 5 naïve operators on a single volunteer and the second with a single experienced operator on 6 volunteers, compared our method with a standard teleoperation baseline. The first evaluation characterised the accuracy of the anatomical model and the improvement in performance perceived by the naïve operators, while the second focused on the efficiency of the system in improving probe placement and reducing procedure time compared to traditional teleoperation. The results demonstrate that the proposed framework enhances the physician's capabilities in executing remote lung ultrasound, reducing execution time by more than 20% on 4-point acquisitions, towards faster, more objective and repeatable exams.
|
| |
| 09:00-10:30, Paper ThI1I.426 | Add to My Program |
| FlowSight: Vision-Based Artificial Lateral Line Sensor for Water Flow Perception |
|
| Zhang, Tiandong | Institute of Automation, Chinese Academy of Sciences |
| Wang, Rui | Institute of Automation, Chinese Academy of Sciences |
| Cao, Qiyuan | Institute of Automation, Chinese Academy of Sciences |
| Cui, Shaowei | Institute of Automation, Chinese Academy of Sciences |
| Zheng, Gang | INRIA |
| Wang, Shuo | Chinese Academy of Sciences |
Keywords: Biomimetics, Biologically-Inspired Robots, Marine Robotics, Water flow vector perception
Abstract: This article presents a novel vision-based artificial lateral line (ALL) sensor, FlowSight, enhancing the perception capabilities of underwater robots. Through an autonomous vision system, FlowSight allows for simultaneously sensing the speed and direction of local water flow without relying on external auxiliary equipment. Inspired by the lateral line neuromast of fish, a flexible bionic tentacle is designed to sense water flow. Deformation and motion characteristics of the tentacle are modeled and analyzed using bidirectional fluid-structure interaction (FSI) simulation. Upon contact with water flow, the tentacle converts water flow information into elastic deformation information, which is captured and processed into an image sequence by the autonomous vision system. Subsequently, a water flow perception method based on deep neural networks is proposed to estimate the flow speed and direction from the captured image sequence. The perception network is trained and tested using data collected from practical experiments conducted in a controllable swim tunnel. Finally, the FlowSight sensor is integrated into the bionic underwater robot RoboDact, and a closed-loop motion control experiment based on water flow perception is conducted. Experiments conducted in the swim tunnel and water pool demonstrate the feasibility and effectiveness of the FlowSight sensor and the water flow perception method.
|
| |
| 09:00-10:30, Paper ThI1I.427 | Add to My Program |
| CoCoPlan: Adaptive Coordination and Communication for Multi-Robot Systems in Dynamic and Unknown Environments |
|
| Zhang, Xintong | Duke Kunshan University |
| Chen, Junfeng | Peking University |
| Zhu, Yuxiao | Duke Kunshan University |
| Luo, Bing | Duke Kunshan University |
| Guo, Meng | Peking University |
Keywords: Multi-Robot Systems, Task and Motion Planning, Cooperating Robots
Abstract: Multi-robot systems can greatly enhance efficiency through coordination and collaboration, yet in practice, full-time communication is rarely available and interactions are constrained to close-range exchanges. Existing methods either maintain all-time connectivity, rely on fixed schedules, or adopt pairwise protocols, but none adapt effectively to dynamic spatio-temporal task distributions under limited communication, resulting in suboptimal coordination. To address this gap, we propose CoCoPlan, a unified framework that co-optimizes collaborative task planning and team-wise intermittent communication. Our approach integrates a branch-and-bound architecture that jointly encodes task assignments and communication events, an adaptive objective function that balances task efficiency against communication latency, and a communication event optimization module that strategically determines when, where, and how global connectivity should be re-established. Extensive experiments demonstrate that CoCoPlan outperforms state-of-the-art methods by achieving a 22.4% higher task completion rate, reducing communication overhead by 58.6%, and improving scalability by supporting up to 100 robots in dynamic environments. Hardware experiments include a complex 2D office environment and a large-scale 3D disaster-response scenario.
|
| |
| 09:00-10:30, Paper ThI1I.428 | Add to My Program |
| Enhanced Probabilistic Collision Detection for Motion Planning under Sensing Uncertainty |
|
| Wang, Xiaoli | National University of Singapore |
| Ruan, Sipu | National University of Singapore |
| Meng, Xin | National University of Singapore |
| Chirikjian, Gregory | University of Delaware |
Keywords: Collision Avoidance, Planning under Uncertainty, Motion and Path Planning
Abstract: Probabilistic collision detection (PCD) is essential in motion planning for robots operating in unstructured environments, where considering sensing uncertainty helps prevent damage. Existing PCD methods mainly used simplified geometric models and addressed only position estimation errors. This paper presents an enhanced PCD method with two key advancements: (a) using superquadrics for more accurate shape approximation and (b) accounting for both position and orientation estimation errors to improve robustness under sensing uncertainty. Our method first computes an enlarged surface for each object that encapsulates its observed rotated copies, thereby addressing the orientation estimation errors. Then, the collision probability is formulated as a chance constraint problem that is solved with a tight upper bound. Both steps leverage the recently developed normal parameterization of superquadric surfaces. Results show that our PCD method is twice as close to the Monte-Carlo sampled baseline as the best existing PCD method and reduces path length by 30% and planning time by 37%, respectively. A Real2Sim2Real pipeline further validates the importance of considering orientation estimation errors, showing that the collision probability of executing the planned path is only 2%, compared to 9% and 29% when considering only position estimation errors or none at all.
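For reference, the standard superquadric inside-outside function used in such shape approximations is shown below (F < 1 inside, F > 1 outside); the paper's enlarged surfaces and chance-constraint bound are not reproduced.

```python
# Sketch: superquadric inside-outside function in the body frame.
import numpy as np

def superquadric_F(p, a, eps1, eps2):
    """p: 3D point in the superquadric frame; a: semi-axes (a1, a2, a3)."""
    x, y, z = np.abs(np.asarray(p)) / np.asarray(a)
    return ((x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
            + z ** (2.0 / eps1))
```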
|
| |
| 09:00-10:30, Paper ThI1I.429 | Add to My Program |
| Sensor Query Schedule and Sensor Noise Covariances for Accuracy-Constrained Trajectory Estimation |
|
| Goudar, Abhishek | University of Toronto |
| Schoellig, Angela P. | TU Munich |
Keywords: Localization, Probability and Statistical Methods
Abstract: Trajectory estimation involves determining the trajectory of a mobile robot by combining prior knowledge about its dynamic model with noisy observations of its state obtained using sensors. The accuracy of such a procedure is dictated by the system model fidelity and the sensor parameters, such as the accuracy of the sensor (as represented by its noise covariance) and the rate at which it can generate observations, referred to as the sensor query schedule. Intuitively, high-rate measurements from accurate sensors lead to accurate trajectory estimation. However, cost and resource constraints limit the sensor accuracy and its measurement rate. Our work’s novel contribution is the estimation of sensor schedules and sensor covariances necessary to achieve a specific estimation accuracy. Concretely, we focus on estimating: (i) the rate or schedule with which a sensor of known covariance must generate measurements to achieve specific estimation accuracy, and alternatively, (ii) the sensor covariance necessary to achieve specific estimation accuracy for a given sensor update rate. We formulate the problem of estimating these sensor parameters as semidefinite programs, which can be solved by off-the-shelf solvers. We validate our approach in simulation and real experiments by showing that the sensor schedules and the sensor covariances calculated using our proposed method achieve the desired trajectory estimation accuracy. Our method also identifies scenarios where certain estimation accuracy is unachievable with the given system and sensor characteristics.
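In place of the paper's semidefinite programs, the relationship between schedule and accuracy can be illustrated by iterating the Kalman covariance recursion and searching over rates; the 1-in-n schedule approximated by inflating R is a crude assumption for illustration only.

```python
# Sketch: search measurement schedules against a steady-state accuracy target.
import numpy as np

def steady_state_cov(A, C, Q, R, n_iter=500):
    P = Q.copy()
    for _ in range(n_iter):                     # iterate predict/update Riccati
        P = A @ P @ A.T + Q
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        P = (np.eye(P.shape[0]) - K @ C) @ P
    return P

def min_rate_for_accuracy(A, C, Q, R, target_trace, divisors):
    for n in sorted(divisors, reverse=True):    # sparsest schedule first
        P = steady_state_cov(A, C, Q, R * n)    # crude 1-in-n schedule model
        if np.trace(P) <= target_trace:
            return n, P
    return None, None
```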
|
| |
| 09:00-10:30, Paper ThI1I.430 | Add to My Program |
| Nezha-X: A Self-Foldable HAUV That Can Launch from a Tube |
|
| Wang, Dongping | Shanghai Jiao Tong University |
| Zhang, Ziyang | Shanghai Jiao Tong University |
| Zeng, Zheng | Shanghai Jiao Tong University |
| Lian, Lian | Shanghai Jiaotong University |
Keywords: Aerial Systems: Applications, Marine Robotics, Field Robots
Abstract: Hybrid aerial underwater vehicles (HAUVs) are developing rapidly with the urgent need for joint air-sea observation missions. This paper proposes a novel HAUV that combines a folding wing mechanism and an underwater thrust system with a centralized tail in an inverted triangle configuration. In addition to ensuring underwater and aerial maneuverability, the design’s overall streamlined structure minimizes the drag of underwater movement and is more suitable for working in confined spaces. The hydrodynamic performance of the system was evaluated using computational fluid dynamics (CFD) simulation. The results indicate that the folding wing design effectively reduces underwater motion drag by 41.9%. Additionally, the centralized underwater thrust system located at the tail generates sufficient torque to ensure the underwater maneuverability of the HAUV. Field experiments further validate the vehicle’s capability to operate in confined environments, execute complex underwater missions, and maintain stable aerial flight. This study provides valuable insights into the drag reduction of HAUV folding wings and the optimization of thruster configuration.
|
| |
| 09:00-10:30, Paper ThI1I.431 | Add to My Program |
| Constrained Articulated Body Algorithms for Closed-Loop Mechanisms |
|
| Sathya, Ajay Suresha | Inria |
| Carpentier, Justin | INRIA |
Keywords: Direct/Inverse Dynamics Formulation, Dynamics, Optimization and Optimal Control, Humanoid Robots
Abstract: Existing recursive rigid body dynamics algorithms with low computational complexity are mostly restricted to kinematic trees with external contact constraints or are sensitive to singular cases (e.g., linearly dependent constraints and kinematic singularities), severely impacting their practical usage in existing simulators. This article introduces two original low-complexity recursive algorithms, the loop-constrained articulated body algorithm (LCABA) and proxBBO, based on a proximal dynamics formulation for forward simulation of mechanisms with loops. These algorithms are derived from first principles using non-serial dynamic programming, exhibit linear complexity in practical scenarios, and are numerically robust to singular cases. They extend the existing constrained articulated body algorithm (constrainedABA) to handle internal loops and the pioneering BBO algorithm from the 1980s to singular cases. Both algorithms have been implemented by leveraging the open-source Pinocchio library, benchmarked in detail, and exhibit state-of-the-art performance for various robot topologies, including over 6x speed-ups compared to existing non-recursive algorithms for high-degree-of-freedom systems with internal loops, such as recent humanoid robots.
|
| |
| 09:00-10:30, Paper ThI1I.432 | Add to My Program |
| An Informative Planning Framework for Target Tracking and Active Mapping in Dynamic Environments with ASVs |
|
| Ramkumar Sudha, Sanjeev Kumar | Norwegian University of Science and Technology |
| Popovic, Marija | TU Delft |
| Coates, Erlend M. | Norwegian University of Science and Technology |
Keywords: Motion and Path Planning, Marine Robotics, Mapping
Abstract: Mobile robot platforms are increasingly being used to automate information-gathering tasks such as environmental monitoring. Efficient target tracking in dynamic environments is critical for applications such as search and rescue and pollutant cleanups. In this letter, we study active mapping of floating targets that drift due to environmental disturbances such as wind and currents. This is a challenging problem as it involves predicting both spatial and temporal variations in the map due to changing conditions. We introduce an integrated framework combining dynamic occupancy grid mapping and an informative planning approach to actively map and track freely drifting targets with an autonomous surface vehicle. A key component of our adaptive planning approach is a spatiotemporal prediction network that predicts target position distributions over time. We further propose a planning objective for target tracking that leverages these predictions. Simulation experiments show that this planning objective improves target tracking performance compared to existing methods that consider only entropy reduction as the planning objective. Finally, we validate our approach in field tests, showcasing its ability to track targets in real-world monitoring scenarios.
|
| |
| 09:00-10:30, Paper ThI1I.433 | Add to My Program |
| Data-Driven Anomaly Detection in Robots Using Matrix Chernoff Bounds |
|
| Dubey, Richa | Indian Institute of Technology, Jodhpur |
| Tripathy, Niladri Sekhar | IIT Jodhpur |
| Shah, Suril Vijaykumar | Indian Institute of Technology Jodhpur |
Keywords: Failure Detection and Recovery, Robot Safety, Probability and Statistical Methods
Abstract: This work proposes a novel data-driven anomaly detection framework for robotic systems, grounded in statistical concentration inequalities. The method leverages the Matrix Chernoff Inequality to establish probabilistic bounds on the eigenvalues of cumulative error covariance matrices computed over a sliding window of robot state deviations. An anomaly is flagged when the eigenvalues, computed in real time, violate these theoretical bounds. The proposed approach is model independent, computationally efficient, and straightforward to implement, requiring only the numerical solution of two transcendental equations to determine the bounds. It further offers design flexibility via tunable parameters such as the confidence level and window size. The effectiveness of the detector is validated through both simulation and hardware experiments across distinct anomaly scenarios for different robots, including input delay, sensor corruption, and external perturbations. A comprehensive performance evaluation is also presented using standard metrics such as Detection Rate, False Positive Rate, Accuracy, and Receiver Operating Characteristics (ROC), along with a method for effective parameter selection and comparison with existing works.
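The runtime side of such a detector is compact: accumulate outer products of state deviations over a sliding window and compare the top eigenvalue to a precomputed bound. In this sketch the bound `lam_max_bound` is assumed to have been obtained offline from the Matrix Chernoff transcendental equations.

```python
# Sketch: sliding-window eigenvalue test for anomaly detection.
import numpy as np
from collections import deque

class EigAnomalyDetector:
    def __init__(self, window, lam_max_bound):
        self.buf = deque(maxlen=window)
        self.bound = lam_max_bound              # from offline Chernoff analysis

    def update(self, err):
        """err: state-deviation vector at the current timestep."""
        self.buf.append(np.outer(err, err))
        S = sum(self.buf)                       # windowed error covariance sum
        lam_max = np.linalg.eigvalsh(S)[-1]     # eigvalsh returns ascending
        return lam_max > self.bound             # True => anomaly flagged
```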
|
| |
| 09:00-10:30, Paper ThI1I.434 | Add to My Program |
| Lunar Rover Cargo Transport: Mission Concept and Field Test (I) |
|
| Krawciw, Alec | University of Toronto |
| Olmedo, Nicolas Alejandro | MDA Space |
| Rehmatullah, Faizan | MDA Space |
| Desjardins-Goulet, Maxime | Centre De Technologies Avancées |
| Toupin, Pascal | Centre De Technologies Avancées |
| Barfoot, Timothy | University of Toronto |
Keywords: Space Robotics and Automation, Field Robots
Abstract: In future operations on the lunar surface, automated vehicles will be required to transport cargo between known locations. Such vehicles must be able to navigate precisely in safe regions to avoid natural hazards, human-constructed infrastructure, and dangerous dark shadows. Rovers must be able to park their cargo autonomously within a small tolerance to achieve a successful pickup and delivery. In this field test, Lidar Teach and Repeat provides an ideal autonomy solution for transporting cargo in this way. A one-tonne path-to-flight rover was driven in a semi-autonomous remote-control mode to create a network of safe paths. Once the route was taught, the rover immediately repeated the entire network of paths autonomously while carrying cargo. The closed-loop performance is accurate enough to align the vehicle to the cargo and pick it up. This field report describes a two-week deployment at the Canadian Space Agency’s Analogue Terrain, culminating in a simulated lunar operation to evaluate the system's capabilities. Successful cargo collection and delivery were demonstrated in harsh environmental conditions.
|
| |
| 09:00-10:30, Paper ThI1I.435 | Add to My Program |
| A Taxonomy for Evaluating Generalist Robot Manipulation Policies |
|
| Gao, Jensen | Stanford University |
| Belkhale, Suneel | Stanford University |
| Dasari, Sudeep | Google DeepMind |
| Balakrishna, Ashwin | Google DeepMind |
| Shah, Dhruv | Google DeepMind |
| Sadigh, Dorsa | Stanford University |
Keywords: Big Data in Robotics and Automation, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Machine learning for robot manipulation promises to unlock generalization to novel tasks and environments. But how should we measure the progress of these policies towards generalization? Evaluating and quantifying generalization is the Wild West of modern robotics, with each work proposing and measuring different types of generalization in their own, often difficult to reproduce settings. In this work, our goal is (1) to outline the forms of generalization we believe are important for robot manipulation in a comprehensive and fine-grained manner, and (2) to provide reproducible guidelines for measuring these notions of generalization. We first propose STAR-Gen, a taxonomy of generalization for robot manipulation structured around visual, semantic, and behavioral generalization. Next, we instantiate STAR-Gen with two case studies on real-world benchmarking: one based on open-source models and the Bridge V2 dataset, and another based on the bimanual ALOHA 2 platform that covers more dexterous and longer horizon tasks. Our case studies reveal many interesting insights: for example, we observe that open-source vision-language-action models often struggle with semantic generalization, despite pre-training on internet-scale language datasets. We provide videos and other supplementary material at our website stargen-taxonomy.github.io.
|
| |
| ThI1LB Late Breaking Results Session, Hall C |
Add to My Program |
| Late Breaking Results 5 |
|
| |
| |
| 09:00-10:30, Paper ThI1LB.1 | Add to My Program |
| Design and Optimization of a Tensioner-Driven Compliant Pulley Mechanism for Supermicrosurgical Robot End-Effectors |
|
| Kim, Min Chul | Korea Institute of Science and Technology |
| Lee, Young Min | SKKU |
| Ihn, Yong Seok | Korea Institute of Science and Technology |
Keywords: Mechanism Design, Tendon/Wire Mechanism, Surgical Robotics: Steerable Catheters/Needles
Abstract: Supermicrosurgery requires exceptionally high-precision manipulation to perform anastomosis of microscopic vessels and nerves, often involving diameters between 0.3 mm and 0.8 mm. To achieve successful outcomes with robotic systems, it is critical to ensure the stable grasping and delicate manipulation of ultra-fine needles and sutures under varying surgical conditions. This paper proposes a spring-based flexible pulley mechanism designed specifically to overcome the fundamental limitations of conventional cable-driven systems, most notably the unpredictable tension fluctuations. Instead of traditional fixed pulleys, the proposed mechanism introduces a tensioner-driven pulley displacement, allowing the system to maintain optimal tension dynamically. We present a comprehensive mathematical framework that includes a non-linear kinematic model and static equilibrium equations to describe the interaction between spring compression and wire tension. To maximize manipulation performance, we performed optimization of the design parameters using MATLAB simulations, focusing on the guaranteed grasping force and the workspace limits defined by the rotation of the motor shaft. Our results demonstrate that the mechanism ensures a stable grasping force even at significant angular displacements, with a specific rotation range of ±40° optimized for surgical needle manipulation. This compliant pulley system effectively resolves geometric complexities and ensures driving symmetry, providing a robu
|
| |
| 09:00-10:30, Paper ThI1LB.2 | Add to My Program |
| Structural Interlocking-Based Weaving Gripper for Enhanced Grasping Performance |
|
| Jun, Yuvin | Korea Institute of Science and Technology |
| Kim, Daehyun | Korea Institute of Science and Technology |
| Jeong, Seokhwan | Mechanical Eng., Sogang University |
| Bae, Joonbum | Korea University |
| Song, Kahye | Korea Institute of Science and Technology |
Keywords: Grasping, Grippers and Other End-Effectors, Soft Robot Materials and Design
Abstract: Robotic grippers have been extensively developed to enable stable and efficient object manipulation across diverse applications. While soft grippers offer high adaptability and safety, their performance remains constrained by an inherent trade-off between flexibility and load-bearing capacity. This study addresses these challenges by proposing a compact weaving gripper that exploits structurally induced interlocking. Additionally, a model is developed to predict and control grasping configurations. The proposed gripper is integrated with a continuum robot, enabling operation in confined environments, and demonstrates applicability across diverse robotic platforms.
|
| |
| 09:00-10:30, Paper ThI1LB.3 | Add to My Program |
| Real-Time Seam Tracking for Robotic Welding Via Registration-Based Deformation Estimation |
|
| Balajepalli, Surag | Path Robotics Inc |
| Baskaran, Amrish | University of Maryland |
| Shounak, Naik | Path Robotics |
| Eubel, Christopher | Path Robotics |
| Meyarian, Abolfazl | Path Robotics |
| Dai, Andong | Path Robotics, Inc |
| Rajendran, Pradeep | University of Southern California |
| Yamane, Katsu | Path Robotics Inc |
| Alex, Trazkovich | Path Robotics |
Keywords: Industrial Robots, Visual Tracking, Deep Learning for Visual Perception
Abstract: Arc welding induces thermal deformation that continuously displaces the seam path during execution, causing the robot to miss the joint on long seams. We present a real-time seam tracking system with three principal contributions: (1) a constrained ICP registration of live leading-laser scans against a prescan point-cloud prior combined with exponentially decaying spatial propagation; (2) a laser-line detection network retrained on 1,000 arc-on images, raising F1 from 0.59 (prescan baseline) to 0.84 on a held-out arc-on test set; and (3) an asynchronous execution architecture for ensuring that smooth joint commands are sent at the robot's control cycle (40 Hz) even with perception delays or interruptions. Internal testing confirms the system remains in the joint on 97 cm-long seams with less than 2 sec cycle-time overhead. Field deployment improved weld quality acceptance rate from 81% to 95-98%.
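For illustration, a minimal sketch of the "exponentially decaying spatial propagation" of ICP-measured seam offsets described above, written in NumPy; the helper name propagate_correction and the decay_len parameter are assumptions, not the paper's formulation:

    import numpy as np

    def propagate_correction(waypoints, arclengths, s0, offset, decay_len=0.05):
        """Blend a seam offset measured at arclength s0 into downstream waypoints.

        waypoints:  (N, 3) planned seam points [m]
        arclengths: (N,) arclength of each waypoint along the seam [m]
        s0:         arclength where the leading-laser ICP observed the offset [m]
        offset:     (3,) measured seam displacement [m]
        decay_len:  e-folding length of the correction [m] (assumed value)
        """
        waypoints = np.asarray(waypoints, dtype=float)
        arclengths = np.asarray(arclengths, dtype=float)
        w = np.exp(-np.maximum(arclengths - s0, 0.0) / decay_len)
        w[arclengths < s0] = 0.0  # leave already-welded points untouched
        return waypoints + w[:, None] * np.asarray(offset, dtype=float)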
|
| |
| 09:00-10:30, Paper ThI1LB.4 | Add to My Program |
| Knowledge-Based Locomotion Policy for Quadruped Robots under Incomplete Terrain Observation |
|
| Kim, Taehyeong | Kyungpook National University |
| Lee, Sangmoon | Kyungpook National University |
Keywords: Legged Robots, Reinforcement Learning, Reactive and Sensor-Based Planning
Abstract: Body-mounted LiDAR sensors suffer from systematic blind spots during stair locomotion, creating a partial observability problem that single-step terrain snapshots cannot resolve. We address this with a recurrent locomotion policy for the Unitree Go2 that builds implicit knowledge of stair geometry through a GRU-based recurrent encoder over point-cloud and proprioceptive inputs, enabling robust stair ascent and descent even under occluded LiDAR conditions. Ablation experiments show that masking the point-cloud input at inference time causes catastrophic failure on stair terrain and severe performance degradation overall, confirming that implicit stair knowledge is a critical cue for step negotiation rather than a merely complementary signal.
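A minimal PyTorch sketch of the kind of GRU-based recurrent encoder described here, fusing point-cloud features with proprioception; all layer sizes, names, and the joint-target output are illustrative assumptions:

    import torch
    import torch.nn as nn

    class RecurrentLocoEncoder(nn.Module):
        """Fuses per-step point-cloud features and proprioception with a GRU
        so the policy can carry implicit terrain memory across blind spots."""
        def __init__(self, pc_dim=128, prop_dim=48, hidden=256, n_joints=12):
            super().__init__()
            self.gru = nn.GRU(pc_dim + prop_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_joints)  # e.g. Go2 joint targets

        def forward(self, pc_feat, prop, h=None):
            x = torch.cat([pc_feat, prop], dim=-1)   # (B, T, pc_dim + prop_dim)
            out, h = self.gru(x, h)                  # hidden state carries memory
            return self.head(out), h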
|
| |
| 09:00-10:30, Paper ThI1LB.5 | Add to My Program |
| VS-Graphs: Environment-Aware 3D Scene Graphs for Visual SLAM |
|
| Tourani, Ali | University of Luxembourg |
| Ejaz, Saad | University of Luxembourg |
| Fernandez-Cortizas, Miguel | University of Luxembourg |
| Sanchez-Lopez, Jose Luis | University of Luxembourg |
| Voos, Holger | University of Luxembourg |
Keywords: Semantic Scene Understanding, Visual-Inertial SLAM, Localization
Abstract: We introduce the latest achievements and results of Visual S-Graphs (vS-Graphs), our open-source, real-time VSLAM framework that tightly couples map reconstruction to online 3D scene graph generation. vS-Graphs employs visual and depth cues to detect and localize building components, such as walls and ground surfaces, from which higher-level structural elements, including variant-shaped rooms and floors, are inferred. These entities are incorporated into an optimizable hierarchical 3D scene graph, jointly maintained with the SLAM pipeline, enabling richer map semantics and improved localization. The framework is publicly available at https://github.com/snt-arg/visual_sgraphs. We evaluated vS-Graphs on both public RGB-D benchmarks and our in-house SMapper dataset, which includes diverse multi-room indoor environments with LiDAR-derived ground truth. These evaluations focused on trajectory estimation, map quality, semantic structural detection, and runtime performance. The results highlight the potential of tightly coupling VSLAM with online hierarchical scene graph generation for richer, more structurally meaningful environmental understanding. In particular, the ability of vS-Graphs to infer higher-level layout entities from visually detected building components suggests a promising direction for bridging geometric mapping and semantic scene reasoning within a unified framework. Full evaluation results and figures are available on https://snt-arg.github.io/vsgraphs-results/.
|
| |
| 09:00-10:30, Paper ThI1LB.6 | Add to My Program |
| From Simulation to Deployment: Curriculum-Based Domain Adaptation for Semantic Segmentation in Autonomous Forklifts |
|
| Schützenhöfer, Christof | KNAPP Industry Solutions GmbH |
| Rechberger, Patrick | KNAPP Industry Solutions GmbH |
| Ulz, Thomas | Graz University of Technology |
| Steger, Christian | Graz University of Technology |
Keywords: Industrial Robots, Object Detection, Segmentation and Categorization, Computer Vision for Transportation
Abstract: Deploying semantic segmentation models for autonomous forklifts in industrial environments is challenging because visual conditions vary across sites, leading to poor cross-domain generalization and costly re-annotation efforts. We propose a curriculum-based domain adaptation framework that progressively transfers a segmentation model from simulation to real-world industrial deployment. The model is first pretrained on synthetic datasets with increasing complexity, then fine-tuned on a labeled real source domain to reduce the sim-to-real gap and adapt to camera-specific characteristics. Finally, it is adapted to a new target domain using pseudo-label-based self-training. To reduce drift during target adaptation, pseudo-labeled target samples are combined with labeled samples from the source-real domain, while a replay buffer improves robustness to class imbalance by oversampling rare classes. Preliminary experiments with DDRNet demonstrate improved performance under both moderate and hard domain shifts, with mIoU gains from 67.37 to 71.36 and from 49.57 to 57.22, respectively. The results highlight the potential of progressive multi-domain adaptation for scalable industrial robotic perception.
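As one way to realize the rare-class oversampling mentioned above, a NumPy sketch that draws replay samples with probability inversely proportional to the frequency of the classes they contain; the weighting scheme and names are assumptions, not the paper's implementation:

    import numpy as np

    def replay_sampling_probs(sample_class_counts, eps=1e-8):
        """sample_class_counts: (N, C) per-sample pixel counts per class.
        Returns per-sample sampling probabilities that favor rare classes."""
        counts = np.asarray(sample_class_counts, dtype=float)
        freq = counts.sum(0) / counts.sum()           # global class frequencies
        inv = 1.0 / np.maximum(freq, eps)             # rare classes get large weight
        w = (counts > 0).astype(float) @ inv          # samples with rare classes rise
        return w / w.sum()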
|
| |
| 09:00-10:30, Paper ThI1LB.7 | Add to My Program |
| An Electrostatic Linear Film Actuator with Passive Phase Switching |
|
| Kim, Junsoo | EPFL, Lausanne, Switzerland |
| Sánchez, Claudia | University Carlos III of Madrid |
| Byeon, Seongju | Seoul National University |
| Monje, Concepción A. | University Carlos III of Madrid |
| Han, Amy Kyungwon | Seoul National University |
| Shea, Herbert | EPFL |
Keywords: Actuation and Joint Mechanisms, Human-Centered Robotics, Prosthetics and Exoskeletons
Abstract: Artificial muscles for soft robots and human-interactive machines must deliver high force, fast dynamics, and large stroke (tens of percent strain) while remaining mechanically compliant. Sliding electrostatic film actuators enable compact multilayer integration without lateral expansion and provide large stroke (≈40% contraction) by generating shear forces between ultrathin (≈50–150 µm) slider and stator films. However, in multi-stack configurations, force additivity degrades under curvature or misalignment due to uneven electrode overlap. Here, we introduce an electrostatic linear film actuator with passive mechanical phase switching via patterned brush contacts. As the slider moves, each stator electrode is locally assigned HV or ground at the correct position, maintaining force direction under bending and stretch, and enabling operation from a single DC source.
|
| |
| 09:00-10:30, Paper ThI1LB.8 | Add to My Program |
| Development of a Variable Stiffness High Resolution Tension Sensor for Tendon-Driven Robot Hands |
|
| Pyo, Ginwoo | Sogang University |
| Lee, Chunghyeon | Sogang University |
| Jeong, Seokhwan | Mechanical Eng., Sogang University |
Keywords: Mechanism Design, Tendon/Wire Mechanism, Compliant Joints and Mechanisms
Abstract: Tendon-driven robotic hands face a fundamental “Triple Trade-Off” among high force control bandwidth, low mechanical impedance, and precise force estimation. This extended abstract presents a novel variable-stiffness tension sensor designed to overcome these conflicting requirements. By integrating a nonlinear spring mechanism into the tendon routing path, the sensor dynamically modulates its physical stiffness according to the applied tension while simultaneously providing high-resolution force feedback. Experimental results confirm that the system’s force control bandwidth dynamically increases from 12.70 Hz to 19.53 Hz as the tendon tension scales up. Furthermore, the feasibility of the system was validated on a 3DoF tendon-driven robotic finger, successfully demonstrating the sensor’s high sensitivity by delicately actuating a 50 gf mechanical keyboard switch at low tensions.
|
| |
| 09:00-10:30, Paper ThI1LB.9 | Add to My Program |
| Robotic Rendering of Oriental Ink Paintings with Spatial Awareness |
|
| Hu, Shaojun | Northwest A&F University |
| Yan, Zhiqi | Northwest A&F University |
| Liu, Yuhao | Northwest A&F University |
| Jin, Hao | Northwest A&F University |
| Lian, Minghui | Northwest A&F University |
| Zhang, Zhiyi | Northwest A&F University |
Keywords: Art and Entertainment Robotics, Human Factors and Human-in-the-Loop, Intelligent and Flexible Manufacturing
Abstract: Enabling robots to imitate artists who observe objects in the 3D world and physically draw them in a specific style is a challenging problem. Shuimohua (Sumi-e) is a typical non-photorealistic oriental ink painting art that uses simple ink, water and brush to paint and convey poetic imagery. Although digital rendering of non-photorealistic paintings has been extensively studied, physical rendering of stylized paintings from 3D space is still a challenge because it requires considering the abstraction process of vectorized strokes with various styles in a physical environment. In this paper, we propose a robotic rendering approach for physically drawing oriental ink paintings from 3D shapes, and aim to mimic the painting process of a real-world artist using binocular vision to estimate 3D scenes. Following artists' drawing habits, we first extract expressive contours as drawing outlines and vectorize the contours to polylines. Next, the polylines are converted to line strokes with varying thicknesses by considering clamped B-spline fitting and isophote distances. To generate a typical dot shading effect, an oriented Poisson disk sampling approach is proposed to create dot strokes that depict the internal features of the 3D models. Finally, we build an ink gradient model and map the coordinates of the strokes to a robotic arm, and a control method of guide rails is proposed for robotic painting on large canvases.
|
| |
| 09:00-10:30, Paper ThI1LB.10 | Add to My Program |
| Thin-Film Programmable Robotic Damper Enabled by a Stick-Slip-Free Electrostatic Clutch |
|
| Ma, Jihyeong | Korea Advanced Institute of Science and Technology |
| Nam, Jongseok | Korea Advanced Institute of Science & Technology (KAIST) |
| Lee, Nak Hyeong | Korea Advanced Institute of Science and Technology |
| Kyung, Ki-Uk | Korea Advanced Institute of Science & Technology (KAIST) |
Keywords: Soft Robot Applications, Mechanism Design, Wearable Robotics
Abstract: Electrostatic (ES) clutches are promising candidates for wearable and assistive robotics due to their thin, lightweight, and low-power characteristics. However, conventional ES clutches typically suffer from mechanical instability caused by the stick-slip phenomenon, restricting their operation to simple binary (locked or free) modes. In this work, we present a Stick-slip-free Variable Electrostatic (SV-ES) clutch that functions as a high-performance programmable robotic damper. By utilizing a PVC-gel friction layer, the device achieves stable and continuous sliding even under high shear stress (29 N/cm² at 100 V). We demonstrate that this stability allows for precise closed-loop modulation of kinetic friction and motor-free position control. The versatility of the SV-ES clutch is validated through three robotic applications: active motion assistance for a robotic arm, high-fidelity haptic rendering, and programmable impact damping for a robotic leg.
|
| |
| 09:00-10:30, Paper ThI1LB.11 | Add to My Program |
| Experimental Validation of a Motor–SMA Hybrid Actuation for Lightweight Wearable Robot |
|
| Bak, Jeongae | Korea Institute of Machinery & Materials |
| Choi, Kyungjun | Korea Institute of Machinery and Materials |
| Jung, Hyunmok | Korea Institute of Machinery and Materials |
| Seo, Hyunuk | Korea Institute of Machinery and Materials |
| Kim, Daehyun | Korea Institute of Machinery & Materials |
Keywords: Actuation and Joint Mechanisms, Wearable Robotics, Soft Sensors and Actuators
Abstract: Conventional motor-driven wearable robots often suffer from increased weight and limited torque output. To address this issue, this study proposes a motor–SMA hybrid actuation approach that combines the advantages of electric motors and shape memory alloy (SMA) actuators. A dedicated testbed was developed to evaluate the proposed method under varying load conditions. Experimental results show that the SMA actuator provides additional assistive torque of approximately 3 N·m compared to motor-only operation, without significant increase in system weight. These results demonstrate the feasibility of hybrid actuation for achieving lightweight and high-performance wearable robotic systems.
|
| |
| 09:00-10:30, Paper ThI1LB.12 | Add to My Program |
| ROOM-3D: Real-Time Unsupervised Online 3D Room Segmentation |
|
| Flor Rodríguez-Rabadán, Rafael | Alcalá University |
| Gutiérrez Álvarez, Carlos | Universidad De Alcalá |
| Bañuls-González, Alexis | University of Alcalá |
| Lafuente-Arroyo, Sergio | University of Alcalá |
| Maldonado-Bascón, Saturnino | Universidad De Alcalá |
| López-Sastre, Roberto J. | University of Alcalá |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, Mapping
Abstract: Room-level understanding is essential for mobile robots operating in unseen indoor environments. Existing room segmentation methods predominantly assume an offline setting, typically requiring a complete scene reconstruction before producing the final result, which limits their applicability to real-time robotic navigation. In this work, we introduce the novel problem of online 3D room segmentation, where a robot must continuously segment rooms and detect room transitions from streaming sensory observations during exploration. By framing 3D room segmentation in an online setting, we aim to encourage further research in practical, real-time semantic mapping for autonomous agents operating in unknown environments. To properly assess this novel online setting, we also introduce instantaneous evaluation metrics tailored to online room segmentation and transition detection. We then propose ROOM-3D: a real-time unsupervised framework for online 3D room segmentation. ROOM-3D combines Gaussian-based SLAM with open-vocabulary semantic reasoning to incrementally generate semantically structured 3D room segmentations, as well as transition estimates, without access to future observations or global post-processing. Experiments on the HM3D-Semantics dataset demonstrate that ROOM-3D achieves temporally consistent and accurate segmentation under strict online constraints, while offering state-of-the-art results in the offline experimental evaluation.
|
| |
| 09:00-10:30, Paper ThI1LB.13 | Add to My Program |
| A New Decoupling Method for Cable-Driven Joints Based on an Anti-Parallelogram Mechanism |
|
| Oh, Seungcheol | Kyungpook National University |
| Lim, Seungbum | Kyungpook National University |
| Suh, Jungwook | Kyungpook National University (KNU) |
Keywords: Mechanism Design, Tendon/Wire Mechanism, Actuation and Joint Mechanisms
Abstract: This paper proposes a novel anti-parallelogram mechanism (APM)-based cable-driven joint that achieves both the kinematic decoupling characteristic of rolling joints and the surface-contact robustness of link structures. Through the optimization of internal idlers, the cable path length remains nearly constant without complex gears or ligaments. Experimental validation using a 2-DOF prototype demonstrated negligible distal joint interference (0.0815 deg RMSE) during actuation. Furthermore, the system exhibited sub-millimeter positional repeatability (maximum RMSE of 0.2786 mm), establishing the proposed design as a robust, high-precision decoupling solution.
|
| |
| 09:00-10:30, Paper ThI1LB.14 | Add to My Program |
| Knowledge Synthesis in Dynamic Human-Swarm Interactions Using LLMs |
|
| Ballo, Boubacar | New York University Abu Dhabi |
| Yihunie, Absera | New York University Abu Dhabi |
| Schwarzenbach, Lilly | Vrije Universiteit Amsterdam |
| Salam, Hanan | Isir Upmc Cnrs |
| Ferrante, Eliseo | Vrije Universiteit Amsterdam |
Keywords: Distributed Robot Systems, Swarm Robotics, Environment Monitoring and Management
Abstract: Collectively exploring and understanding an environment is an open challenge, particularly in dynamic settings where agents must rely on limited information that may only be intermittently available. In this paper, we focus on how agents can maximize information capture in these contexts. As agents encounter an informant with information to communicate—such as a human collaborator sharing a transient observation—they aggregate this data into a textual description of the environment using an LLM. We show that agents capture environmental information faster when sharing information with other members of the swarm. While strictly ephemeral information may never be fully captured, social learning enables agents to acquire significantly more information, demonstrating the critical importance of information sharing between agents.
|
| |
| 09:00-10:30, Paper ThI1LB.15 | Add to My Program |
| Grasping Point Estimation for EA Suction Cup Grippers on Curved Objects |
|
| Catalano, Angelo | Politecnico Di Bari |
| De Carolis, Simone | Politecnico Di Bari |
| Vitucci, Gennaro | Politecnico Di Bari |
| Carbone, Giuseppe | Polytechnic of Bari |
| Dotoli, Mariagrazia | Politecnico Di Bari |
| Cacucciolo, Vito | Politecnico Di Bari |
Keywords: Grasping, Contact Modeling, Soft Robot Applications
Abstract: Electroadhesion suction cups (EASCs) are fully electrical grippers (no air flow needed) with very low power consumption that can grasp flat to curved objects from the top. They conform to the shape of the object by zipping from the central contact point to their edges, driven by electroadhesion forces. Zipping requires elastically deforming the EASC membrane. The object's surface curvature at the contact point strongly affects zipping ability, and therefore grasp feasibility. We developed a model for grasping point selection that predicts the voltage required for full zipping at a point of given local curvature. Feasible points are those where the estimated zipping voltage is lower than the breakdown voltage of the EASC. The model is based on an energy balance between electrostatic work and elastic deformation, explicitly including in-plane stretching on doubly curved surfaces. Experiments on cylinders, spheres, and ellipsoids validate the predicted thresholds and curvature-dependent trends.
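The feasibility rule stated above (a point is graspable if the estimated zipping voltage stays below breakdown) can be sketched as follows; the parallel-plate energy terms and all parameter values are illustrative assumptions, not the authors' full model, which also accounts for in-plane stretching on doubly curved surfaces:

    import numpy as np

    EPS0 = 8.854e-12  # vacuum permittivity [F/m]

    def zipping_voltage(kappa, t=5e-5, E=1e6, eps_r=3.0, d=2e-5):
        """Voltage needed to zip a membrane (thickness t [m], modulus E [Pa])
        onto a surface of local curvature kappa [1/m], balancing electrostatic
        energy per area (0.5*eps0*eps_r*V^2/d) against plate bending energy."""
        bend = 0.5 * (E * t**3 / 12.0) * kappa**2   # bending energy per area
        return np.sqrt(2.0 * bend * d / (EPS0 * eps_r))

    def feasible(kappa, v_breakdown=8e3, **kw):
        """Grasping point selection: keep points whose estimated zipping
        voltage is below the EASC breakdown voltage (values illustrative)."""
        return zipping_voltage(kappa, **kw) < v_breakdown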
|
| |
| 09:00-10:30, Paper ThI1LB.16 | Add to My Program |
| Optimal Motion Planning for Object Picking in Industrial Contexts with Optimal Control |
|
| Dirckx, Dries | KU Leuven |
| Swevers, Jan | KU Leuven |
| Decre, Wilm | Katholieke Universiteit Leuven |
Keywords: Industrial Robots, Optimization and Optimal Control, Motion Control
Abstract: This work presents a hard-constrained optimal-control-based motion planner for robotic manipulators operating in cluttered industrial environments. The method targets object picking tasks where cycle time, accuracy of the final pose, and collision safety are critical. The planner formulates a time-optimal trajectory generation problem with explicit constraints on collision avoidance, robot kinematics, and the accuracy with which the end-effector reaches its grasp pose. Environments, grippers, and objects are modelled using capsules, cuboids, and planes - geometric primitives commonly available in industrial robotic software - allowing flexible reconfiguration of workcells without modifying the underlying problem transcription. Two complementary initialisation strategies are proposed to reduce the computational complexity of the non-linear, non-convex optimal control problem: a geometric best-guess initialisation and a near-optimal warm-starting approach that leverages previously computed trajectories during repeated task execution. Compared to cuRobo, the proposed CPU-only planner exhibits higher computational times but produces trajectories with 0.7x lower execution time and guarantees constraint satisfaction due to its hard-constrained formulation. The near-optimal initialisation method is shown to reduce computation times by up to 2.3x relative to the best-guess approach while simultaneously improving the success rates.
|
| |
| 09:00-10:30, Paper ThI1LB.17 | Add to My Program |
| Intelligent Mechanical Characterization of Date Fruits for Automated Harvesting Grippers |
|
| Shami, Shahd | King Abdullah University of Science and Technology |
| Wali, Obadah | KAUST |
| Feron, Eric | King Abdullah University of Science and Technology |
| Park, Shinkyu | KAUST |
| Alam, Syed Muhammad | King Abdullah University of Science and Technology |
Keywords: Agricultural Automation, Grasping, Force and Tactile Sensing
Abstract: Robotic harvesting of date fruits requires precise grasping force control to prevent tissue damage, yet cultivar-specific biomechanical limits remain absent from the literature. This work presents the first continuous stress-strain characterization of three Saudi date cultivars across three hydration states, translated into validated robotic grasping constraints. A custom parallel-plate compression system emulates two-finger robotic grasping, while a Mask R-CNN vision model provides non-contact geometric measurement with below 5% relative error. A total of 500 samples were tested. Cyclic loading experiments establish elastic strain limits, with conservative operational thresholds of 7% for Ajwa and Barhi and 5% for Sagai. A linear calibration model maps gripper displacement to induced fruit strain, enabling strain-controlled robot commands. Validation using a UR10e manipulator confirms damage-free manipulation at these limits across all cultivars, with residual deformation below 1 mm and strain tracking error below 1%. Future work will integrate vision and force feedback to train machine learning models on these experimentally derived limits, enabling real-time geometry-based gripper control for fully autonomous harvesting.
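A sketch of the strain-controlled command step, assuming the linear calibration takes the hypothetical form delta = a*(strain*D) + b; the coefficients and helper name are not from the paper:

    def displacement_command(fruit_diameter_mm, strain_limit, a=1.0, b=0.0):
        """Map a target fruit strain to a gripper closing displacement via a
        linear calibration; a and b come from the calibration fit (assumed)."""
        return a * (strain_limit * fruit_diameter_mm) + b

    # e.g. a 25 mm Ajwa date at its 7% conservative limit:
    cmd = displacement_command(25.0, 0.07)   # -> 1.75 mm closing displacement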
|
| |
| 09:00-10:30, Paper ThI1LB.18 | Add to My Program |
| Risk-Aware Control of Tendon-Driven Continuum Robots Via CVaR-MPPI with Residual Learning for Hysteresis Compensation : A Pilot Study |
|
| Lee, Dongjun | Daegu Gyeongbuk Institute of Science and Technology(DGIST) |
| Kim, DongWook | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Tendon/Wire Mechanism, Robot Safety, Collision Avoidance
Abstract: Tendon-driven continuum robots (TDCRs) are widely used in confined operating environments due to their thin shape, flexibility, and compliance, making them easily deployable in narrow or contact-rich environments. However, real-time safe control near obstacles remains challenging. Computationally expensive dynamic models, such as the Cosserat rod model, are impractical for real-time control, and conventional model predictive control (MPC) methods require linearization of the dynamics, limiting their applicability to the complex nonlinear behavior of TDCRs, including hysteresis. In this paper, we adopt the piecewise constant curvature (PCC) model, which assumes constant curvature for each link. While computationally cheap, this approximation contains modeling errors that, combined with mechanical friction, backlash, and misalignment at the rolling joints, result in unpredictable hysteresis. We therefore propose CVaR-MPPI (Conditional Value-at-Risk Model Predictive Path Integral), a controller that combines sampling-based planning with probabilistic safety under environmental uncertainty, improving both worst-case risk management and sampling efficiency. In simulation with 100 iterations, CVaR-MPPI improves the success rate from 80% to 85% and the mean safety clearance by 129%, while maintaining end-effector tracking error comparable to standard MPPI, as detailed in the simulation results. The controller runs at 50 Hz with 8192 samples, demonstrating real-time feasibility.
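For readers unfamiliar with CVaR-MPPI, a simplified NumPy sketch of one update, where each candidate control sequence is scored by the CVaR (mean of the worst alpha-fraction) of its rollout costs under sampled uncertainty; names and the exact weighting are assumptions, not the authors' implementation:

    import numpy as np

    def cvar(costs, alpha=0.1):
        """Conditional Value-at-Risk: mean of the worst alpha-fraction of costs."""
        k = max(1, int(np.ceil(alpha * len(costs))))
        return np.sort(costs)[-k:].mean()

    def cvar_mppi_update(controls, noise, costs, alpha=0.1, lam=1.0):
        """controls: (T, m) nominal inputs; noise: (K, T, m) sampled
        perturbations; costs: (K, R) rollout costs of each of the K
        candidates under R uncertainty draws."""
        risk = np.array([cvar(c, alpha) for c in costs])   # (K,) tail risk
        w = np.exp(-(risk - risk.min()) / lam)             # MPPI soft-min weights
        w /= w.sum()
        return controls + np.einsum('k,ktm->tm', w, noise)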
|
| |
| 09:00-10:30, Paper ThI1LB.19 | Add to My Program |
| Modelling and Digital Twin Framework for Highly Compliant Soft Robotic Systems |
|
| Arbatani, Siamak | McGill University |
| Kovecses, Jozsef | McGill University |
| |
| 09:00-10:30, Paper ThI1LB.20 | Add to My Program |
| Design and Development of a Spiral Chain Actuator |
|
| Subedi, Rakshya | University of Nevada Las Vegas |
| Gourdet, Shia | University of Nevada, Las Vegas |
| Harris, Leah | University of Nevada, Las Vegas |
| Bae, Andrew | University of Nevada, Las Vegas |
Keywords: Actuation and Joint Mechanisms, Kinematics
Abstract: In this paper, the design and experimental validation of a novel Spiral Chain Actuator and its application in a three-degree-of-freedom positioning platform are presented. Unlike previous spiral zipper actuators that rely on flexible bands and face structural integrity limitations under tension and moment loads, the proposed design employs rigid chain pieces that interlock during rotation. This rigid architecture enables an improved load-bearing capacity while maintaining the compact, lightweight advantages of spiral actuation. We developed a positioning platform equipped with three Spiral Chain Actuators arranged in a tetrahedral configuration and validated position control through experimental testing. The results demonstrated successful position tracking across all three translational axes, which establishes a foundation for developing a full VTT system and evaluating its performance in the future.
|
| |
| 09:00-10:30, Paper ThI1LB.21 | Add to My Program |
| A Soft-Rigid Tendon-Driven Continuum Robot with Multi-Curvature Actuated by a Single Set of Tendons |
|
| Qiu, Liuming | The Hong Kong Polytechnic University |
| Labazanova, Luiza | The Hong Kong Polytechnic University |
| Navarro-Alarcon, David | The Hong Kong Polytechnic University |
| Zhang, Dan | The Hong Kong Polytechnic University |
Keywords: Soft Robot Materials and Design, Modeling, Control, and Learning for Soft Robots, Mechanism Design
Abstract: This work presents a novel design for achieving multiple curvatures in a tendon-driven continuum robot (TDCR) system with only a single set of actuation tendons. The TDCR used in this work is assembled from multiple sub-sections made of low-melting-point alloy (LMPA), each of which has independent binary stiffness controlled by localized thermal phase transitions. Through these transitions, the robot can dynamically "lock" or "release" specific sub-sections, enabling multi-curvature configurations and independent curvature control with only one set of actuation tendons. This approach eliminates the need for additional segments or complex locking mechanisms, significantly reducing mechanical complexity and control challenges. Experimental validation confirms the system's ability to execute complex shapes (C/J/S-shape configurations) and maintain structural rigidity in locked states, and shows that different spatial movements can be achieved by changing the configuration of sub-segments. The SR-TDCR demonstrates potential for confined-space applications, merging dexterity with actuation efficiency.
|
| |
| 09:00-10:30, Paper ThI1LB.22 | Add to My Program |
| Design and Optimization of a Tensioner-Driven Compliant Pulley Mechanism for Supermicrosurgical Robot End-Effectors |
|
| Kim, Min Chul | Korea Institute of Science and Technology |
| Lee, Young Min | SKKU |
| Ihn, Yong Seok | Korea Institute of Science and Technology |
| |
| ThAT1 Regular Session, Hall A2 |
Add to My Program |
| Planning |
|
| |
| |
| 11:00-11:10, Paper ThAT1.1 | Add to My Program |
| Progress Constraints for Reinforcement Learning in Behavior Trees |
|
| Rietz, Finn | Orebro University |
| Kartašev, Mart | Royal Institute of Technology (KTH) |
| Ogren, Petter | Royal Institute of Technology (KTH) |
| Stork, Johannes A. | Orebro University |
Keywords: Integrated Planning and Learning, Reinforcement Learning
Abstract: Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.
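A minimal sketch of the action-masking idea, where a feasibility estimator removes actions predicted to undo already-achieved subgoals; the names are illustrative, and the paper's constraint derivation from BT convergence results is richer than this:

    import numpy as np

    def constrained_action(q_values, feasible_mask):
        """Pick the best action among those the feasibility estimators allow;
        actions flagged infeasible (mask entry False) are masked with -inf."""
        masked = np.where(feasible_mask, q_values, -np.inf)
        return int(np.argmax(masked))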
|
| |
| 11:10-11:20, Paper ThAT1.2 | Add to My Program |
| Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning |
|
| Kawabe, Tomoya | NEC Corporation |
| Takano, Rin | NEC Corporation |
Keywords: Agent-Based Systems, Task Planning, Multi-Robot Systems
Abstract: Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.
|
| |
| 11:20-11:30, Paper ThAT1.3 | Add to My Program |
| LoGoPlanner: Localization Grounded Navigation Policy with Metric-Aware Visual Geometry |
|
| Peng, Jiaqi | Tsinghua University |
| Cai, Wenzhe | Shanghai AI Laboratory |
| Yang, Yuqiang | Shanghai AI Laboratory |
| Wang, Tai | Shanghai AI Laboratory |
| Shen, Yuan | Tsinghua University |
| Pang, Jiangmiao | Shanghai AI Laboratory |
Keywords: Vision-Based Navigation, Motion and Path Planning, RGB-D Perception
Abstract: Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3% improvement over oracle-localization baselines and strong generalization across embodiments and environments. The code and models will be made publicly available upon publication.
|
| |
| 11:30-11:40, Paper ThAT1.4 | Add to My Program |
| The One RING: A Robotic Indoor Navigation Generalist |
|
| Eftekhar, Ainaz | University of Washington |
| Hendrix, Rose | Allen Institute for AI |
| Weihs, Luca | Allen Institute for AI |
| Duan, Jiafei | University of Washington |
| Caglar, Ege | University of Washington - Seattle |
| Salvador, Jordi | Allen Institute for AI |
| Herrasti, Alvaro | Allen Institute for Artificial Intelligence |
| Han, Winson | Allen Institute for Artificial Intelligence |
| VanderBilt, Eli | Allen Institute for Artificial Intelligence |
| Kembhavi, Aniruddha | Allen Institute for AI |
| Farhadi, Ali | University of Washington |
| Krishna, Ranjay | University of Washington |
| Ehsani, Kiana | Allen Institute for Artificial Intelligence |
| Zeng, Kuo-Hao | Vercept AI |
Keywords: Vision-Based Navigation, Imitation Learning, Reinforcement Learning
Abstract: Modern robots vary significantly in shape, size, and sensor configurations used to perceive and interact with their environments. However, most navigation policies are embodiment-specific—a policy trained on one robot typically fails to generalize to another, even with minor changes in body size or camera viewpoint. As custom hardware becomes increasingly common, there is a growing need for a single policy that generalizes across embodiments, eliminating the need to (re-)train for each specific robot. In this paper, we introduce RING (Robotic Indoor Navigation Generalist), an embodiment-agnostic policy that turns any mobile robot into an effective indoor semantic navigator. Trained entirely in simulation, RING leverages large-scale randomization over robot embodiments to enable robust generalization to many real-world platforms. To support this, we augment the AI2-THOR simulator to instantiate robots with controllable configurations, varying in body size, rotation pivot point, and camera parameters. On the visual object-goal navigation task, RING achieves strong cross-embodiment (XE) generalization—72.1% average success rate across 5 simulated embodiments (a 16.7% absolute improvement on the Chores-S benchmark) and 78.9% across 4 real-world platforms, including Stretch RE-1, LoCoBot, and Unitree Go1—matching or even surpassing embodiment-specific policies. We further deploy RING on the RB-Y1 wheeled humanoid in a real-world kitchen environment, showcasing its out-of-the-box potential for mobile manipulation platforms.
|
| |
| 11:40-11:50, Paper ThAT1.5 | Add to My Program |
| Global Planning for Object Navigation Via a Weighted Traveling Repairman Problem Formulation |
|
| Liu, Ruimeng | Nanyang Technological University |
| Xu, Xinhang | Nanyang Technological University |
| Yuan, Shenghai | Nanyang Technological University |
| Xie, Lihua | NanyangTechnological University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Task and Motion Planning
Abstract: Zero-Shot Object Navigation (ZSON) requires agents to navigate to objects specified via open-ended natural language without predefined categories or prior environmental knowledge. While recent methods leverage foundation models or multi-modal maps, they often rely on 2D representations and greedy strategies or require additional training or modules with a high computation load, limiting performance in complex environments and real applications. We propose WTRP-Searcher, a novel framework that formulates ZSON as a Weighted Traveling Repairman Problem (WTRP), minimizing the weighted waiting time of viewpoints. Using a Vision-Language Model (VLM), we score viewpoints based on object-description similarity, projected onto a 2D map with depth information. An open-vocabulary detector identifies targets, dynamically updating goals, while a 3D embedding feature map enhances spatial awareness and environmental recall. WTRP-Searcher outperforms existing methods, offering efficient global planning and improved performance in complex ZSON tasks. Code and demos will be available at https://github.com/lrm20011/WTRP_Searcher.
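The WTRP objective being minimized can be stated compactly: for a given visit order, the cost is the sum over viewpoints of weight times arrival (waiting) time. A brute-force sketch, viable only for small instances and with hypothetical names:

    import itertools

    def weighted_latency(order, travel_time, weight):
        """Sum of w_i * t_i, with t_i the arrival time of viewpoint i under
        the given visit order; index 0 is the start pose."""
        t, cost, prev = 0.0, 0.0, 0
        for v in order:
            t += travel_time[prev][v]   # accumulate waiting time
            cost += weight[v] * t
            prev = v
        return cost

    def solve_wtrp(n, travel_time, weight):
        """Exact solution by enumeration over viewpoints 1..n."""
        return min(itertools.permutations(range(1, n + 1)),
                   key=lambda o: weighted_latency(o, travel_time, weight))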
|
| |
| 11:50-12:00, Paper ThAT1.6 | Add to My Program |
| Robust Task Planning Via Failure Detection Using Scene Graph from Multi-View Images |
|
| Chong, Haechan | POSTECH |
| Lee, Jongwon | UNIST |
| Ahn, Hyemin | POSTECH |
Keywords: Task and Motion Planning, Manipulation Planning
Abstract: Recent robot task planners utilize large language models (LLMs) or vision-language models (VLMs) as a failure detector. These methods perform well by leveraging their semantic reasoning capabilities but often assume full scene understanding, which can lead to unreliable planning in complex scenes lacking explicit structural modeling. To address these limitations, we propose a novel multi-view scene understanding framework that explicitly models object-level relationships, enabling failure detection and effective task replanning. Our approach first captures multi-view images for comprehensive coverage, and generates local 2D scene graphs encoding object identities and relational information. Building on this, we introduce a model based on a graph neural network that merges the local 2D scene graphs into a unified representation. This process results in the unified scene graph, used to detect task success and identify failure causes. For each sub-task, our framework compares the unified scene graph with the expected scene graph predicted by the LLM during the task planning stage, identifying potential failure causes based on their deviations. These causes are then fed back into the LLM to facilitate effective replanning, thereby reducing repetitive failures and enhancing adaptability. We evaluate our framework on five real-world benchmark tasks to demonstrate its applicability. Separately, we compare failure detection and reasoning performance with other methods, showing the benefits of combining multi-view perception with explicit graph-based reasoning. More information can be found at https://sites.google.com/view/scrutinize-robot-manipulation
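The failure-cause identification step reduces to comparing the unified scene graph with the LLM-predicted one; a minimal sketch over (subject, relation, object) triples, leaving out the GNN-based merging described above:

    def graph_deviations(expected_edges, observed_edges):
        """Return relations that should hold but were not observed (candidate
        failure causes) and relations observed but not expected."""
        missing = set(expected_edges) - set(observed_edges)
        unexpected = set(observed_edges) - set(expected_edges)
        return missing, unexpected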
|
| |
| 12:00-12:10, Paper ThAT1.7 | Add to My Program |
| Congestion Mitigation Path Planning for Large-Scale Multi-Agent Navigation in Dense Environments |
|
| Kato, Takuro | National Institute of Advanced Industrial Science and Technology (Japan) |
| Okumura, Keisuke | National Institute of Advanced Industrial Science and Technology (AIST) |
| Sasaki, Yoko | National Institute of Advanced Industrial Science and Technology |
| Yokomachi, Naoya | National Institute of Advanced Industrial Science and Technology |
|
|
| |
| 12:10-12:20, Paper ThAT1.8 | Add to My Program |
| URPlanner: A Universal Paradigm for Collision-Free Robotic Motion Planning Based on Deep Reinforcement Learning |
|
| Ying, Fengkang | National University of Singapore |
| Zhang, Hanwen | National University of Singapore |
| Wang, Haozhe | Singapore-MIT Alliance for Research and Technology |
| Huang, Huishi | National University of Singapore |
| Ang Jr, Marcelo H | National University of Singapore |
Keywords: Motion and Path Planning, Collision Avoidance, AI-Based Methods, Industrial Robots
Abstract: Collision-free motion planning for redundant robot manipulators in complex environments remains underexplored. Although recent advancements at the intersection of deep reinforcement learning (DRL) and robotics have highlighted the potential of DRL to handle versatile robotic tasks, current DRL-based collision-free motion planners for manipulators are highly costly, hindering their deployment and application. This is due to an overreliance on the minimum distance between the manipulator and obstacles, inadequate exploration and decision-making by DRL, and inefficient data acquisition and utilization. In this article, we propose URPlanner, a universal paradigm for collision-free robotic motion planning based on DRL. URPlanner offers several advantages over existing approaches: it is platform-agnostic, cost-effective in both training and deployment, and applicable to arbitrary manipulators without solving inverse kinematics. To achieve this, we first develop a parameterized task space and a universal obstacle avoidance reward that is independent of minimum distance. Second, we introduce an augmented policy exploration and evaluation algorithm that can be applied to various DRL algorithms to
|
| |
| 12:20-12:30, Paper ThAT1.9 | Add to My Program |
| An Analysis of Constraint-Based Multi-Agent Pathfinding Algorithms |
|
| Lee, Hannah | University of Illinois at Urbana-Champaign |
| Motes, James | University of Illinois Urbana-Champaign |
| Morales, Marco | University of Illinois Urbana-Champaign & Instituto Tecnológico Autónomo de México |
| Amato, Nancy | University of Illinois Urbana-Champaign |
|
|
| |
| ThAT2 Regular Session, Hall A3 |
Add to My Program |
| Medical Robotics II |
|
| |
| Chair: De Momi, Elena | Politecnico Di Milano |
| |
| 11:00-11:10, Paper ThAT2.1 | Add to My Program |
| ROOM: A Physics-Based Continuum Robot Simulator for Photorealistic Medical Datasets Generation |
|
| Esposito, Salvatore | University of Edinburgh |
| Mattamala, Matias | University of Edinburgh |
| Rebain, Daniel | University of British Columbia |
| Zhang, Francis Xiatian | University of Edinburgh |
| Dhaliwal, Kev | University of Edinburgh, Center for Inflammation Research, |
| Khadem, Mohsen | University of Edinburgh |
| Ramamoorthy, Subramanian | The University of Edinburgh |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Continuum robots are advancing bronchoscopy procedures by accessing complex lung airways and enabling targeted interventions. However, their development is limited by the lack of realistic testing environments: Real data is difficult to collect due to ethical constraints and patient safety concerns, and developing autonomy algorithms requires realistic imaging and physical feedback. We present ROOM (Realistic Optical Observation in Medicine), a comprehensive simulation framework designed for generating photorealistic bronchoscopy training data. By leveraging patient CT scans, our pipeline renders multi-modal sensor data including RGB images with realistic noise and light specularities, metric depth maps, surface normals, optical flow and point clouds at medically relevant scales. We validate the data generated by ROOM in two canonical tasks for medical robotics, multi-view pose estimation and monocular depth estimation, demonstrating diverse challenges that state-of-the-art methods must overcome to transfer to these medical settings. Furthermore, we show that the data produced by ROOM can be used to fine-tune existing depth estimation models to overcome these challenges, also enabling other downstream applications such as navigation. We expect that ROOM will enable large-scale data generation across diverse patient anatomies and procedural scenarios that are challenging to capture in clinical settings. Our code and data will be publicly released.
|
| |
| 11:10-11:20, Paper ThAT2.2 | Add to My Program |
| Intraoperative Tumor Localization Using Sweeping Palpation in Robot-Assisted Minimally Invasive Surgery (RMIS) |
|
| Hong, Jeongbin | DGIST |
| Lee, Yunjeong | DGIST |
| Ryu, Youngjun | DGIST |
| Lee, Hyoryong | DGIST |
| Park, Joowon | University of Ulsan |
| Park, Sukho | DGIST |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Object Detection, Segmentation and Categorization
Abstract: Robot-assisted minimally invasive surgery (RMIS) provides superior visualization, precision, and flexibility, and it has gained recognition as a technology that enhances therapeutic outcomes, particularly in tumor resection. However, this technology has a limitation in that it predominantly relies on visual feedback, making it challenging for surgeons to accurately detect the location and edges of tumors during surgery. To address this issue, robotic palpation methods have been actively studied. Among these, the sweeping palpation method has the advantage of rapidly exploring a broad region. Nevertheless, conventional sweeping palpation methods can only roughly identify the tumor's location and are limited in detecting tumor edges with precision. In this study, we introduce a novel sweeping palpation method to overcome the limitations of conventional sweeping palpation in RMIS and propose a precise tumor localization method based on this approach. The proposed method involves performing sweeping palpation on the tissue surface using the tip of the robotic end effector while utilizing a Laplacian edge detection algorithm to detect abrupt changes in contact force. This method reduces the reliance on preoperative imaging and enables tumor localization to be performed within a single robotic system. To validate the proposed tumor localization method, we conducted three phantom experiments and an ex vivo experiment. These validations demonstrate the potential of our proposed method to contribute to precise tumor resection and the establishment of effective treatment plans.
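The edge-detection step can be illustrated with a discrete 1D Laplacian over the sweep's contact-force profile; the threshold and names are assumptions:

    import numpy as np

    def force_edges(force, spacing, threshold):
        """force: (N,) contact forces sampled along the sweep; spacing:
        distance between samples [m]. Marks abrupt stiffness changes where
        the magnitude of the second difference exceeds the threshold."""
        force = np.asarray(force, dtype=float)
        lap = (force[:-2] - 2.0 * force[1:-1] + force[2:]) / spacing**2
        return np.nonzero(np.abs(lap) > threshold)[0] + 1  # original sample indices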
|
| |
| 11:20-11:30, Paper ThAT2.3 | Add to My Program |
| Gaze-Guided Robotic Vascular Ultrasound Leveraging Human Intention Estimation |
|
| Bi, Yuan | TUM |
| Su, Yang | Technical University of Munich |
| Navab, Nassir | TU Munich |
| Jiang, Zhongliang | The University of Hong Kong |
Keywords: Medical Robots and Systems, Sensor-based Control, Computer Vision for Medical Robotics
Abstract: Medical ultrasound (US) has been widely used to examine vascular structure in modern clinical practice. However, the traditional US examination often faces challenges related to inter- and intra-operator variation. The robotic ultrasound system (RUSS) appears as a potential solution for such challenges because of its superiority in terms of stability and reproducibility. Given the complex anatomy of human vasculature, it is common for multiple vessels to appear in US images, or for a single vessel to bifurcate into multiple branches, complicating the examination process. To tackle this challenge, this work presents a gaze-guided RUSS for vascular applications. A gaze tracker is integrated to capture the eye movements of the human operator. The extracted gaze signal is utilized to guide the RUSS to follow the correct vessel when it bifurcates. Additionally, a gaze-guided segmentation network is proposed to enhance the segmentation robustness by exploiting the gaze information. However, gaze signals are often noisy, requiring interpretation to accurately discern the operator's true intentions. To this end, this study first proposes a stabilization module to process the raw gaze data. The inferred attention heatmap is then utilized as a region proposal to aid in segmentation and to serve as a trigger signal when the operator needs to adjust the scanning target, such as when a bifurcation appears in the current images. To ensure appropriate contact between the probe and the surface during the scanning, an automatic US confidence-based orientation correction method is developed as well. In the experiments, we demonstrated the efficiency of the proposed gaze-guided segmentation pipeline by comparing it with other segmentation methods. Besides, the performance of the proposed
|
| |
| 11:30-11:40, Paper ThAT2.4 | Add to My Program |
| STITCH 2.0: Extending Augmented Suturing with EKF Needle Estimation and Thread Management |
|
| Hari, Kush | UC Berkeley |
| Chen, Ziyang | UC Berkeley |
| Kim, Hansoul | Myongji University |
| Goldberg, Ken | UC Berkeley |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Computer Vision for Medical Robotics
Abstract: Suturing is a high-precision task performed at the end of procedures when surgeon fatigue may increase errors, highlighting the need for robot assistance. Previous autonomous suturing works, such as STITCH 1.0 [1], struggle to fully close wounds due to inaccurate needle tracking, thread tangling, and poor insertion placement. To address these challenges, we present STITCH 2.0, an improved augmented dexterity pipeline over STITCH 1.0 using the da Vinci Research Kit (dVRK) [2] with seven improvements including an improved EKF needle pose estimation pipeline, new thread untangling methods, and an automated 3D suture alignment algorithm. Experimental results over 15 trials (maximum 450 individual suture trials) find that STITCH 2.0 achieves 74.4% wound closure with an average of 4.87 sutures per trial, representing 66% more completed sutures in 38% less time compared to STITCH 1.0 [1]. When two human interventions are allowed, STITCH 2.0 averages six sutures with a 100% wound closure rate.
|
| |
| 11:40-11:50, Paper ThAT2.5 | Add to My Program |
| A Robotic System with Path Planning and Visual Guidance for Teleoperated Left Atrial Appendage Closure |
|
| Peloso, Angela | Politecnico Di Milano |
| D'Alessandro, Nadia | Scuola Superiore Sant'Anna |
| Zhang, Xiu | Scuola Superiore Di Sant'Anna |
| Menciassi, Arianna | Scuola Superiore Sant'Anna - SSSA |
| De Momi, Elena | Politecnico Di Milano |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Surgical Robotics: Steerable Catheters/Needles
Abstract: Percutaneous Left Atrial Appendage Closure (LAAC) is a minimally invasive procedure to prevent thromboembolic events in atrial fibrillation patients. The procedure's success relies on precise navigation and occluder deployment, which is challenged by sheath movement in the dynamic cardiac environment, procedural complexity, and prolonged radiation exposure. This study introduces a robotic-assisted navigation system for the LAAC procedure, integrating a dedicated steerable sheath, custom planning algorithms, and an intuitive teleoperation interface. The path-planning framework generates collision-free routes based on patient-specific anatomy, adjusting for deviations in real-time. The teleoperation interface comprises a digital twin of the patient's anatomy with real-time visual feedback to the user for precise and intuitive navigation. Bench-top validation demonstrated that navigation guidance reduced target position error by 2.03% with the planner and 2.85% with the replanner, compared to free navigation without planning assistance. Planning and replanning strategies also reduced collisions with cardiac structures, highlighting the platform's potential to improve procedural precision and safety.
|
| |
| 11:50-12:00, Paper ThAT2.6 | Add to My Program |
| Real-Time SLAM-Guided Closed-Loop Photodynamic Therapy with Pixel-Accurate Light-Dose Control |
|
| Lee, Hyesung | Korea Institute of Science and Technology |
| Yang, Sungwook | Korea Institute of Science and Technology |
Keywords: Medical Robots and Systems, SLAM, Hardware-Software Integration in Robotics
Abstract: Precise light-dose delivery is essential for photodynamic therapy (PDT), yet current handheld systems remain operator-dependent and lose accuracy under motion. We present a SLAM-guided, closed-loop control framework that enables co-temporal and co-spatial photodynamic diagnosis (PDD) and PDT with a single handheld endomicroscopic probe, while enforcing pixel-level dose control. The probe integrates a fiber bundle that shares a common optical path for both PDD and PDT and is paired with a digital micromirror device (DMD) for µm-scale pattern projection. An extended Kalman filter fuses optical-tracking measurements with texture-limited endomicroscopic images at 30 Hz, providing robust six-degree-of-freedom pose estimates that expand the probe's effective field of view and drive real-time pattern updates. A dose-map SLAM algorithm accumulates light dose over the reconstructed lesion surface during handheld scanning, while pixel-level dose control is enforced by referencing previously accumulated light at each location. Quantitative evaluation shows a spatial registration error between diagnostic and therapeutic systems within 5.2 µm. Experiments on fluorescence phantoms achieved sub-millimeter localization accuracy (0.3 mm RMSE), significantly outperforming vision-only and tracker-only baselines. Finally, tests on targets with quadrant-specific dose limits confirmed SLAM-based dose control, achieving dose uniformity within ±0.186 mJ/cm² across millimeter-scale regions.
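The pixel-level dose cap can be pictured as a per-location gating of the DMD pattern; a sketch with hypothetical names, not the authors' implementation:

    import numpy as np

    def dmd_mask(dose_map, pixel_ids, dose_limit, pulse_dose):
        """dose_map: accumulated dose per reconstructed surface location;
        pixel_ids: surface locations currently imaged by each DMD pixel.
        A pixel stays on only if one more pulse keeps its location under
        the per-location dose limit."""
        remaining = dose_limit - np.asarray(dose_map)[pixel_ids]
        return remaining >= pulse_dose   # boolean on/off pattern for the DMD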
|
| |
| 12:00-12:10, Paper ThAT2.7 | Add to My Program |
| OCT Imaging for Pose Estimation and Feedback Control of an Articulated Magnetic Surgical Tool |
|
| Fredin, Erik | University of Toronto |
| Pol, Nirmal | University of Toronto |
| Zaliznyi, Anton | University of Tartu |
| Fishman, Dmytro | University of Tartu |
| Diller, Eric D. | University of Toronto |
| Kahrs, Lueder Alexander | University of Toronto Mississauga |
Keywords: Computer Vision for Medical Robotics, Machine Learning for Robot Control, Deep Learning for Visual Perception
Abstract: Magnetically-driven surgical tools are a new class of millimetre-scale devices that could enable procedures such as minimally invasive neurosurgery due to their high dexterity at a small size. However, safe and effective control of these magnetic tools necessitates real-time observation of tool joint angles, which is challenging inside a surgical environment. Optical coherence tomography (OCT) is an emerging volumetric imaging technique offering 3D visualization of tissue and tools simultaneously, which we explore for joint angle estimation. While some previous studies have used OCT for estimating the pose of rigid instruments, those methods are specific to needle-like tools, and often have slow processing speed. In this work, we benchmark eight deep-learning models adapted from other 3D modalities to OCT data showing magnetic tools in a mock surgical environment. The models are tested in the presence of other objects, occlusion, noise, and the tool being partially outside of the OCT's field of view. The best performing model, VoxelNeXt, is adapted from 3D object detection in LiDAR scans, the first time a model of this kind is used on medical data. It infers tool pose with 0.6 mm position and 5° angular errors, with 40 ms inference time. We use this model to provide feedback for controlling a multi-jointed magnetic tool, demonstrating the robustness of OCT-based feedback control.
|
| |
| 12:10-12:20, Paper ThAT2.8 | Add to My Program |
| From Patient-Specific Digital Twin to Real-World Phantom: Autonomous Right Heart Catheterization |
|
| Wang, Yaxi | University College London |
| Xu, Mengzhe | University College London |
| Gaozhang, Wenlong | University College London |
| Wurdemann, Helge Arne | University College London |
Keywords: Medical Robots and Systems, Deep Learning Methods, Simulation and Animation
Abstract: Right heart catheterization (RHC) is a critical procedure for diagnosing and managing cardiovascular diseases such as heart failure, congenital heart disease, pulmonary edema, and pulmonary hypertension. However, the currently prevalent manual RHC procedure requires continuous communication between clinicians in the main control room and the operating room, leading to navigation inaccuracies and increased physical workload for clinicians during prolonged procedures. To overcome these challenges, this paper introduces a robotic system that enables autonomous RHC (Auto-RHC) by transferring a catheter decision-making model from patient-specific digital twins to real-world robotic intervention using deep learning algorithms. By creating a high-fidelity digital twin using the Simulation Open Framework Architecture and conducting virtual RHC interventions, images capturing the catheter balloon's position and aligned behavioral datasets were collected and used as input for a convolutional neural network architecture. The trained catheter decision-making model derived from the digital twin was then transferred to real-world implementations of robot-assisted Auto-RHC.
|
| |
| 12:20-12:30, Paper ThAT2.9 | Add to My Program |
| LUDO: Low-Latency Understanding of Deformable Objects Using Point Cloud Occupancy Functions |
|
| Henrich, Pit | Friedrich-Alexander-University Erlangen-Nurnberg (FAU) |
| Mathis-Ullrich, Franziska | Friedrich-Alexander-University Erlangen-Nurnberg (FAU) |
| Scheikl, Paul Maria | None |
Keywords: Surgical Robotics: Planning, Computer Vision for Medical Robotics, Deep Learning in Robotics and Automation, RGB-D Perception
Abstract: Accurately determining the shape of objects and the location of their internal structures within deformable objects is crucial for medical tasks that require precise targeting, such as robotic biopsies. We introduce LUDO, a method for accurate low-latency understanding of deformable objects. LUDO reconstructs objects in their deformed state, including their internal structures, from a single-view point cloud observation in under 30 ms using occupancy networks. LUDO provides uncertainty estimates for its predictions. Additionally, it provides explainability by highlighting key features in its input observations. Both uncertainty and explainability are important for safety-critical applications such as surgical interventions. We demonstrate LUDO's abilities for autonomous targeting of internal regions of interest (ROIs) in deformable objects. We evaluate LUDO in real-world robotic experiments, achieving a success rate of 98.9% for puncturing various ROIs inside deformable objects. LUDO demonstrates the potential to interact with deformable objects without the need for deformable registration methods.
|
| |
| ThAT3 Regular Session, Lehar 1-4 |
Add to My Program |
| Robot Perception II |
|
| |
| |
| 11:00-11:10, Paper ThAT3.1 | Add to My Program |
| RaCo-SLAM: A Physics-Informed 4D Radar SLAM with Co-Visibility Consistency Factor |
|
| Deng, Zishun | Nankai University |
| Lin, Wanbiao | Nankai University |
| Li, Can | Nankai University |
| Wang, Teng | Tsinghua University |
| Guo, Chao | JiangHuai Advanced Technology Center |
| Shen, Jiawei | Nankai University |
| Sun, Lei | Nankai University |
Keywords: SLAM, Localization, Mapping
Abstract: Robust all-weather localization is a critical capability for autonomous systems. While 4D mmWave radar offers superior resilience to adverse environmental conditions compared to LiDAR and cameras, its application in high-precision Simultaneous Localization and Mapping (SLAM) is hindered by significant challenges, including severe point cloud sparsity, complex noise characteristics, and the prevalence of dynamic objects. To address these issues, we propose RaCo-SLAM, a robust and real-time 4D mmWave radar SLAM framework with co-visibility consistency. This framework features a novel physics-informed probabilistic model for adaptive feature extraction from sparse and noisy point clouds. For global consistency, we introduce a co-visibility consistency factor (CoVC factor) into the global optimization, moving beyond conventional loop-closure methods. This factor directly minimizes point-to-point registration errors to enforce global consistency and is designed for parallel real-time execution on a standard CPU. Comprehensive evaluation on diverse and challenging real-world datasets demonstrates state-of-the-art accuracy and robustness, achieving real-time performance exceeding 40 Hz on a standard CPU. To benefit the community, the code and collected dataset will be released at https://github.com/sudo-robot0/RaCo-SLAM.
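As a rough illustration of the point-to-point registration error that the CoVC factor minimizes, here is a minimal sketch assuming known correspondences between two co-visible scans; the names and the simplified residual are ours, not the authors' implementation:

```python
import numpy as np

def point_to_point_residual(src, dst, T):
    """Sum of squared point-to-point errors after applying the 4x4
    homogeneous pose T to the source points. A hypothetical stand-in
    for a registration residual between co-visible radar scans."""
    src_h = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coords
    src_tf = (T @ src_h.T).T[:, :3]                    # transformed points
    return float(np.sum(np.linalg.norm(src_tf - dst, axis=1) ** 2))

# Identity pose on matched points gives zero residual.
pts = np.random.rand(100, 3)
print(point_to_point_residual(pts, pts, np.eye(4)))  # 0.0
```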
|
| |
| 11:10-11:20, Paper ThAT3.2 | Add to My Program |
| CETUS: Causal Event-Driven Temporal Modeling with Unified Variable-Rate Scheduling |
|
| Liang, Hanfang | Jianghan University |
| Wang, Bing | Jianghan University |
| Zhang, Shizhen | Jianghan University |
| Jiang, Wen | Beijing Institute of Technology |
| Yang, Yizhuo | Nanyang Technological University |
| Guo, Weixiang | Nanyang Technological University |
| Yuan, Shenghai | Nanyang Technological University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Visual Learning
Abstract: Event cameras capture asynchronous pixel-level brightness changes with microsecond temporal resolution, offering unique advantages for high-speed vision tasks. Existing methods often convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which inevitably require predefined time windows and thus introduce window latency. Meanwhile, pointwise detection methods incur computational costs too high for real-time operation. To overcome these limitations, we propose the Variable-Rate Spatial Event Mamba, a novel architecture that directly processes raw event streams without intermediate representations. Our method introduces a lightweight causal spatial neighborhood encoder to efficiently capture local geometric relations, followed by Mamba-based state space models for scalable temporal modeling with linear complexity. During inference, a controller adaptively adjusts the processing speed according to the event rate, achieving an optimal balance between window latency and inference latency.
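The variable-rate idea, adapting processing speed to the incoming event rate to trade window latency against inference latency, can be pictured with a simple proportional rule. This is a hypothetical sketch; the paper does not specify its controller at this level of detail:

```python
def adjust_processing_rate(event_rate, rate_low, rate_high,
                           speed_min, speed_max):
    """Map the instantaneous event rate (events/s) to a processing speed.

    Low event rates allow slower processing (saving compute); high rates
    demand faster updates to keep latency bounded. All names and the
    linear mapping are illustrative assumptions."""
    # Normalize the event rate into [0, 1] and clamp.
    alpha = (event_rate - rate_low) / max(rate_high - rate_low, 1e-9)
    alpha = min(max(alpha, 0.0), 1.0)
    return speed_min + alpha * (speed_max - speed_min)

# 5e5 events/s sits mid-range, so the speed lands between the extremes.
print(adjust_processing_rate(5e5, 1e5, 1e6, speed_min=1.0, speed_max=8.0))
```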
|
| |
| 11:20-11:30, Paper ThAT3.3 | Add to My Program |
| SFCo-Nav: Efficient Zero-Shot Visual Language Navigation Via Collaboration of Slow LLM and Fast Attributed Graph Alignment |
|
| Xiong, Chaoran | Shanghai Jiao Tong University |
| Wei, Litao | ShangHai JiaoTong University |
| Hu, Xinhao | Shanghai Jiao Tong University |
| Ma, Kehui | Shanghai Jiao Tong University |
| Xia, Ziyi | Shanghai Jiao Tong University |
| Jiang, Zixin | Shanghai Jiao Tong University |
| Sun, Zhen | Shanghai Jiao Tong University |
| Pei, Ling | Shanghai Jiao Tong University |
Keywords: Vision-Based Navigation, Embodied Cognitive Science, Agent-Based Systems
Abstract: Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to Visual Language Navigation (VLN), where an agent follows natural language instructions using only ego perception and reasoning. However, existing zero‑shot methods typically construct a naive observation graph and perform per‑step VLM–LLM inference on it, resulting in high latency and computation costs that limit real‑time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow–fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow–fast bridge that aligns structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaboration zero-shot VLN system supporting asynchronous LLM triggering based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo‑Nav matches or exceeds prior state‑of‑the‑art zero‑shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo‑Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
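The confidence-gated triggering of the slow planner amounts to a small piece of glue logic. A hypothetical sketch, where the threshold, planner interface, and names are ours:

```python
def maybe_trigger_slow_planner(confidence, threshold, slow_planner,
                               perceived_graph, subgoals):
    """Invoke the expensive LLM planner only when the fast navigator's
    graph-alignment confidence drops below a threshold. Illustrative
    sketch of an asynchronous slow-fast bridge, not the paper's code."""
    if confidence < threshold:
        # Replan: the slow LLM produces a fresh chain of subgoals.
        return slow_planner(perceived_graph)
    return subgoals  # keep executing the current plan

subgoals = ["go to door", "enter kitchen"]
plan = maybe_trigger_slow_planner(0.3, 0.5,
                                  lambda graph: ["replan: locate door"],
                                  {}, subgoals)
print(plan)  # slow planner fired because confidence 0.3 < threshold 0.5
```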
|
| |
| 11:30-11:40, Paper ThAT3.4 | Add to My Program |
| A Semantic and Occlusion-Aware GM-PHD Filter |
|
| Menezes, Jovan | Cornell University |
| Campbell, Mark | Cornell University |
Keywords: Deep Learning for Visual Perception, Probabilistic Inference, Intelligent Transportation Systems
Abstract: This paper proposes a new birth model including semantic information derived from deep learning to create an occlusion-aware Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter. Unlike prior approaches that rely on simplistic or uniform assumptions, the proposed Semantic-Occlusion Aware (S-OA) birth model defines initialization terms by explicitly considering regions of occlusion and by leveraging semantic information about the environment. This enables the filter to accurately represent where new objects are more likely to appear, thereby improving tracking performance in complex and high-density driving scenarios. The method is evaluated through Monte Carlo simulations and experiments on the KITTI dataset. Performance is assessed by measuring the latency between first detection and track initiation, along with the mean absolute cardinality error and the Optimal Subpattern Assignment (OSPA) metric. Results demonstrate that the S-OA birth model reduces initialization delay in occlusion-heavy settings, matching or outperforming the strongest baseline in approximately 70% of cases. A sensitivity analysis of birth model weights is also provided. Overall, the findings underscore the benefits of integrating occlusion reasoning and semantic priors into Bayesian tracking frameworks for autonomous driving.
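To make the birth-model idea concrete, the following toy sketch assigns GM-PHD birth weights over candidate cells using occlusion and semantic priors. All names and the specific weighting are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def birth_components(cells, occlusion_prob, semantic_prior, base_weight):
    """Assign GM-PHD birth weights over a grid of candidate cells.

    Birth weight is boosted where occlusion makes a previously hidden
    object likely to appear, and scaled by a semantic prior (e.g., road
    vs. building). Covariances are omitted for brevity."""
    w = base_weight * np.asarray(occlusion_prob) * np.asarray(semantic_prior)
    w /= max(w.sum(), 1e-12)          # normalize to a fixed birth mass
    return [(wi, np.asarray(c, float)) for wi, c in zip(w, cells)]

# Three candidate cells: the occluded road cell dominates the birth mass.
cells = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
comps = birth_components(cells, occlusion_prob=[0.1, 0.9, 0.4],
                         semantic_prior=[1.0, 1.0, 0.2], base_weight=0.1)
print(comps)
```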
|
| |
| 11:40-11:50, Paper ThAT3.5 | Add to My Program |
| Y-MAP-Net: Learning from Foundation Models for Real-Time, Multi-Task Scene Perception |
|
| Qammaz, Ammar | FORTH and University of Crete |
| Vasilikopoulos, Nikolaos | FORTH and University of Crete |
| Oikonomidis, Iason | FORTH |
| Argyros, Antonis | University of Crete and FORTH |
Keywords: Deep Learning for Visual Perception, Recognition, Visual Learning
Abstract: We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, semantic segmentation, and generates multi-label captions in a single forward pass. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the learning of the network, allowing it to distill their capabilities into a unified real-time inference architecture. Y-MAP-Net exhibits strong generalization, architectural simplicity, and computational efficiency, making it well-suited for resource-constrained robotic platforms. By providing rich 3D, semantic, and contextual scene understanding from low-cost RGB cameras, Y-MAP-Net supports key robotic capabilities such as object manipulation and human-robot interaction. To encourage future research and reproducibility, we make our code publicly available.
|
| |
| 11:50-12:00, Paper ThAT3.6 | Add to My Program |
| SSQA: Sibling-Selective Quadtree Attention for Hierarchical Modeling in Perception Tasks |
|
| Chen, Yufan | University of Southern California |
| Bali, Arnav | South River High School |
| Liu, Angela | University of Maryland, College Park |
| Zheng, Laura | University of Maryland, College Park |
| Lin, Ming C. | University of Maryland, College Park |
Keywords: AI-Based Methods, Computer Vision for Automation, RGB-D Perception
Abstract: Perception tasks for navigation in robotics, including aerial platforms such as drones and autonomous driving systems, are inherently structured. Drone-mounted cameras typically capture sky above, terrain below, and obstacles or man-made structures in between, while driving data often contain organized road layouts, lane markings, and surrounding agents. Motivated by these axis-aligned structural priors, which make such data more structured than generic images, we hypothesize that processing information in a quadtree-like manner can not only model features effectively in a hierarchical manner, but also offer an efficient linear-time alternative to vanilla attention mechanisms, which run in quadratic time. In this paper, we propose Sibling-Selective Quadtree Attention (SSQA), which models image tokens hierarchically as a structured, full quadtree. We provide an analytical complexity analysis that guarantees linear-time feature modeling, in addition to empirical experiments comparing inference speeds with other popular modeling approaches, such as Mamba 2 and Quadtree Attention. Our results, benchmarked across several tasks, show that we achieve results at least as good as, and often notably better than, other methods at a fraction of the computational cost.
|
| |
| 12:00-12:10, Paper ThAT3.7 | Add to My Program |
| TacTape: Real-Time High-Accuracy Tactile Fiducial System with Structured 3D Texture for Vision-Based Tactile Sensors |
|
| Wang, Meng | Beijing Institute for General Artificial Intelligence |
| Li, Wanlin | Beijing Institute for General Artificial Intelligence (BIGAI) |
| Chen, Qiuxuan | Beijing University of Posts and Telecommunications |
| Huang, Yuzhe | Beijing University of Aeronautics and Astronautics |
| Li, Hang | Beijing Institute for General Artificial Intelligence |
| Althoefer, Kaspar | Queen Mary University of London |
| Jiao, Ziyuan | Beijing Institute for General Artificial Intelligence |
| Su, Yao | Beijing Institute for General Artificial Intelligence |
| Liu, Hangxin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Keywords: Force and Tactile Sensing
Abstract: Vision-based tactile sensors enable high-resolution tactile perception by capturing image-based contact data. However, their utility in tactile localization is limited by their inherently small and local sensing area, as well as their dependence on distinct object surface features. We propose TacTape, a novel tactile fiducial system that enables accurate and efficient tactile localization by attaching textured tape to object surfaces. A lightweight algorithm allows real-time estimation of contact position and orientation from partially observed structured 3D textures. Experiments demonstrate that TacTape achieves sub-millimeter positional and sub-degree angular localization accuracy, and operates significantly faster than classic tactile mapping methods.
|
| |
| 12:10-12:20, Paper ThAT3.8 | Add to My Program |
| Event Spectroscopy: Event-Based Multispectral and Depth Sensing Using Structured Light |
|
| Geckeler, Christian | ETH Zürich |
| Neugebauer, Niklas | ETH Zurich |
| Muglikar, Manasi | University of Zurich |
| Scaramuzza, Davide | University of Zurich |
| Mintchev, Stefano | ETH Zurich |
Keywords: RGB-D Perception, Computer Vision for Automation, Aerial Systems: Perception and Autonomy
Abstract: Uncrewed aerial vehicles (UAVs) are increasingly deployed in forest environments for tasks such as environmental monitoring and search and rescue, which require safe navigation through dense foliage and precise data collection. Traditional sensing approaches, including passive multispectral and RGB imaging, suffer from latency, poor depth resolution, and strong dependence on ambient light—especially under forest canopies. In this work, we present a novel event spectroscopy system that simultaneously enables high-resolution, low-latency depth reconstruction and multispectral imaging using a single sensor. Depth is reconstructed using structured light, and by modulating the wavelength of the projected structured light, our system captures spectral information in controlled bands between 650 nm and 850 nm. We demonstrate up to 60% improvement in RMSE over commercial depth sensors and validate the spectral accuracy against a reference spectrometer and commercial multispectral cameras, demonstrating comparable performance. A portable version limited to RGB is used to collect real-world depth and spectral data from the Masoala Rainforest. We demonstrate color image reconstruction and material differentiation between leaves and branches using this spectral and depth data. Our results show that adding depth (available at no extra effort with our setup) to material differentiation improves the accuracy by over 30% compared to a color-only method. Our system, tested in both lab and real-world rainforest environments, shows strong performance in depth estimation, RGB reconstruction, and material differentiation—paving the way for lightweight, integrated, and robust UAV perception and data collection in complex natural environments.
|
| |
| 12:20-12:30, Paper ThAT3.9 | Add to My Program |
| Efficient Multi-Camera Tokenization with Triplanes for End-To-End Driving |
|
| Ivanovic, Boris | NVIDIA |
| Saltori, Cristiano | NVIDIA |
| You, Yurong | Cornell University |
| Wang, Yan | NVIDIA |
| Luo, Wenjie | University of Toronto |
| Pavone, Marco | Stanford University |
Keywords: Representation Learning, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on large-scale AV datasets and a state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.
|
| |
| ThAT4 Regular Session, Strauss 1-2 |
Add to My Program |
| Wearable and Rehabilitation Robotics |
|
| |
| Co-Chair: Zanotto, Damiano | Stevens Institute of Technology |
| |
| 11:00-11:10, Paper ThAT4.1 | Add to My Program |
| Enabling Automated and Personalized Motor Assessment in Neurorehabilitation: Generating Patient-Specific Reference Movements with a Virtual Humanoid Twin |
|
| Legrand, Mathilde | GIPSA-Lab, Univ. Grenoble Alpes, CNRS, Grenoble INP |
| Lambercy, Olivier | Rehabilitation Engineering Laboratory, ETH Zurich |
| Gassert, Roger | Rehabilitation Engineering Laboratory, ETH Zurich |
Keywords: Rehabilitation Robotics, Human and Humanoid Motion Analysis and Synthesis
Abstract: Recovering upper-limb motor functions impaired by trauma or neurological disease is a long and challenging process. To monitor a patient’s progress through the various stages of rehabilitation and guide therapy, regular movement assessment is essential. However, such evaluations are rarely conducted in clinical practice due to time constraints and the need for cumbersome equipment. A key limitation is access to reference motion data, typically derived from averaged movements of unimpaired individuals, which requires new data collection for each task and lacks personalization (e.g., accounting for individual morphology or motor abilities). We present a novel method to generate patient-specific reference motions directly from the patient’s hand pose using a personalized model of the patient, the Virtual Humanoid Twin (VHT). By solving an ergonomics-based optimal control problem, our approach produces tailored reference motions without prior task-specific data. We validated this method on two motor tasks (reaching and pouring) using data from seven unimpaired participants, with and without an elbow orthosis restricting motion. Analysis of joint trajectories, range of motion, and normalized multi-dimensional Dynamic Time Warping confirmed that VHT-generated motions were more ergonomic than those with the orthosis and closely matched natural movements. The method’s rapid generation time can also enable real-time reference motion estimation, in parallel with the patient’s movements. This innovation simplifies access to reference motions while providing personalization. It creates opportunities for automated motor assessment in neurorehabilitation, enhancing patient recovery tracking through regular evaluations.
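One of the evaluation metrics above, multi-dimensional Dynamic Time Warping, has a compact textbook form. A minimal sketch with Euclidean frame costs (the paper additionally normalizes the distance):

```python
import numpy as np

def multidim_dtw(a, b):
    """DTW distance between two multi-dimensional joint trajectories
    a (T1 x D) and b (T2 x D). Standard dynamic-programming recursion."""
    t1, t2 = len(a), len(b)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # per-frame cost
            acc[i, j] = cost + min(acc[i - 1, j],        # insertion
                                   acc[i, j - 1],        # deletion
                                   acc[i - 1, j - 1])    # match
    return acc[t1, t2]

a = np.random.rand(50, 7)   # e.g., 7 joint angles over 50 frames
print(multidim_dtw(a, a))   # 0.0 for identical trajectories
```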
|
| |
| 11:10-11:20, Paper ThAT4.2 | Add to My Program |
| Learning a Shape-Adaptive Assist-As-Needed Rehabilitation Policy from Therapist-Informed Input |
|
| Hou, Zhimin | Lingnan University |
| Hou, Jiacheng | Tongji University |
| Chen, Xiao | Technical University of Munich |
| Sadeghian, Hamid | Technical University of Munich |
| Ren, Tianyu | Technical University of Munich |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Rehabilitation Robotics, Medical Robots and Systems, Human-Centered Robotics
Abstract: Therapist-in-the-loop robotic rehabilitation has shown promise in enhancing rehabilitation outcomes by integrating the strengths of therapists and robotic systems. However, its broader adoption is limited by insufficient interaction and limited adaptation capability. This article proposes a novel telerobotics-mediated framework that enables therapists to deliver assist-as-needed (AAN) therapy based on two primary contributions. First, the reference motion for movement therapy is generated to encourage the active participation of patients based on their motion preferences, encoded using a probabilistic model. Second, the telerobotics-mediated system enables the therapist to specify via-points, providing minimal but effective assistance for AAN therapy by partially deforming the reference motion. The effectiveness of the proposed strategy was validated on a telerobotic system through two representative rehabilitation tasks, demonstrating its potential for remote AAN therapy.
|
| |
| 11:20-11:30, Paper ThAT4.3 | Add to My Program |
| Reinforcement Learning Control Outperforms Iterative Learning in Exoskeleton-Assisted Gait Training |
|
| Li, Andy | Stevens Institute of Technology |
| Li, Haoran | Stevens Institute of Technology |
| Teker, Aytac | Stevens Institute of Technology |
| Hernandez-Rocha, Mariana | Stevens Institute of Technology |
| Gebre, Biruk | Stevens Institute of Technology |
| Nolan, Karen J. | Kessler Foundation |
| Pochiraju, Kishore | Stevens Institute of Technology |
| Zanotto, Damiano | Stevens Institute of Technology |
Keywords: Rehabilitation Robotics, Prosthetics and Exoskeletons, Wearable Robotics
Abstract: Learning-based controllers are increasingly adopted in lower-extremity powered exoskeletons, yet their advantages over traditional adaptive approaches remain underexplored. We compared two adaptive assist-as-needed (AAN) controllers for gait training with an ankle exoskeleton: a reinforcement learning-based controller (RL-AAN) and a conventional iterative learning controller (ILC-AAN). Both adjusted assistance stride-by-stride, delivering torque as a percentage of the wearer's biological plantarflexion moment—estimated online with a subject-agnostic model—and progressively faded assistance as performance improved. Healthy participants walked on a self-paced treadmill under a perturbed-gait protocol. Performance was assessed as average percent stride-velocity (SV) improvement relative to unassisted perturbed walking (Δ%ε SV+) and percent of strides above the SV threshold (N% SV+). During training, RL-AAN and ILC-AAN elicited comparable gains in Δ%ε SV+ between the first and last training sessions, but RL-AAN yielded greater adherence across sessions, as indicated by larger N% SV+. After training, RL-AAN demonstrated superior retention in Δ%ε SV+ and N% SV+. These results support RL-AAN as a promising strategy for subject-tailored gait training, motivating future studies in neurological and musculoskeletal populations.
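For contrast with the RL controller, a first-order iterative learning update with assist-as-needed fading can be sketched as follows. The gains and the fading rule are illustrative assumptions, not the paper's ILC-AAN design:

```python
def ilc_assistance_update(torque, error, learning_gain, fade_rate):
    """Stride-by-stride iterative learning update for AAN assistance.

    Raise assistance in proportion to the residual stride-velocity
    error, then fade it as performance improves."""
    torque = torque + learning_gain * error   # learn from the last stride
    torque = (1.0 - fade_rate) * torque       # assist-as-needed fading
    return max(torque, 0.0)                   # no negative assistance

torque, err = 0.0, 0.2
for _ in range(5):
    torque = ilc_assistance_update(torque, err, learning_gain=2.0,
                                   fade_rate=0.05)
    err *= 0.5   # pretend performance improves each stride
print(torque)    # assistance grows, then plateaus as the error shrinks
```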
|
| |
| 11:30-11:40, Paper ThAT4.4 | Add to My Program |
| Development of a Mixed-Control Ankle Assist Device with Sensor-Fusion-Based Phase Recognition for Walking Exercise Promotion |
|
| Wang, Chang-Wen | Waseda University |
| Wang, Donglin | Waseda University |
| Wang, Huan | Waseda University |
| Yan, Shuo | Waseda University |
| Osawa, Keisuke | Kyushu University |
| Nakagawa, Kei | Hiroshima University |
| Tanaka, Eiichiro | Waseda University |
Keywords: Rehabilitation Robotics, Model Learning for Control, Sensor Fusion
Abstract: "Frail" elderly often experience walking impairments that limit independence and sustained physical activity. Although various assistive devices exist, many rely on single-mode control, limiting adaptability, responsiveness to gait variability, and voluntary motion. To improve, we developed a wearable ankle-assist device with real-time gait phase recognition and multi-mode control. Sensor fusion of inertial and plantar-pressure data enables robust five-phase segmentation, with optimal weights tuned by Particle Swarm Optimization. Based on detected gait phase, the controller dynamically switches between speed, torque, and free modes, adapting to cadence variations. Treadmill experiments showed that mixed control increased walking distance (251 m to 282 m (p < 0.05)), reduced heart rate change (20% to 10% (p < 0.01)). Gait analysis confirmed comfort and less resistance. These findings demonstrate that phase-aware adaptive assistance balances propulsion and natural motion, supporting mobility and reducing strain. This framework provides a practical basis for wearable ankle-assist systems in elderly rehabilitation and daily use.
|
| |
| 11:40-11:50, Paper ThAT4.5 | Add to My Program |
| A Novel Upper Limb Rehabilitation Framework Based on Dual-Arm Robotics for Therapist-Like Traction Training |
|
| Lin, Gao | Northeastern University |
| Wang, Fei | Northeastern University |
| Han, Shuai | China Medical University |
Keywords: Rehabilitation Robotics, Physical Human-Robot Interaction
Abstract: In this letter, we propose a novel upper limb rehabilitation framework based on dual-arm robotics for therapist-like traction training. Prioritizing patient safety, an 8-DOF kinematic model of the upper limb complex is derived to evaluate the reachable workspace of the end-of-arm and forearm during interaction with a dual-arm robot. Leveraging the characteristics of dual-arm rehabilitation, a non-redundant inverse kinematics method is proposed to constrain joint angles, thereby establishing a safety mechanism under dual constraints. Second, considering training science and compliance, a potential field control strategy is introduced to enable the robot to learn the therapist's traction characteristics from a single demonstration. Combined with master-slave control, it reproduces the therapist's assistance and allows for compliant interaction. Experimental results show that the proposed framework combines the strong adaptability and comfort of end-effector robots with the precise rehabilitation of exoskeleton robots. As dual-arm and humanoid robots become more widely adopted, the proposed scheme holds promise for delivering therapist-like safe, scientific, and compliant rehabilitation in clinical and home settings.
|
| |
| 11:50-12:00, Paper ThAT4.6 | Add to My Program |
| PAPRLE (Plug-And-Play Robotic Limb Environment): A Modular Ecosystem for Robotic Limbs |
|
| Kwon, Obin | University of Illinois Urbana-Champaign |
| Yamsani, Sankalp | University of Illinois Urbana-Champaign |
| Myers, Noboru | University of Illinois Urbana-Champaign |
| Taylor, Sean | University of Illinois at Urbana Champaign |
| Hong, Jooyoung | University of Illinois at Urbana-Champaign |
| Park, Kyungseo | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
| Alspach, Alex | Toyota Research Institute |
| Kim, Joohyung | University of Illinois Urbana-Champaign |
Keywords: Telerobotics and Teleoperation, Software-Hardware Integration for Robot Systems, Multi-Robot Systems
Abstract: We introduce PAPRLE (Plug-And-Play Robotic Limb Environment), a modular ecosystem that enables flexible placement and control of robotic limbs. With PAPRLE, a user can re-configure the arrangement of the robotic limbs and control them with various kinds of devices, including puppeteers, handheld controllers, and VR devices. To support diverse multi-limb setups, we also develop a pluggable puppeteer device that can be easily mounted and adapted to different robot configurations. PAPRLE unifies control signals across heterogeneous input devices and supports both task-space and joint-space control modalities, enabling control using puppeteers with different kinematic structures or devices without joint information, such as VR and handheld devices. It also offers feedback mechanisms for teleoperation, including force feedback between structurally dissimilar leader-follower pairs. The modular design of the system facilitates novel spatial arrangements of limbs and enables scalable data collection, thereby advancing research in embodied AI and learning-based control. We validate PAPRLE in real-world settings, demonstrating its versatility across diverse combinations of leader devices and follower robots.
|
| |
| 12:00-12:10, Paper ThAT4.7 | Add to My Program |
| Autonomous Dental Surgery for Root Canal Treatment: Compensating for Robot-Patient Misalignment and File Deflection (I) |
|
| Cheng, Hao-Fang | National Taiwan University |
| Ho, Yi-Ching | Taipei Veterans General Hospital |
| Chen, Cheng-Wei | National Taiwan University |
Keywords: Medical Robots and Systems, Reactive and Sensor-Based Planning, Force Control
Abstract: Robotic technologies are increasingly used in dentistry for their precision in delicate procedures. While most dental robots focus on implant surgery, automating root canal treatment (RCT) remains challenging due to the need to guide a thin, flexible endodontic file through a narrow, curved root canal without causing ledging or file fracture. Patient movements—particularly those that induce additional file bending during insertion—further complicate robot-assisted procedures. This study presents an autonomous approach for root canal cleaning and shaping by combining force admittance and position tracking. A novel Patient Tracking Module, which connects the patient’s dental brace to the robot end-effector via string potentiometers, is developed to estimate real-time robot-patient pose. Additionally, a file flexibility model is proposed to predict and compensate for file deflection during insertion. A hybrid position/force control strategy, which integrates these estimations, autonomously guides file manipulation, minimizes misalignment, and therefore reduces the risk of file fracture. Experimental validation demonstrates the system’s feasibility and potential for clinical application in precision endodontic procedures.
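The hybrid position/force idea, complying with the measured insertion force while tracking the planned canal path, can be sketched as a single integration step. The gains, the scalar formulation, and all names are illustrative assumptions, not the paper's controller:

```python
import numpy as np

def admittance_step(x, f_meas, f_des, admittance_gain,
                    x_track, k_track, dt):
    """One Euler step of a toy hybrid position/force law for file
    insertion along a single axis."""
    v_force = admittance_gain * (f_des - f_meas)  # comply with contact force
    v_track = k_track * (x_track - x)             # follow the planned path
    return x + (v_force + v_track) * dt

x = 0.0
for _ in range(100):
    f_meas = 0.5 * x            # toy contact model: force grows with depth
    x = admittance_step(x, f_meas, f_des=1.0, admittance_gain=0.5,
                        x_track=3.0, k_track=0.2, dt=0.01)
print(x)  # settles where force compliance and path tracking balance
```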
|
| |
| 12:10-12:20, Paper ThAT4.8 | Add to My Program |
| A Stable Model Reference Adaptive Controller Developed for a Prosthetic Hand Wrist (I) |
|
| Sulaiman, Shifa | Aalborg University |
| De Risi, Paolino | Università Degli Studi Di Napoli Federico II |
| Schetter, Francesco | University of Naples, Federico II, Naples |
| Ficuciello, Fanny | Università Di Napoli Federico II |
Keywords: Tendon/Wire Mechanism, Prosthetics and Exoskeletons, Motion Control
Abstract: Advanced control algorithms are essential for enhancing the functionality of prosthetic hands, enabling them to operate in diverse conditions. This paper presents a Model Reference Adaptive Controller (MRAC) developed for a tendon-driven soft continuum wrist, integrated into the ‘PRISMA HAND II’ prosthetic hand. The primary objective of our research is to design an adaptive controller that facilitates wrist movements, rejects external disturbances, and minimizes computational requirements. To achieve this, kinematic and dynamic models of the wrist are developed based on the Piece-wise Constant Curvature (PCC) hypothesis. The controller consists of a reference model generated using the PCC model, and state errors are evaluated by comparing the responses of the reference model to those of the wrist model. These errors are reduced using the MRAC approach to make the wrist’s behavior closely align with that of the reference model. Stability of the closed-loop system is ensured using the Lyapunov direct method, along with the ‘New Theorem of Stability’, a replacement for Barbalat’s lemma, ensuring that the error between the reference model and the actual system converges to zero and that the adaptive gains stabilize to fixed values.
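The MRAC principle itself can be demonstrated on a textbook scalar plant; the sketch below uses Lyapunov-based adaptation laws on a first-order system rather than the paper's PCC wrist model, and all parameters are illustrative:

```python
def simulate_mrac(a=1.0, b=2.0, am=-4.0, bm=4.0, gamma=5.0,
                  dt=1e-3, steps=20000):
    """Lyapunov-based MRAC on the scalar plant x' = a*x + b*u, tracking
    the reference model xm' = am*xm + bm*r. Textbook example only."""
    x = xm = e = 0.0
    th_r = th_x = 0.0          # adaptive feedforward / feedback gains
    for k in range(steps):
        r = 1.0 if (k * dt) % 4 < 2 else -1.0   # square-wave reference
        u = th_r * r + th_x * x                  # adaptive control law
        e = x - xm                               # tracking error
        # Lyapunov adaptation laws (sign(b) = +1 assumed known).
        th_r -= gamma * e * r * dt
        th_x -= gamma * e * x * dt
        x += (a * x + b * u) * dt                # plant (Euler step)
        xm += (am * xm + bm * r) * dt            # reference model
    return e

print(simulate_mrac())  # tracking error after 20 s of adaptation (small)
```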
|
| |
| 12:20-12:30, Paper ThAT4.9 | Add to My Program |
| Lightweight Fingernail Haptic Device: Unobstructed Fingerpad Force and Vibration Feedback for Enhanced Virtual Dexterous Manipulation (I) |
|
| Xu, Yunxiu | The University of Tokyo |
| Wang, Siyu | Institute of Science Tokyo |
| Hasegawa, Shoichi | Tokyo Institute of Technology |
| |
| ThI2I Interactive Session, Hall C |
Add to My Program |
| Interactive Session 6 |
|
| |
| |
| 15:00-16:30, Paper ThI2I.1 | Add to My Program |
| Multimodal Variational DeepMDP: An Efficient Approach for Industrial Assembly in High-Mix, Low-Volume Production |
|
| Bartyzel, Grzegorz | AGH University of Science and Technology |
Keywords: Reinforcement Learning, Assembly, Representation Learning
Abstract: Transferability, along with sample efficiency, is a critical factor for a reinforcement learning (RL) agent's successful application in real-world contact-rich manipulation tasks, such as product assembly. For instance, in the case of the industrial insertion task on high-mix, low-volume (HMLV) production lines, transferability could eliminate the need for machine retooling, thus reducing production line downtimes. In our work, we introduce a method called Multimodal Variational DeepMDP (MVDeepMDP) that demonstrates the ability to generalize to various environmental variations not encountered during training. The key feature of our approach involves learning a multimodal latent dynamic representation. We demonstrate the effectiveness of our method in the context of an electronic parts insertion task, which is challenging for RL agents due to the diverse physical properties of the non-standardized components, as well as simple 3D-printed block insertion. Furthermore, we evaluate the transferability of MVDeepMDP and analyze the impact of the balancing mechanism of the generalized Product of Experts, which is used to combine observable modalities. Finally, we explore the influence of separately processing state modalities of different physical quantities, such as pose and 6D force/torque (F/T) data.
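The Product of Experts used to combine modality latents reduces, in the diagonal-Gaussian case, to precision-weighted fusion. A minimal sketch of this standard identity (the paper additionally analyzes a learned balancing mechanism):

```python
import numpy as np

def product_of_experts(mus, sigmas):
    """Fuse per-modality Gaussians N(mu_i, diag(sigma_i^2)) via a Product
    of Experts: precisions add, and the fused mean is their
    precision-weighted average."""
    precisions = [1.0 / (s ** 2 + 1e-8) for s in sigmas]
    prec = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / prec
    return mu, np.sqrt(1.0 / prec)

# Toy 2-D case: fuse a pose latent with a force/torque latent. Each
# modality dominates the dimension where it is most certain.
mu_pose, sig_pose = np.array([0.0, 1.0]), np.array([0.1, 1.0])
mu_ft, sig_ft = np.array([0.4, 1.2]), np.array([1.0, 0.1])
print(product_of_experts([mu_pose, mu_ft], [sig_pose, sig_ft]))
```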
|
| |
| 15:00-16:30, Paper ThI2I.2 | Add to My Program |
| LLM-Based Multi-Agent Reinforcement Learning: Challenges and Future Directions |
|
| Sun, Chuanneng | Rutgers University |
| Huang, Songjun | Jilin University |
| Pompili, Dario | Rutgers University |
| |
| 15:00-16:30, Paper ThI2I.3 | Add to My Program |
| KiGRAS: Kinematic-Driven Generative Model for Realistic Agent Simulation |
|
| Zhao, Jianbo | University of Science and Technology of China |
| Zhuang, Jiaheng | Tsinghua University |
| Zhou, Qibin | Chongqing Afari Intelligent Drive Co., Ltd |
| Ban, Taiyu | University of Science and Technology of China |
| Xu, Ziyao | Moonshot AI |
| Zhou, Hangning | Chongqing Afari Intelligent Drive Co., Ltd |
| Wang, Junhe | Chongqing Afari Intelligent Drive Co., Ltd |
| Wang, Guoan | Chongqing Afari Intelligent Drive Co., Ltd |
| Li, Zhiheng | Tsinghua University |
| Li, Bin | University of Science and Technology of China |
Keywords: Intelligent Transportation Systems, Deep Learning Methods
Abstract: Trajectory generation is a pivotal task in autonomous driving. Recent studies have introduced the autoregressive paradigm, leveraging the state transition model to approximate future trajectory distributions. This paradigm closely mirrors the real-world trajectory generation process and has achieved notable success. However, its potential is limited by the ineffective representation of realistic trajectories within the redundant state space. To address this limitation, we propose the Kinematic-Driven Generative Model for Realistic Agent Simulation (KiGRAS). Instead of modeling in the state space, KiGRAS factorizes the driving scene into action probability distributions at each time step, providing a compact space to represent realistic driving patterns. By establishing physical causality from actions (cause) to trajectories (effect) through the kinematic model, KiGRAS eliminates massive redundant trajectories. All states derived from actions in the cause space are constrained to be physically feasible. Furthermore, redundant trajectories representing identical action sequences are mapped to the same representation, reflecting their underlying actions. This approach significantly reduces task complexity and ensures physical feasibility. KiGRAS achieves state-of-the-art performance in Waymo's SimAgents Challenge, ranking first on the WOMD leaderboard with significantly fewer parameters than other models. The video documentation is available at https://kigras-mach.github.io/KiGRAS/.
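The action-to-trajectory causality at the heart of KiGRAS can be illustrated by rolling an action sequence through a kinematic model. The sketch below uses a standard kinematic bicycle model, which may differ from the paper's exact kinematics and action space:

```python
import numpy as np

def rollout_bicycle(state, actions, wheelbase, dt):
    """Roll out (acceleration, steering angle) actions through a
    kinematic bicycle model, so every resulting trajectory is
    physically feasible by construction."""
    x, y, yaw, v = state
    traj = [(x, y, yaw, v)]
    for accel, steer in actions:
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        yaw += v * np.tan(steer) / wheelbase * dt
        v = max(v + accel * dt, 0.0)   # no reversing in this toy model
        traj.append((x, y, yaw, v))
    return np.array(traj)

# Constant gentle left turn at 10 m/s for 2 s.
print(rollout_bicycle((0, 0, 0, 10), [(0.0, 0.05)] * 20, 2.8, 0.1)[-1])
```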
|
| |
| 15:00-16:30, Paper ThI2I.4 | Add to My Program |
| You Can't Always Get What You Want: Games of Ordered Preference |
|
| Lee, Dong Ho | The University of Texas at Austin |
| Peters, Lasse | Delft University of Technology |
| Fridovich-Keil, David | The University of Texas at Austin |
Keywords: Multi-Robot Systems, Optimization and Optimal Control, Constrained Motion Planning
Abstract: We study noncooperative games, in which each player's objective is composed of a sequence of ordered, and potentially conflicting, preferences. Problems of this type naturally model a wide variety of scenarios: for example, drivers at a busy intersection must balance the desire to make forward progress with the risk of collision. Mathematically, these problems possess a nested structure, and to behave properly players must prioritize their most important preference, and only consider less important preferences to the extent that they do not compromise performance on more important ones. We consider multi-agent, noncooperative variants of these problems, and seek generalized Nash equilibria in which each player's decision reflects both its hierarchy of preferences and other players' actions. We make two key contributions. First, we develop a recursive approach for deriving the first-order optimality conditions of each player's nested problem. Second, we propose a sequence of increasingly tight relaxations, each of which can be transcribed as a mixed complementarity problem and solved via existing methods. Experimental results demonstrate that our approach reliably converges to equilibrium solutions that strictly reflect players' individual ordered preferences.
|
| |
| 15:00-16:30, Paper ThI2I.5 | Add to My Program |
| FlightBench: Benchmarking Learning-Based Methods for Ego-Vision-Based Quadrotors Navigation |
|
| Yu, Shu'ang | Tsinghua University |
| Yu, Chao | Tsinghua University |
| Gao, Feng | Tsinghua University |
| Wu, Yi | Tsinghua University |
| Wang, Yu | Tsinghua University |
Keywords: Software Tools for Benchmarking and Reproducibility, Deep Learning Methods, Vision-Based Navigation
Abstract: Ego-vision-based navigation in cluttered environments is crucial for mobile systems, particularly agile quadrotors. While learning-based methods have shown promise recently, head-to-head comparisons with cutting-edge optimization-based approaches are scarce, leaving open the question of where and to what extent they truly excel. In this paper, we introduce FlightBench, the first comprehensive benchmark that implements various learning-based methods for ego-vision-based navigation and evaluates them against mainstream optimization-based baselines using a broad set of performance metrics. More importantly, we develop a suite of criteria to assess scenario difficulty and design test cases that span different levels of difficulty based on these criteria. Our results show that while learning-based methods excel in high-speed flight and faster inference, they struggle with challenging scenarios like sharp corners or view occlusion. Analytical experiments validate the correlation between our difficulty criteria and flight performance. Moreover, we verify the trend in flight performance within real-world environments through full-pipeline and hardware-in-the-loop experiments. We hope this benchmark and these criteria will drive future advancements in learning-based navigation for ego-vision quadrotors. Code and documentation are available at https://github.com/thu-uav/FlightBench.
|
| |
| 15:00-16:30, Paper ThI2I.6 | Add to My Program |
| Learning-Based Kinematic Modeling for Concentric Tube Robot: Addressing Its Nonlinearity and Snapping Behavior |
|
| Jeong, Gowoon | Chonnam National University |
| Ko, Seong Young | Chonnam National University |
Keywords: Machine Learning for Robot Control, Surgical Robotics: Steerable Catheters/Needles, Kinematics
Abstract: The Concentric Tube Robot (CTR) has great promise for minimally invasive surgery. However, accurately modeling nonlinear and history-dependent behaviors remains a significant challenge. This paper proposes a learning-based forward and inverse kinematics model that accounts for the history dependence and nonlinearities of CTR, including the snapping behavior. A lightweight LSTM-MLP hybrid neural network with an input buffer and directional parameters was used to train forward and inverse kinematics models for a 4-degree-of-freedom (DOF) CTR. The model was validated by comparing its predictions with actual values and results from a conventional torsional-compliant model (TCM) across random points, rotational trajectories, and arbitrary paths. This validation successfully demonstrated the model’s ability to capture snapping behavior. For forward kinematics, the model achieved a Root Mean Square Error (RMSE) of 0.69 mm and 0.16° with a computation time of 0.831±0.200 ms. The inverse kinematics model achieved an RMSE of 1.22 mm and 2.46° with a computation time of 0.816±0.200 ms. The proposed method improves the accuracy and speed of kinematic modeling by capturing nonlinear behaviors, such as snapping and hysteresis. The lightweight system ensures accurate real-time control and offers a safer and more reliable solution for microsurgical applications.
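A minimal LSTM-MLP hybrid for history-dependent forward kinematics might look as follows in PyTorch. The layer sizes, the input-buffer length, and the 6-D pose output are assumptions of ours, not the paper's architecture:

```python
import torch
import torch.nn as nn

class LSTMMLPKinematics(nn.Module):
    """An LSTM consumes a buffer of recent joint commands (capturing
    hysteresis and snapping history), and an MLP maps the final hidden
    state to the tip pose."""
    def __init__(self, n_joints=4, hidden=64, pose_dim=6):
        super().__init__()
        self.lstm = nn.LSTM(n_joints, hidden, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, pose_dim))

    def forward(self, joint_history):           # (B, T, n_joints)
        out, _ = self.lstm(joint_history)
        return self.mlp(out[:, -1])             # pose from last time step

# Example: a batch of 8 command histories, 20 time steps each.
model = LSTMMLPKinematics()
print(model(torch.randn(8, 20, 4)).shape)       # torch.Size([8, 6])
```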
|
| |
| 15:00-16:30, Paper ThI2I.7 | Add to My Program |
| Task-Driven Co-Design of Mobile Manipulators |
|
| Schneider, Raphael | University of Freiburg |
| Honerkamp, Daniel | Albert Ludwigs Universität Freiburg |
| Welschehold, Tim | Albert-Ludwigs-Universität Freiburg |
| Valada, Abhinav | University of Freiburg |
Keywords: Mobile Manipulation, Mechanism Design, Methods and Tools for Robot System Design
Abstract: Recent interest in mobile manipulation has resulted in a wide range of new robot designs. A large family of these designs focuses on modular platforms that combine existing mobile bases with static manipulator arms. They combine these modules by mounting the arm in a tabletop configuration. However, the operating workspaces and heights for common mobile manipulation tasks, such as opening articulated objects, significantly differ from tabletop manipulation tasks. As a result, these standard arm mounting configurations can result in kinematics with restricted joint ranges and motions. To address these problems, we present the first Concurrent Design approach for mobile manipulators to optimize key arm-mounting parameters. Our approach directly targets task performance across representative household tasks by training a powerful multitask-capable reinforcement learning policy in an inner loop while optimizing over a distribution of design configurations guided by Bayesian Optimization and HyperBand (BOHB) in an outer loop. This results in novel designs that significantly improve performance across both seen and unseen test tasks, and outperform designs generated by heuristic-based performance indices that are cheaper to evaluate but only weakly correlated with the motions of interest. We evaluate the physical feasibility of the resulting designs and show that they are practical and remain modular, affordable, and compatible with existing commercial components. We open-source the approach and generated designs to facilitate further improvements of these platforms.
|
| |
| 15:00-16:30, Paper ThI2I.8 | Add to My Program |
| Estimation of Slip Ratio and Side Slip Angle of Wheeled Planetary Rovers Based on Trace Imprint |
|
| Li, Nan | Harbin Institute of Technology |
| Guo, Junlong | Harbin Institute of Technology (Weihai) |
| Ding, Liang | Harbin Institute of Technology |
| Tian, Chenghua | Beijing Research Institute of Automation for Machinery Industry Co., Ltd. |
| Zhou, Chuan | Minzu University of China |
| Gao, Haibo | Harbin Institute of Technology |
|
|
| |
| 15:00-16:30, Paper ThI2I.9 | Add to My Program |
| OmniNet: Omnidirectional Jumping Neural Network with Height-Awareness for Quadrupedal Robots |
|
| Han, Yimin | The University of Hong Kong |
| Zhang, Jiahui | The University of Hong Kong |
| Luo, Zeren | The University of Hong Kong |
| Dong, Yinzhao | The University of Hong Kong |
| Lin, Jinghan | The University of Hong Kong |
| Zhao, Liu | The University of Hong Kong |
| Dong, Shihao | The University of Hong Kong |
| Lu, Peng | The University of Hong Kong |
Keywords: Reinforcement Learning, Machine Learning for Robot Control
Abstract: In the robotics community, it has been a longstanding challenge for quadrupeds to achieve highly explosive movements similar to their biological counterparts. In this work, we introduce a novel training framework that achieves height-aware and omnidirectional jumping for quadrupedal robots. To facilitate precise tracking of the user-specified jumping height, our pipeline concurrently trains an estimator that infers the robot and its end-effector states in an online fashion. Besides, a novel reward term is introduced by solving the analytical inverse kinematics with pre-defined end-effector positions. Guided by this term, the robot is empowered to regulate its gestures during the aerial phase. In the comparative studies, we verify that this controller can not only achieve the longest relative forward jump distance, but also exhibit the most comprehensive jumping capabilities among all the existing jumping controllers. A video summarizing the methodology and the validation in both simulation and real hardware is submitted along with this paper.
|
| |
| 15:00-16:30, Paper ThI2I.10 | Add to My Program |
| Tightly Coupled SLAM with Imprecise Architectural Plans |
|
| Shaheer, Muhammad | University of Luxembourg |
| Millan Romera, Jose Andres | University of Luxembourg |
| Bavle, Hriday | University of Luxembourg |
| Giberna, Marco | University of Luxembourg |
| Sanchez-Lopez, Jose Luis | University of Luxembourg |
| Civera, Javier | Universidad De Zaragoza |
| Voos, Holger | University of Luxembourg |
Keywords: SLAM, Localization, Robotics and Automation in Construction
Abstract: Robots navigating indoor environments often have access to architectural plans, which can serve as prior knowledge to enhance their localization and mapping capabilities. While some SLAM algorithms leverage these plans for global localization in real-world environments, they typically overlook a critical challenge: the “as-planned” architectural designs frequently deviate from the “as-built” real-world environments. To address this gap, we present a novel algorithm that tightly couples LIDAR-based simultaneous localization and mapping with architectural plans under the presence of deviations. Our method utilizes a multi-layered semantic representation to not only localize the robot, but also to estimate global alignment and structural deviations between “as-planned” and “as-built” environments in real-time. To validate our approach, we performed experiments in simulated and real datasets demonstrating robustness to structural deviations up to 35 cm and 15°. On average, our method achieves 43% less localization error than baselines in simulated environments, while in real environments, the “as-built” 3D maps show 7% lower average alignment error.
|
| |
| 15:00-16:30, Paper ThI2I.11 | Add to My Program |
| Variations of Augmented Lagrangian for Robotic Multi-Contact Simulation |
|
| Lee, Jeongmin | Seoul National University |
| Lee, Minji | Seoul National University |
| Park, Sunkyung | Seoul National University |
| Yun, Jinhee | Seoul National University |
| Lee, Dongjun | Seoul National University |
| |
| 15:00-16:30, Paper ThI2I.12 | Add to My Program |
| SVA: A Street-View-Aided GNSS Positioning Framework with 2DSDM and Likelihood Road for NLOS/Multipath Mitigation |
|
| Gu, Mingxiang | Shanghai Jiao Tong University |
| Li, Tao | Zhejiang University of Technology |
| Liu, Guoqing | Shanghai Jiao Tong University |
| Liu, Zongwei | Shanghai Jiao Tong University |
| Xiong, Chaoran | Shanghai Jiao Tong University |
| Wang, Chao | Shanghai Huace Navigation Technology Ltd |
| Duan, Rui | Shanghai Huace Navigation Technology Ltd |
| Meng, Fanchen | Beijing Institute of Aerospace Control Devices |
| Xiang, Yan | Shanghai Jiao Tong University |
| Pei, Ling | Shanghai Jiao Tong University |
| |
| 15:00-16:30, Paper ThI2I.13 | Add to My Program |
| A Survey on Soft Robot Adaptability: Implementations, Applications, and Prospects |
|
| Chen, Zixi | Scuola Superiore Sant'Anna |
| Wu, Di | University of Southern Denmark |
| Guan, Qinghua | EPFL |
| Hardman, David | University of Cambridge |
| Renda, Federico | Khalifa University of Science and Technology |
| Hughes, Josie | EPFL |
| George Thuruthel, Thomas | University College London |
| Della Santina, Cosimo | TU Delft |
| Mazzolai, Barbara | Istituto Italiano Di Tecnologia |
| Zhao, Huichan | Tsinghua University |
| Stefanini, Cesare | MBZUAI |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Modeling, Control, and Learning for Soft Robots
Abstract: Soft robots, compared to rigid robots, possess inherent advantages, including higher degrees of freedom, compliance, and enhanced safety, which have contributed to their increasing application across various fields. Among these benefits, adaptability is particularly noteworthy. In this paper, adaptability in soft robots is categorized into external and internal adaptability. External adaptability refers to the robot’s ability to adjust, either passively or actively, to variations in environments, object properties, geometries, and task dynamics. Internal adaptability refers to the robot’s ability to cope with internal variations, such as manufacturing tolerances or material aging, and to generalize control strategies across different robots. As the field of soft robotics continues to evolve, the significance of adaptability has become increasingly pronounced. In this review, we summarize various approaches to enhancing the adaptability of soft robots, including design, sensing, and control strategies. Additionally, we assess the impact of adaptability on applications such as surgery, wearable devices, locomotion, and manipulation. We also discuss the limitations of soft robotics adaptability and prospective directions for future research. By analyzing adaptability through the lenses of implementation, application, and challenges, this paper aims to provide a comprehensive understanding of this essential characteristic in soft robotics and its implications for diverse applications.
|
| |
| 15:00-16:30, Paper ThI2I.14 | Add to My Program |
| Agile Trajectory Planning and Large Obstacle Avoidance for Formation Flight Using a Virtual Core |
|
| Zhang, Jingsen | Xidian University |
| Hou, Biao | Xidian University |
| Huang, Rui | Xidian University |
Keywords: Collision Avoidance, Swarm Robotics, Motion and Path Planning
Abstract: Current methods for formation flight primarily focus on maintaining formations, often neglecting the swarm's agility. Furthermore, most of these approaches fail to leverage global information from the swarm for obstacle avoidance, making them incapable of generating efficient and safe trajectories in large obstacle scenarios. To address these limitations, this letter proposes a novel swarm trajectory planning framework that utilizes a virtual core to control the swarm. We employ virtual core penalties and dynamic maximum speed allocation to strike a balance between swarm flexibility and formation keeping, allowing the drones to avoid obstacles more smoothly and safely while maintaining formation stability. For large obstacle avoidance, we design a collaborative large obstacle boundary search strategy and a global swarm planning method to enable the rapid and safe generation of drone trajectories. To validate the performance of the proposed methods, we develop a comprehensive set of experimental scenarios that include both simulations and real-world environments. The experimental results confirm the effectiveness of our approach.
|
| |
| 15:00-16:30, Paper ThI2I.15 | Add to My Program |
| Bayesian NeRF: Quantifying Uncertainty with Volume Density for Neural Implicit Fields |
|
| Lee, Sibaek | Sungkyunkwan University (SKKU) |
| Kang, Kyeongsu | Sungkyunkwan University |
| Ha, Seongbo | Sungkyunkwan University |
| Yu, Hyeonwoo | SungKyunKwan University |
Keywords: Deep Learning for Visual Perception, Mapping, SLAM
Abstract: We present a Bayesian Neural Radiance Field (NeRF), which explicitly quantifies uncertainty in the volume density by modeling uncertainty in the occupancy, without the need for additional networks, making it particularly suited for challenging observations and uncontrolled image environments. NeRF diverges from traditional geometric methods by providing an enriched scene representation, rendering color and density in 3D space from various viewpoints. However, NeRF encounters limitations in addressing uncertainties solely through geometric structure information, leading to inaccuracies when interpreting scenes with insufficient real-world observations. While previous efforts have relied on auxiliary networks, we propose a series of formulation extensions to NeRF that manage uncertainties in density, both color and density, and occupancy, all without the need for additional networks. In experiments, we show that our method significantly enhances performance on RGB and depth images in the comprehensive dataset. Given that uncertainty modeling aligns well with the inherently uncertain environments of Simultaneous Localization and Mapping (SLAM), we applied our approach to SLAM systems and observed notable improvements in mapping and tracking performance. These results confirm the effectiveness of our Bayesian NeRF approach in quantifying uncertainty based on geometric structure, making it a robust solution for challenging real-world scenarios.
|
| |
| 15:00-16:30, Paper ThI2I.16 | Add to My Program |
| On the Computation of Sensitivity Tubes |
|
| Pupa, Andrea | University of Modena and Reggio Emilia |
| Belvedere, Tommaso | CNRS |
| Secchi, Cristian | Univ. of Modena & Reggio Emilia |
| Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Keywords: Planning under Uncertainty, Optimization and Optimal Control, Integrated Planning and Control
Abstract: Achieving robust and reliable control in robotic systems is crucial, especially in the presence of model uncertainties. In recent years, the notion of closed-loop sensitivity has emerged as an effective tool for analyzing and quantifying how uncertainties in the model parameters affect the closed-loop system behavior. In particular, several previous works have shown how the sensitivity matrices can be used to map uncertainty ellipsoids in the parameter space to the corresponding state/input ellipsoids (the so-called sensitivity tubes) that can be leveraged to robustify any system constraint. This paper extends these previous works based on an ellipsoidal modeling of the parametric uncertainty by proposing two new approaches that significantly improve the computation of the sensitivity tubes. The first method replaces the ellipsoids with hyperboxes for constructing the tubes: this solution avoids any approximation of the parameter set but yields a non-differentiable formulation of the resulting tube. The second method, instead, leverages superquadrics that can approximate a hyperbox with a tunable precision while retaining differentiability of the resulting tubes (as in the ellipsoid case). Both methods have been validated via a simulation campaign and compared with the previous approaches based on an ellipsoidal modeling of the uncertainties. The results confirm the effectiveness of the proposed techniques in producing state/input tubes that accurately envelop the perturbed system behavior, which is an important asset for providing a robustness layer to any online/offline trajectory generation algorithm.
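For readers unfamiliar with superquadrics, a minimal sketch of the level-set idea the second method relies on, assuming a symmetric even-exponent form (the paper's exact parametrization may differ):

```python
# Minimal sketch: a smooth, differentiable level set that interpolates
# between an ellipsoid (p = 1) and a hyperbox (p -> infinity).
import numpy as np

def superquadric_level(x, half_extents, p=4):
    """Inside-outside function: < 1 inside, = 1 on the surface, > 1 outside.
    p = 1 gives an ellipsoid; larger even p approaches the hyperbox with the
    given half-lengths while staying differentiable."""
    return np.sum((x / half_extents) ** (2 * p))

box = np.array([1.0, 2.0])
corner = np.array([0.99, 1.98])     # a point just inside the box corner
for p in (1, 8, 64):
    print(p, superquadric_level(corner, box, p))
# p = 1 (~1.96) and p = 8 (~1.70) misclassify the near-corner point as
# outside; p = 64 (~0.55) correctly reports inside: tunable precision.
```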
|
| |
| 15:00-16:30, Paper ThI2I.17 | Add to My Program |
| Navigating Narrow Spaces: A Comprehensive Framework for Agricultural Robots |
|
| Kulathunga, Geesara | University of Lincoln |
| Yilmaz, Abdurrahman | Istanbul Technical University |
| Huang, Zhuoling | University of Lincoln |
| Hroob, Ibrahim | University of Lincoln |
| Guevara, Leonardo | University of Lincoln |
| Singh, Jaspreet | University of Lincoln |
| Cielniak, Grzegorz | University of Lincoln |
| Hanheide, Marc | University of Lincoln |
Keywords: Robotics and Automation in Agriculture and Forestry, Nonholonomic Motion Planning, Integrated Planning and Control
Abstract: Navigating within narrow spaces is a fundamental challenge in robotics, requiring precise localisation, localisation error recovery, dynamic path planning, and adaptive control for effective manoeuvring. This paper presents a modular and perception-driven navigation framework designed for constrained environments, focusing primarily on agricultural applications. The proposed method integrates a multi-step point cloud processing pipeline for robust local perception, including pole detection, boundary line estimation, and trajectory refinement to ensure safe and precise traversal by refining initial trajectories based on detected environmental constraints and dynamically adapting to kinematic limitations. Experimental validation in a real strawberry polytunnel demonstrates superior trajectory accuracy and control stability compared to state-of-the-art navigators, achieving an average lateral deviation of 0.08 ± 0.01 m. The adaptive trajectory tracking and regulated pure pursuit control of the framework contribute to consistent navigation, even under increased velocity constraints, outperforming the resilient timed elastic band (RTEB) and model predictive path integral (MPPI) methods. This modular and generalisable framework offers significant potential for advancing autonomous navigation in narrow-space applications.
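As background, a minimal sketch of a regulated pure pursuit law of the kind the framework's tracking builds on; the gains and the specific regulation rule here are illustrative assumptions:

```python
# Sketch of regulated pure pursuit: standard pure-pursuit curvature plus a
# speed regulation that slows the robot on tight turns (e.g. in narrow rows).
import math

def regulated_pure_pursuit(lookahead_pt, v_nominal, curvature_gain=0.5, v_min=0.1):
    """lookahead_pt = (x, y) of the lookahead point in the robot frame."""
    x, y = lookahead_pt
    L2 = x * x + y * y                  # squared distance to lookahead point
    kappa = 2.0 * y / L2                # pure-pursuit curvature command
    # Regulation: reduce speed proportionally to curvature magnitude so
    # lateral deviation stays small in constrained corridors.
    v = max(v_min, v_nominal / (1.0 + curvature_gain * abs(kappa)))
    omega = v * kappa                   # angular velocity command
    return v, omega

print(regulated_pure_pursuit((1.5, 0.4), v_nominal=0.8))
```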
|
| |
| 15:00-16:30, Paper ThI2I.18 | Add to My Program |
| TactileAloha: Learning Bimanual Manipulation with Tactile Sensing |
|
| Gu, Ningquan | Tohoku University |
| Kosuge, Kazuhiro | The University of Hong Kong |
| Hayashibe, Mitsuhiro | Tohoku University |
Keywords: Imitation Learning, Bimanual Manipulation, Hardware-Software Integration in Robotics
Abstract: Tactile texture is vital for robotic manipulation but challenging for camera vision-based observation. To address this, we propose TactileAloha, an integrated tactile-vision robotic system built upon Aloha, with a tactile sensor mounted on the gripper to capture fine-grained texture information and support real-time visualization during teleoperation, facilitating efficient data collection and manipulation. Using data collected from our integrated system, we encode tactile signals with a pre-trained ResNet and fuse them with visual and proprioceptive features. The combined observations are processed by a transformer-based policy with action chunking to predict future actions. We use a weighted loss function during training to emphasize near-future actions, and employ an improved temporal aggregation scheme at deployment to enhance action precision. Experimentally, we introduce two bimanual tasks: zip tie insertion and Velcro fastening, both requiring tactile sensing to perceive object texture and to align the orientations of two objects with two hands. Our proposed method adaptively changes the generated manipulation sequence based on tactile sensing in a systematic manner. Results show that our system, leveraging tactile information, can handle texture-related tasks that camera vision-based methods fail to address. Moreover, our method achieves an average relative improvement of approximately 11.0% over the state-of-the-art method with tactile input, demonstrating its effectiveness.
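A hedged sketch of the time-weighted action-chunk loss described above, assuming an exponential decay over the chunk (the paper's exact weighting may differ):

```python
# Sketch: weight earlier (near-future) steps of each predicted action chunk
# more heavily than later ones. Decay schedule is an assumption.
import torch

def weighted_chunk_loss(pred, target, decay=0.9):
    """pred/target: (batch, chunk_len, action_dim) action chunks."""
    chunk_len = pred.shape[1]
    w = decay ** torch.arange(chunk_len, dtype=pred.dtype)  # heavier weight
    w = w / w.sum()                                          # on early steps
    per_step = ((pred - target) ** 2).mean(dim=(0, 2))       # L2 per time step
    return (w * per_step).sum()

pred = torch.randn(8, 50, 14)    # e.g. 50-step chunks of 14-DoF actions
target = torch.randn(8, 50, 14)
print(weighted_chunk_loss(pred, target))
```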
|
| |
| 15:00-16:30, Paper ThI2I.19 | Add to My Program |
| RAMBO: RL-Augmented Model-Based Whole-Body Control for Loco-Manipulation |
|
| Cheng, Jin | ETH Zurich |
| Kang, Dongho | Robotics and AI Institute |
| Fadini, Gabriele | ZHAW |
| Shi, Guanya | Carnegie Mellon University |
| Coros, Stelian | ETH Zurich |
Keywords: Legged Robots, Reinforcement Learning, Mobile Manipulation
Abstract: Loco-manipulation, the physical interaction with objects coordinated concurrently with locomotion, remains a major challenge for legged robots due to the need for both precise end-effector control and robustness to unmodeled dynamics. While model-based controllers provide precise planning via online optimization, they are limited by model inaccuracies. In contrast, learning-based methods offer robustness, but they struggle with precise modulation of interaction forces. We introduce RAMBO, a hybrid framework that integrates model-based whole-body control within a feedback policy trained with reinforcement learning. The model-based module generates feedforward torques by solving a quadratic program, while the policy provides feedback corrective terms to enhance robustness. We validate our framework on a quadruped robot across a diverse set of real-world loco-manipulation tasks, such as pushing a shopping cart, balancing a plate, and holding soft objects, in both quadrupedal and bipedal walking. Our experiments demonstrate that RAMBO enables precise manipulation capabilities while achieving robust and dynamic locomotion.
|
| |
| 15:00-16:30, Paper ThI2I.20 | Add to My Program |
| Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach |
|
| Tan, Aaron Hao | University of Toronto |
| Fung, Angus | University of Toronto |
| Wang, Haitong | University of Toronto |
| Nejat, Goldie | University of Toronto |
Keywords: AI-Enabled Robotics, Task and Motion Planning, Vision-Based Navigation
Abstract: Hand-drawn maps can be used to convey navigation instructions between humans and robots in a natural and efficient manner. However, these maps can often contain inaccuracies such as scale distortions and missing landmarks which present challenges for mobile robot navigation. This paper introduces a novel Hand-drawn Map Navigation (HAM-Nav) architecture that leverages pre-trained vision language models (VLMs) for robot navigation across diverse environments, hand-drawing styles, and robot embodiments, even in the presence of map inaccuracies. HAM-Nav integrates a unique Selective Visual Association Prompting approach for topological map-based position estimation and navigation planning as well as a Predictive Navigation Plan Parser to infer missing landmarks. Extensive experiments were conducted in photorealistic simulated environments, using both wheeled and legged robots, demonstrating the effectiveness of HAM-Nav in terms of navigation success rates and Success weighted by Path Length. Furthermore, a user study in real-world environments highlighted the practical utility of hand-drawn maps for robot navigation as well as successful navigation outcomes compared against a non-hand-drawn map approach.
|
| |
| 15:00-16:30, Paper ThI2I.21 | Add to My Program |
| MMD-OPT: Maximum Mean Discrepancy Based Sample Efficient Collision Risk Minimization for Autonomous Driving (I) |
|
| Sharma, Basant | University of Tartu |
| Singh, Arun Kumar | University of Tartu |
Keywords: Planning under Uncertainty, Autonomous Vehicle Navigation, Collision Avoidance
Abstract: We propose MMD-OPT: a sample-efficient approach for minimizing the risk of collision under arbitrary prediction distributions of the dynamic obstacles. MMD-OPT is based on embedding distributions in a Reproducing Kernel Hilbert Space (RKHS) and the associated Maximum Mean Discrepancy (MMD). We show how these two concepts can be used to define a sample-efficient surrogate for the collision risk estimate. We perform extensive simulations to validate the effectiveness of MMD-OPT on both synthetic and real-world datasets. Importantly, we show that trajectory optimization with our MMD-based collision risk surrogate leads to safer trajectories in low-sample regimes than popular alternatives based on Conditional Value at Risk (CVaR).
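For reference, the empirical squared MMD that underlies such a surrogate can be computed directly from samples; a minimal sketch with an RBF kernel (the bandwidth choice is an assumption):

```python
# Biased empirical MMD^2 between two sample sets with an RBF kernel, the
# discrepancy measure MMD-OPT builds on.
import numpy as np

def mmd_sq(X, Y, sigma=1.0):
    """X: (n, d) and Y: (m, d) samples from the two distributions."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd_sq(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd_sq(rng.normal(size=(200, 2)), rng.normal(2.0, 1.0, size=(200, 2)))
print(same, diff)   # MMD^2 is near zero only when the distributions match
```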
|
| |
| 15:00-16:30, Paper ThI2I.22 | Add to My Program |
| Human-Like Robot Action Policy through Game-Based Empathetic Inference for Human-Robot Collaboration |
|
| Sheng, Yubo | Huazhong University of Science and Technology |
| Wang, Yiwei | Huazhong University of Science and Technology |
| Cheng, Haoyuan | Huazhong University of Science and Technology |
| Zhao, Huan | Huazhong University of Science and Technology |
| Ding, Han | Huazhong University of Science and Technology |
Keywords: Physical Human-Robot Interaction, Cooperating Robots, Cognitive Human-Robot Interaction, Human Factors and Human-in-the-Loop
Abstract: Harmonious human-robot collaboration requires the robot to behave like a human partner, which raises the critical question of what factors make the robot do so. This paper proposes a series of policies based on empathetic and non-empathetic intent inference, proactive and reactive action planning, and ego and non-ego action styles to examine which modules enable robots to exhibit human-like behaviors. Two series of experiments are conducted with human subjects to test the performance of the proposed controllers. In Experiment 1, the participant must identify whether the collaborating partner is a human, similar to a Turing test. The classification results empirically verify that the designed empathetic proactive policies enable the robot to exhibit human-like behaviors. Experiment 2 indicates that the proposed policy can be applied to complex collaborative tasks, and this result is consistent with the findings of Experiment 1. Based on the empirical evidence from both experiments, we believe that empathetic and proactive policies are essential elements for enabling robots to perform human-like actions.
|
| |
| 15:00-16:30, Paper ThI2I.23 | Add to My Program |
| MinNav: Minimalist Navigation for Active Tiny Aerial Robots |
|
| Patil, Aniket | Worcester Polytechnic Institute |
| Singh, Mandeep | Worcester Polytechnic Institute |
| Maradana, Uday Girish | Worcester Polytechnic Institute |
| J Sanket, Nitin | Worcester Polytechnic Institute |
Keywords: Aerial Systems: Perception and Autonomy, Perception-Action Coupling, Vision-Based Navigation
Abstract: Navigation using a monocular camera is pivotal for autonomous operation on tiny aerial robots due to the camera's favorable balance of versatility, cost, and accuracy. In this paper, we introduce MinNav, a navigation stack based on optical flow and its uncertainty to fly through a scene with static and dynamic obstacles and unknown-shaped gaps without any prior knowledge of the scene components and/or their locations/ordering. We further improve the success rate by using the activeness of the robot, moving around in an exploratory way to find obstacles and navigate. We successfully evaluate and demonstrate the proposed approach in many real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps, with an overall success rate of 70%. To the best of our knowledge, this is the first solution to tackle all the aforementioned navigation cases without prior knowledge using a monocular camera. Our approach is on par with depth-based methods while requiring orders of magnitude less computation, and it can readily run onboard tiny aerial robots.
|
| |
| 15:00-16:30, Paper ThI2I.24 | Add to My Program |
| Robust Sensitivity-Aware Chance-Constrained MPC for Efficient Handling of Multiple Uncertainty Sources |
|
| Zhu, James | LAAS-CNRS |
| Simeon, Thierry | LAAS-CNRS |
| Cognetti, Marco | LAAS-CNRS and Université De Toulouse |
Keywords: Planning under Uncertainty, Robust/Adaptive Control, Optimization and Optimal Control
Abstract: Robust motion planning under uncertainty is a critical challenge for applications involving real-world robotic deployments. This paper introduces SupeR-MPC, a computationally-efficient, sensitivity-aware, chance-constrained optimization framework that systematically accounts for multiple sources of uncertainty, including state estimation error, model parameter uncertainty, obstacle localization error, and process noise. This approach advances sensitivity-aware robust control by integrating chance-constrained optimization to handle the uncertainty models of Kalman-filtering methods. To demonstrate robustness against multiple uncertainty sources, SupeR-MPC was validated on a range of systems and environments, from a simple 2D example to a multi-agent dynamic obstacle avoidance scenario. Comparisons against existing MPC methods show that SupeR-MPC significantly improves constraint satisfaction and robustness while maintaining real-time computational efficiency. These results highlight the effectiveness of sensitivity-aware chance constraints in enhancing real-world robotic decision-making under uncertainty.
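As background, the textbook Gaussian chance-constraint tightening that frameworks of this kind typically build on, sketched for a single half-space constraint (the paper's exact formulation may differ):

```python
# Standard Gaussian chance-constraint back-off: enforce P(a^T x <= b) >= 1-eps
# for x ~ N(mu, Sigma) by tightening the deterministic constraint.
import numpy as np
from scipy.stats import norm

def tightened_halfspace(a, b, mu, Sigma, eps=0.05):
    """Returns True iff the tightened constraint holds for the mean state."""
    backoff = norm.ppf(1.0 - eps) * np.sqrt(a @ Sigma @ a)
    return a @ mu + backoff <= b

a = np.array([1.0, 0.0])              # keep the x-position below b
Sigma = np.diag([0.04, 0.04])         # state-estimate covariance
print(tightened_halfspace(a, 2.0, np.array([1.5, 0.0]), Sigma))   # True
print(tightened_halfspace(a, 2.0, np.array([1.9, 0.0]), Sigma))   # False
```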
|
| |
| 15:00-16:30, Paper ThI2I.25 | Add to My Program |
| CMG3D: Compensation towards Modality Gap for Open-Vocabulary Indoor 3D Object Detection |
|
| Zhang, Sheng | BOE Technology Group Co., Ltd., Beijing, China |
| Huai, Lian | BoE |
| Liu, Yuyu | BoE |
| Jiang, Xingqun | BoE |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Computer Vision for Automation
Abstract: Open-vocabulary indoor three-dimensional object detection (OVI3DOD) is used to detect any class of objects in indoor scenes with prompts. Owing to the relatively limited three-dimensional (3D) data, most OVI3DOD algorithms perform training with pseudo labels transformed from open-vocabulary 2D detection results. For indoor scenes, point clouds are sparse and incomplete. Moreover, there is a gap between different modalities, especially for distant objects. However, existing OVI3DOD algorithms ignore this problem, which weakens the detection performance. Therefore, we propose the Compensation towards Modality Gap for open-vocabulary indoor 3D object detection (CMG3D) approach. CMG3D consists of three modules: multimodal compensation (MC), object proposal filtering (OPF) and pseudo label refinement and generation (PLRG). For the MC, features from images are converted into pseudo voxel space and then summed with the voxel space of the point cloud, which is used to compensate for the modality gap. For the OPF, we filter the object proposals to avoid confusion between the foreground and background. For the PLRG, the predictions from the two-dimensional (2D) detector are refined by the multimodal large language model (LLM) SIGLIP and then transformed into 3D pseudo labels for the training process. Finally, we evaluate CMG3D on two indoor datasets, SUN RGB-D and ScanNet, and achieve state-of-the-art results.
|
| |
| 15:00-16:30, Paper ThI2I.26 | Add to My Program |
| High-Speed Scooping through Dynamic Manipulation: Model and Practice |
|
| Cha, Hyeonje | Pusan National University |
| Lee, Inho | Pusan National University |
| Seo, Jungwon | Pusan National University |
Keywords: In-Hand Manipulation, Grasping, Dexterous Manipulation
Abstract: This study introduces a robotic high-speed scooping technique, an effective solution for rapidly picking thin objects from a hard support surface. High-speed scooping involves dynamic and impactful manipulation using a two-fingered gripper. One digit dynamically penetrates beneath the object lying on a support surface while the other digit helps form a cage and subsequently secures a firm grip. This entire process is executed within a fractional-second time frame. We develop a theoretical model of manipulation for high-speed scooping and implement it using our custom direct-drive gripper designed for enhanced environment-adaptability. Extensive experiments verify the viability of our high-speed scooping approach.
|
| |
| 15:00-16:30, Paper ThI2I.27 | Add to My Program |
| RealTraj: Towards Real-World Pedestrian Trajectory Forecasting |
|
| Fujii, Ryo | Keio University |
| Saito, Hideo | Keio University |
| Hachiuma, Ryo | NVIDIA |
Keywords: Human-Aware Motion Planning, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: This paper jointly addresses three key limitations in conventional pedestrian trajectory forecasting: pedestrian perception errors, real-world data collection costs, and person ID annotation costs. We propose a novel framework, RealTraj, that enhances the real-world applicability of trajectory forecasting. Our approach includes two training phases--self-supervised pretraining on synthetic data and weakly-supervised fine-tuning with limited real-world data--to minimize data collection efforts. To improve robustness to real-world errors, we focus on both model design and training objectives. Specifically, we present Det2TrajFormer, a trajectory forecasting model that remains invariant to tracking noise by using past detections as inputs. Additionally, we pretrain the model using multiple pretext tasks, which enhance robustness and improve forecasting performance based solely on detection data. Unlike previous trajectory forecasting methods, our approach fine-tunes the model using only ground-truth detections, reducing the need for costly person ID annotations. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art trajectory forecasting methods on multiple datasets.
|
| |
| 15:00-16:30, Paper ThI2I.28 | Add to My Program |
| Time-Optimal Trajectory Planning and Model Predictive Control of Morphing Quadrotors |
|
| Wang, Qiuyu | Dalian Maritime University |
| Zhao, Na | Dalian Maritime University |
| Qin, Chaojun | Dalian Maritime University |
| Ke, Xiyu | Rutgers, the State University of New Jersey |
| Luo, Yudong | Dalian Maritime University |
| Shen, Yantao | University of Nevada, Reno |
Keywords: Autonomous Vehicle Navigation, Process Control
Abstract: Morphing quadrotors offer enhanced maneuverability and adaptability in confined spaces, while their structural variations pose challenges to trajectory planning and control. This paper presents a time-optimal trajectory planning and model predictive control framework for the morphing quadrotor. The trajectory generator computes time-optimal trajectories by dynamically adjusting arm lengths, allowing the quadrotor to traverse waypoints as quickly as possible while satisfying constraints. The generated trajectory is then fed into the designed dual-loop model predictive control architecture to achieve autonomous flight, in which the outer loop tracks the desired trajectory and the inner loop synchronously regulates the attitude and arm length of the morphing quadrotor. Experimental validation demonstrates that the proposed framework achieves high-precision trajectory tracking, robust dynamic response, and superior adaptability in confined environments.
|
| |
| 15:00-16:30, Paper ThI2I.29 | Add to My Program |
| Rapid and Hierarchical UAV Exploration Via Adaptive Regional Viewpoint Generation |
|
| Bu, Ningbo | Ningbo Institute of Materials Technology & Engineering, Chinese Academy of Sciences |
| Xu, Gen | Ningbo Institute of Materials Technology & Engineering, Chinese Academy of Sciences |
| Zheng, Hao | Ningbo Institute of Materials Technology & Engineering |
| Wei, Xuehang | Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences |
| Chen, Wenshi | Ningbo Institute of Materials Technology & Engineering, Chinese Academy of Sciences |
| Zhang, Xiaolu | Ningbo Institute of Materials Technology & Engineering, Chinese Academy of Sciences |
| Lv, Li | Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences |
| Xiao, Jiangjian | Ningbo Institute of Industrial Technology, Chinese Academy of Science |
| Li, Zhiqiang | Ningbo Institute of Materials Technology & Engineering, Chinese Academy of Sciences |
|
|
| |
| 15:00-16:30, Paper ThI2I.30 | Add to My Program |
| VINGS-Mono: Visual-Inertial Gaussian Splatting Monocular SLAM in Large Scenes |
|
| Wu, Ke | Fudan University |
| Zhang, Zicheng | Fudan University |
| Tie, Muer | Fudan University |
| Ai, Ziqing | Fudan University |
| Gan, Zhongxue | Fudan University |
| Ding, Wenchao | Fudan University |
Keywords: SLAM, Sensor Fusion, Mapping, Gaussian Splatting
Abstract: VINGS-Mono is a monocular inertial Gaussian Splatting (GS) SLAM framework designed for large-scale scenes. It integrates four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. The VIO Front End processes RGB frames with dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. The mapping module incrementally builds a 2D Gaussian map with up to 50 million Gaussian ellipsoids. Key components like a Sample-based Rasterizer, Score Manager, and Pose Refinement enhance mapping efficiency and localization accuracy for large-scale urban environments. To ensure global consistency, the NVS Loop Closure uses Novel View Synthesis for loop detection and map correction, while the Dynamic Eraser addresses dynamic objects in outdoor scenes. Evaluations demonstrate localization performance comparable to Visual-Inertial Odometry, with mapping and rendering quality surpassing GS/NeRF SLAM methods. A mobile app further verifies real-time capability, generating high-quality Gaussian maps using a smartphone camera and low-frequency IMU. VINGS-Mono is the first monocular Gaussian SLAM framework for outdoor, kilometer-scale scenes.
|
| |
| 15:00-16:30, Paper ThI2I.31 | Add to My Program |
| Dual Arm Steering of Flexible Linear Objects in 2-D and 3-D Environments Using Euler's Elastica Solutions |
|
| Levin, Aharon | Technion - Israel Institute of Technology |
| Grinberg, Itay | Technion - Israel Institute of Technology |
| Rimon, Elon | Technion - Israel Institute of Technology |
| Shapiro, Amir | Ben Gurion University of the Negev |
Keywords: Dual Arm Manipulation, Optimization and Optimal Control, Soft Robot Applications
Abstract: This paper describes a method for steering flexible linear objects using two robot hands in environments populated by sparsely spaced obstacles. The approach involves manipulating an elastic inextensible rod by varying the gripping endpoint positions and tangents. Closed-form solutions that describe the flexible linear object's shape in planar environments, Euler's elastica, are described. The paper uses the elastica solutions to formulate criteria for non-self-intersection, stability and obstacle avoidance in an analytic, closed-form manner. These criteria are formulated as constraints in the flexible object's six-dimensional configuration space that represents the robot gripping endpoint positions and tangents. In particular, this paper introduces a novel criterion that ensures the flexible object's stability during steering. All safety criteria are integrated into a scheme for steering flexible linear objects in planar environments, which is lifted into a steering scheme in three-dimensional environments populated by sparsely spaced obstacles. Experiments with a dual-arm robot demonstrate the method.
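For reference, the planar elastica underlying these closed-form solutions satisfies a pendulum-type equation in the tangent angle along arc length, stated here in one common normalization (an assumption; the paper's parametrization may differ):

```latex
\frac{d^{2}\theta}{ds^{2}} = -\lambda \sin\bigl(\theta(s) - \phi\bigr),
\qquad \kappa(s) = \frac{d\theta}{ds},
\qquad \tfrac{1}{2}\,\kappa(s)^{2} = \lambda \cos\bigl(\theta(s) - \phi\bigr) + c
```

The first integral admits solutions in Jacobi elliptic functions, and the gripping endpoint positions and tangents imposed by the two hands enter as boundary conditions.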
|
| |
| 15:00-16:30, Paper ThI2I.32 | Add to My Program |
| Brain-Inspired Visual Topometric Localization Via Roadnetwork-Constraint Hidden Markov Model |
|
| Li, Jinyu | Beijing Normal University |
| Zeng, Taiping | Fudan University |
| Si, Bailu | Beijing Normal University |
Keywords: Biologically-Inspired Robots, Localization, Neurorobotics
Abstract: Accurate localization in GPS-denied environments remains a critical challenge for autonomous robot navigation. Animals exhibit remarkable navigational abilities in complex, dynamic environments by relying on mental cognitive maps. Inspired by neural representations such as head direction cells and grid cells, numerous robotic cognitive mapping systems can efficiently cover large areas; however, they often lack the precise metric information required for accurate localization. To address this challenge, we propose a neurodynamically driven monocular visual topometric localization approach based on road network constraints. We introduce the Roadnetwork-Constraint Hidden Markov Model (RC-HMM) to enhance semi-metric maps by incorporating road network constraints, forming a coherent topometric map that maintains vertex relationships and improves localization accuracy. Experimental results in the CARLA Town07 environment demonstrate the remarkable efficiency of our topometric cognitive map. Compared to semi-metric maps, our approach achieves a 95% reduction in Absolute Pose Error (APE) and an 81% reduction in Relative Pose Error (RPE). Compared to binocular ORB-SLAM3, our monocular approach reduces CPU usage by 96.7% and map storage by 77.7%, with an APE of 3.6 m and RPE of 1.4 m—closely matching ORB-SLAM3’s 3.86 m APE and 0.96 m RPE. Furthermore, by leveraging neurodynamics of grid cells and head direction cells, our monocular topometric localization robustly delivers a localization accuracy of 3.86 meters, comparable to binocular ORB-SLAM3. This approach integrates road network metrics into topological maps, enhancing brain-inspired navigation with topometric maps in complex environments.
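To make the road-network constraint concrete, a minimal hypothetical sketch of a Viterbi decode whose transitions are masked by road-graph adjacency; the RC-HMM itself is richer than this:

```python
# Viterbi decoding restricted to a road network: transitions outside the
# adjacency get log-probability -inf (transition weights are uniform here
# for simplicity, an assumption).
import numpy as np

def road_constrained_viterbi(log_emit, adjacency):
    """log_emit: (T, N) observation log-likelihoods at N road vertices;
    adjacency: (N, N) boolean road connectivity (including self-loops)."""
    T, N = log_emit.shape
    log_trans = np.where(adjacency, 0.0, -np.inf)
    score = log_emit[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # (from_state, to_state)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=bool)  # 3-vertex chain
emit = np.log(np.array([[0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]]))
print(road_constrained_viterbi(emit, adj))   # [0, 1, 2]; never skips 0 -> 2
```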
|
| |
| 15:00-16:30, Paper ThI2I.33 | Add to My Program |
| SM-NMPC: Sliding Mode-Based Nonlinear Model Predictive Control for UAVs under Degraded Motor on Microcontrollers |
|
| Nguyen, Van Chung | University of Nevada, Reno |
| Nguyen, An | University of Nevada, Reno |
| Walunj, Pratik | University of Nevada Reno |
| Le, Chuong | University of Nevada, Reno |
| La, Hung | University of Nevada at Reno |
Keywords: Embedded Systems for Robotic and Automation, Optimization and Optimal Control, Aerial Systems: Mechanics and Control
Abstract: This paper presents a novel Sliding Mode-Based Nonlinear Model Predictive Control (SM-NMPC) for controlling Unmanned Aerial Vehicles (UAVs) such as quadrotors and a 10-propeller drone (Cube-Drone). The proposed method combines Aggregated Hierarchical Sliding Mode Control (AHSMC) strategies with Nonlinear Model Predictive Control (NMPC), designed to operate on resource-constrained microcontrollers. First, an AHSMC that provides a virtual input reference is introduced to ensure the UAV's robustness; this reference is then leveraged by the NMPC to solve the optimization problem. A comprehensive comparison to existing approaches in terms of stability and computational efficiency demonstrates that the SM-NMPC framework excels, enabling quadrotor UAVs to accurately track reference trajectories even in the presence of a degraded motor. The proposed method also showcases the capability to implement robust optimal control on a microcontroller. Extensive experiments, both on real UAVs and their physical models in Gazebo/ROS2, are conducted to validate the effectiveness of the approach. A comparison to other state-of-the-art controllers further highlights the feasibility and superior performance of the proposed methodology. The open-source code has also been made available for further investigation.
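As background, the generic construction behind aggregated hierarchical sliding-mode designs, from which a virtual input reference is typically derived; the surfaces, aggregation weights, and reaching law below are standard textbook forms, not necessarily the paper's exact choices:

```latex
s_i = \dot{e}_i + c_i e_i, \qquad
S = \sum_i \lambda_i s_i, \qquad
\dot{S} = -k\,\operatorname{sgn}(S) - \eta S
```

Here \(e_i\) are subsystem tracking errors and \(c_i, \lambda_i, k, \eta > 0\) are design gains; solving the reaching law for the control input yields the virtual reference that the NMPC stage then refines.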
|
| |
| 15:00-16:30, Paper ThI2I.34 | Add to My Program |
| HRI-DGDM: Dual-Graph Guided Diffusion Model for Uncertain Human Motion Modeling in HRI |
|
| Gui, Hongquan | The Hong Kong Polytechnic University |
| Li, Ming | The Hong Kong Polytechnic University |
Keywords: Human Factors and Human-in-the-Loop, Human-Robot Collaboration, Human-Centered Robotics
Abstract: Human motion in human-robot interaction (HRI) is inherently uncertain, even when performing the same task repeatedly. This variability poses a significant challenge for prediction, as models must capture a distribution of plausible futures rather than a single deterministic trajectory. Traditional graph convolutional network-based models, while effective at capturing spatial temporal dependencies, are fundamentally limited by their deterministic nature and struggle to represent this inherent motion uncertainty. To address this, diffusion models have emerged as a powerful framework for modeling uncertainty. However, their direct application to HRI is hindered by two key limitations: they often prioritize motion diversity over prediction accuracy, potentially generating physically implausible results, and they fail to adequately model the complex, multi-scale spatial temporal coupling between human and robot motions. To overcome these challenges, we propose HRI-DGDM, an HRI motion prediction framework based on a dual-graph guided diffusion model. Our method introduces a dual-graph structure, comprising a structural graph for kinematic priors and a collaboration graph learned from motion dynamics, to guide the denoising process with strong structural priors. A dedicated spatial temporal denoising network (STDN) fuses multi-scale features from both graphs through adaptive fusion and hierarchical spatial temporal modeling. Furthermore, a masking-based conditioning mechanism anchors the observed history during denoising, ensuring temporal consistency and preventing drift. Experiments on HRI scenarios demonstrate that HRI-DGDM outperforms baselines in prediction accuracy.
|
| |
| 15:00-16:30, Paper ThI2I.35 | Add to My Program |
| Aleatoric Uncertainty from AI-Based 6D Object Pose Predictors for Object-Relative State Estimation |
|
| Jantos, Thomas | University of Klagenfurt |
| Weiss, Stephan | University of Klagenfurt |
| Steinbrener, Jan | University of Klagenfurt |
Keywords: AI-Based Methods, Deep Learning Methods, Sensor Fusion
Abstract: Deep Learning (DL) has become essential in various robotics applications due to excelling at processing raw sensory data to extract task-specific information from semantic objects. For example, vision-based object-relative navigation relies on a DL-based 6D object pose predictor to provide the relative pose between the object and the robot as measurements to the robot's state estimator. Accurately knowing the uncertainty inherent in such Deep Neural Network (DNN) based measurements is essential for probabilistic state estimators subsequently guiding the robot's tasks. Thus, in this letter, we show that we can extend any existing DL-based object-relative pose predictor for aleatoric uncertainty inference simply by including two multi-layer perceptrons detached from the translational and rotational part of the DL predictor. This allows for efficient training while freezing the existing pre-trained predictor. We then use the inferred 6D pose and its uncertainty as a measurement and corresponding noise covariance matrix in an extended Kalman filter (EKF). Our approach induces minimal computational overhead such that the state estimator can be deployed on edge devices while benefiting from the dynamically inferred measurement uncertainty. This increases the performance of the object-relative state estimation task compared to a fixed-covariance approach. We conduct evaluations on synthetic data and real-world data to underline the benefits of aleatoric uncertainty inference for the object-relative state estimation task.
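A hedged pseudo-PyTorch sketch of the recipe: two small MLP heads on the frozen predictor's features infer log-variances, which become the EKF measurement noise covariance. Layer sizes and the feature interface are assumptions:

```python
# Two detached MLP heads predicting per-axis log-variances for the
# translational and rotational parts of the pose measurement.
import torch
import torch.nn as nn

class AleatoricHeads(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))
        self.trans_logvar = mlp(3)   # log-variance of the translation
        self.rot_logvar = mlp(3)     # log-variance of the rotation

    def forward(self, feats):
        return self.trans_logvar(feats), self.rot_logvar(feats)

feats = torch.randn(1, 256)                    # frozen-backbone features
t_lv, r_lv = AleatoricHeads()(feats)
R = torch.diag(torch.cat([t_lv, r_lv], dim=1).exp().squeeze(0))
print(R.shape)   # (6, 6) measurement covariance fed to the EKF update
```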
|
| |
| 15:00-16:30, Paper ThI2I.36 | Add to My Program |
| Gaze-Based Teleoperation with Intent Inference Model for Robotic Manipulators |
|
| Yuan, Yanjia | Southeast University |
| Peng, Chong | Southeast University |
| Chu, Dihui | Southeast University |
| Wang, Qianqian | Southeast University |
| Gao, Qiang | Southeast University |
| Tang, Yunlong | Monash University |
| Wang, Xiaoyu | Southeast University |
Keywords: Human-Centered Robotics, Intention Recognition, Human-Robot Collaboration
Abstract: Eye gaze-based control interfaces provide a non-invasive means of enhancing human-robot collaboration for activities of daily living and can reduce the cognitive burden on operators performing complex tasks. Eye gaze has traditionally been used for "gaze triggering," where fixating on an object activates pre-programmed robotic movements. In this work, we propose a gaze-based robotic teleoperation approach that utilizes real-time gaze data to guide the freeform movement of robotic manipulators. The proposed approach incorporates a Gaussian Mixture Regression (GMR)-based intent inference model to capture the nonlinear relationship between gaze data and the operator's intended robotic movements. For benchmarking, we further implemented a Gaussian Hidden Markov Model (G-HMM) to provide a comparable probabilistic framework for intent inference. Experimental results demonstrate that the GMR-based approach achieves a statistically significant improvement over G-HMM in terms of control efficiency and trajectory smoothness against involuntary eye fluctuations, and that it enhances the user's sense of involvement and control.
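For reference, a minimal Gaussian Mixture Regression sketch: conditioning a joint GMM over (gaze, motion) on a gaze observation yields the expected motion command. Dimensions and parameters below are illustrative, not the paper's model:

```python
# GMR: condition a joint Gaussian mixture p(gaze, motion) on a gaze input x
# to obtain the expected motion E[motion | gaze = x].
import numpy as np

def gmr(x, weights, means, covs, dx):
    """means: (K, dx+dy); covs: (K, dx+dy, dx+dy); x: (dx,) gaze input."""
    K = len(weights)
    resp, cond_means = np.zeros(K), []
    for k in range(K):
        mx, my = means[k][:dx], means[k][dx:]
        Sxx, Syx = covs[k][:dx, :dx], covs[k][dx:, :dx]
        d = x - mx
        resp[k] = weights[k] * np.exp(-0.5 * d @ np.linalg.solve(Sxx, d)) \
                  / np.sqrt(np.linalg.det(2 * np.pi * Sxx))
        cond_means.append(my + Syx @ np.linalg.solve(Sxx, d))
    resp /= resp.sum()                      # component responsibilities
    return sum(r * m for r, m in zip(resp, cond_means))

w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [1.0, 1.0]])     # 1-D gaze -> 1-D velocity
S = np.array([np.eye(2) * 0.1] * 2)
print(gmr(np.array([0.9]), w, mu, S, dx=1))  # close to 1.0
```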
|
| |
| 15:00-16:30, Paper ThI2I.37 | Add to My Program |
| MATrack: Efficient Multiscale Adaptive Tracker for Real-Time Nighttime UAV Operations |
|
| Li, Xuzhao | Nanyang Technological University |
| Li, Xuchen | Nanyang Technological University |
| Hu, Shiyu | Nanyang Technological University |
Keywords: Visual Tracking, Aerial Systems: Applications, Aerial Systems: Perception and Autonomy
Abstract: Nighttime UAV tracking faces significant challenges in real-world robotics operations. Low-light conditions not only limit visual perception capabilities, but cluttered backgrounds and frequent viewpoint changes also cause existing trackers to drift or fail during deployment. To address these difficulties, researchers have proposed solutions based on low-light enhancement and domain adaptation. However, these methods still have notable shortcomings in actual UAV systems: low-light enhancement often introduces visual artifacts, domain adaptation methods are computationally expensive and existing lightweight designs struggle to fully leverage dynamic object information. Based on an in-depth analysis of these key issues, we propose MATrack—a multiscale adaptive system designed specifically for nighttime UAV tracking. MATrack tackles the main technical challenges of nighttime tracking through the collaborative work of three core modules: Multiscale Hierarchy Blende (MHB) enhances feature consistency between static and dynamic templates. Adaptive Key Token Gate accurately identifies object information within complex backgrounds. Nighttime Template Calibrator (NTC) ensures stable tracking performance over long sequences. Extensive experiments show that MATrack achieves a significant performance improvement. On the UAVDark135 benchmark, its precision, normalized precision and AUC surpass state-of-the-art (SOTA) methods by 5.9%, 5.4% and 4.2% respectively, while maintaining a real-time processing speed of 81 FPS. Further tests on a real-world UAV platform validate the system's reliability, demonstrating that MATrack can provide stable and effective nighttime UAV tracking support for critical robotics applications such as nighttime search and rescue and border patrol.
|
| |
| 15:00-16:30, Paper ThI2I.38 | Add to My Program |
| Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting |
|
| Wang, Weiquan | Zhejiang University |
| Xiao, Jun | Zhejiang University |
| Shao, Feifei | Zhejiang University |
| Yang, Yi | ReLER, CCAI, Zhejiang University |
| Zhuang, Yueting | Zhejiang University |
| Chen, Long | The Hong Kong University of Science and Technology |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, Human-Centered Automation
Abstract: Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.
|
| |
| 15:00-16:30, Paper ThI2I.39 | Add to My Program |
| Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction |
|
| Jiang, Shuo | Tongji University |
| Li, Haonan | Tongji University |
| Ren, Ruochen | Tongji University |
| Zhou, Yanmin | Tongji University |
| Wang, Zhipeng | Tongji University |
| He, Bin | Tongji University |
Keywords: Datasets for Human Motion, Perception-Action Coupling, Human-Robot Collaboration
Abstract: Cutting-edge robot learning techniques, including foundation models and imitation learning from humans, all pose huge demands on large-scale, high-quality datasets, which constitute one of the bottlenecks in the field of general intelligent robots. This paper presents the Kaiwu multimodal dataset to address the lack of real-world synchronized multimodal data in sophisticated assembly scenarios, especially with dynamics information and its fine-grained labelling. The dataset first provides an integrated human-environment-robot data collection framework with 20 subjects and 30 interaction objects, resulting in a total of 11,664 instances of integrated actions. For each demonstration, hand motions, operation pressures, sounds of the assembly process, multi-view videos, high-precision motion capture information, eye gaze with first-person videos, and electromyography signals are all recorded. Fine-grained multi-level annotation based on absolute timestamps and semantic segmentation labelling are performed. The Kaiwu dataset aims to facilitate robot learning, dexterous manipulation, human intention investigation and human-robot collaboration research.
|
| |
| 15:00-16:30, Paper ThI2I.40 | Add to My Program |
| Pc-dbCBS: Kinodynamic Motion Planning of Physically-Coupled Robot Teams |
|
| Wahba, Khaled | Technical University of Berlin |
| Hoenig, Wolfgang | TU Berlin |
Keywords: Constrained Motion Planning, Multi-Robot Systems, Motion and Path Planning
Abstract: Motion planning problems for physically-coupled multi-robot systems in cluttered environments are challenging due to their high dimensionality. Existing methods combining sampling-based planners with trajectory optimization produce suboptimal results and lack theoretical guarantees. We propose Physically-coupled discontinuity-bounded Conflict-Based Search (pc-dbCBS), an anytime kinodynamic motion planner that extends discontinuity-bounded CBS to rigidly-coupled systems. Our approach proposes a tri-level conflict detection and resolution framework that includes the physical coupling between the robots. Moreover, pc-dbCBS alternates iteratively between state space representations, thereby preserving probabilistic completeness and asymptotic optimality while relying only on single-robot motion primitives. Across 25 simulated and six real-world problems involving multirotors carrying a cable-suspended payload and differential-drive robots linked by rigid rods, pc-dbCBS solves up to 90% more instances than a state-of-the-art baseline, planning trajectories up to 60% faster with significantly reduced planning time.
|
| |
| 15:00-16:30, Paper ThI2I.41 | Add to My Program |
| PS-ECBS: A New Algorithm for Multi-Agent Path Finding with Optimal Task Assignment |
|
| Zhang, Cheng | Northeastern University |
| Liu, Shixin | Northeastern University |
Keywords: Field Robots, Motion and Path Planning, Logistics
Abstract: The multi-agent path finding (MAPF) problem in warehouse automation consists of optimal task assignment and path planning, where short runtimes are necessary. In this paper, we present a new MAPF algorithm that handles dynamic start and end positions of the robots, called Position-Selection Enhanced Conflict-Based Search (PS-ECBS). Conflict-Based Search (CBS) is a well-known framework for finding collision-free paths under a given fixed task assignment, while ECBS is a bounded-suboptimal variant of CBS that uses focal search to speed up CBS. Mixed integer linear programming (MILP) is introduced to formulate the dynamic model for optimal task assignment, and the combination of MILP and ECBS yields the PS-ECBS algorithm. The solving process of PS-ECBS consists of multiple iterations, and in each iteration an additional constraint is added to modify the model. In the computational experiments, the processes of picking up and putting back shelves in the warehouse can occur at the same time under PS-ECBS. We also analyze the iterative principle of PS-ECBS and compare its performance with that of ECBS-TA. The computational results demonstrate that PS-ECBS runs significantly faster and has a clear advantage in jointly optimizing task assignment and path planning for large-scale warehouses.
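To illustrate the assignment core, a sketch using the Hungarian method: for a plain one-robot-one-task assignment it returns the same optimum an MILP would (the paper's MILP additionally models dynamic start/end positions and the iterative constraints):

```python
# Minimal task-assignment sketch: minimize total travel cost between robot
# and task positions. Positions and costs below are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

robots = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # robot positions
tasks = np.array([[1.0, 0.0], [4.0, 1.0], [1.0, 4.0]])    # shelf positions
cost = np.linalg.norm(robots[:, None] - tasks[None, :], axis=2)
r, t = linear_sum_assignment(cost)          # optimal robot-task pairing
print(list(zip(r, t)), cost[r, t].sum())
```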
|
| |
| 15:00-16:30, Paper ThI2I.42 | Add to My Program |
| Self-Supervised Consistency Enhanced Disentangled Learning for Neural Decoding Generalization in Brain-Machine Interfaces |
|
| Wei, Jiyu | Zhejiang University |
| Hong, Di | Zhejiang University |
| Zhang, Zhanjie | Zhejiang University |
| Rong, Dazhong | Zhejiang University |
| He, Qinming | Zhejiang University |
| Wang, Yueming | Zhejiang University |
Keywords: Brain-Machine Interfaces, Representation Learning, Long term Interaction
Abstract: Brain–Machine Interfaces (BMIs) provide a direct communication pathway between the brain and external devices, enabling humans to control assistive and robotic technologies, with potential applications in rehabilitation, human motor augmentation, and human-centered robotics. However, due to neural drift, the performance of BMIs decreases over time, posing challenges for long-term viability, particularly for invasive BMIs (iBMIs). Existing solutions suffer from two main drawbacks: (i) difficulty in learning robust neural representations, and (ii) neglecting that neural drift varies across motor parameters (e.g., velocity, direction, and speed). To overcome these limitations, we propose Self-Supervised Consistency enhanced Disentangled Learning (SSCDL), a neural decoding generalization framework built on two key innovations. We first design a backbone model named Consistency enhanced Neural Decoder (CND), using a novel teacher-student consistency constraint with simulated neural signal perturbations to learn robust representations invariant to neural drift. Then, we employ three dedicated CNDs under a Complementary-Disentangled Generalization (CDG) mechanism, which disentangles motor signals into velocity, direction and speed, inspired by neural preference theory. This disentangled learning enables SSCDL to capture invariant neural representations from diverse neural preference perspectives, significantly enhancing cross-day generalization. Extensive experimental results show that SSCDL delivers state-of-the-art decoding performance, exhibiting high robustness and cross-day stability. These capabilities underscore its strong potential for long-term interaction in human-centric robotic and fine-grained assistive applications.
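A hedged sketch of the teacher-student consistency constraint with simulated signal perturbations; the perturbation model, EMA rate, and network below are assumptions, not the paper's configuration:

```python
# Teacher-student consistency: the student sees a perturbed copy of the
# neural signal and is pulled toward the teacher's clean-signal prediction.
import torch
import torch.nn.functional as F

def consistency_loss(student, teacher, x, noise_std=0.1):
    x_pert = x + noise_std * torch.randn_like(x)   # simulated neural drift
    with torch.no_grad():
        target = teacher(x)                        # clean-signal prediction
    return F.mse_loss(student(x_pert), target)

@torch.no_grad()
def ema_update(teacher, student, m=0.99):
    """Slowly track the student with an exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

student = torch.nn.Linear(96, 3)        # 96 channels -> 3-D velocity
teacher = torch.nn.Linear(96, 3)
teacher.load_state_dict(student.state_dict())
print(consistency_loss(student, teacher, torch.randn(32, 96)))
```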
|
| |
| 15:00-16:30, Paper ThI2I.43 | Add to My Program |
| Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements |
|
| Job, Marco | NTNU |
| Stastny, Thomas | ETH Zurich |
| Kelasidi, Eleni | NTNU |
| Siegwart, Roland | ETH Zurich |
| Pantic, Michael | ETH Zurich |
Keywords: Field Robots, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: Autonomous field robots operating in unstructured environments require robust perception to ensure safe and reliable operations. Recent advances in monocular depth estimation have demonstrated the potential of low-cost cameras as depth sensors; however, their adoption in field robotics remains limited due to the absence of reliable scale cues, ambiguous or low-texture conditions, and the scarcity of large-scale datasets. To address these challenges, we propose a depth completion model that trains on synthetic data and uses extremely sparse measurements from depth sensors to predict dense metric depth in unseen field robotics environments. A synthetic dataset generation pipeline tailored to field robotics enables the creation of multiple realistic datasets for training purposes. This dataset generation approach utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to simulate diverse field robotics scenarios. Our approach achieves an end-to-end latency of 53 ms per frame on an Nvidia Jetson AGX Orin, enabling real-time deployment on embedded platforms. Extensive evaluation demonstrates competitive performance across diverse real-world field robotics scenarios.
|
| |
| 15:00-16:30, Paper ThI2I.44 | Add to My Program |
| Autonomous Balloon Based Adaptive Sliding Mode Control and Infinite-Horizon POMDP |
|
| Nguyen, Van Chung | University of Nevada, Reno |
| Nguyen, An | University of Nevada, Reno |
| Le, Chuong | University of Nevada, Reno |
| Srikar, Gaurav | University of Nevada, Reno |
| Do, Thanh Nho | University of New South Wales |
| La, Hung | University of Nevada at Reno |
Keywords: Aerial Systems: Applications, Planning under Uncertainty, Robust/Adaptive Control
Abstract: This paper presents a novel infinite-horizon Partially Observable Markov Decision Process (POMDP) framework with adaptive sliding mode control (ASMC) for the autonomous navigation of balloons. The proposed method integrates an altitude controller designed to account for thermodynamic and real wind-field constraints with an infinite-horizon POMDP for wind-optimal navigation. First, an adaptive sliding mode control is developed to ensure the balloon's internal stability under uncertainties in pressure, external wind fields, and temperature. Subsequently, a reference strategy is formulated using the infinite-horizon POMDP to exploit wind dynamics for station-keeping. The system estimates wind direction in real time and computes actions based on these observations. Experimental results demonstrate the framework's ability to converge on efficient navigation policies while compensating for the partial observability of wind dynamics. This approach is particularly suited for aerial or underwater vehicles operating in stratified flow environments, offering a computationally tractable solution for real-world deployment.
|
| |
| 15:00-16:30, Paper ThI2I.45 | Add to My Program |
| TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction |
|
| Huang, Yiyao | National University of Singapore |
| Zheng, Zhedong | University of Macau |
| Yu, Ziwei | National University of Singapore |
| Wang, Yaxiong | Hefei University of Technology |
| Tse, Tze Ho Elden | National University of Singapore |
| Yao, Angela | National University of Singapore |
Keywords: Contact Modeling, Deep Learning in Grasping and Manipulation, Grasping
Abstract: Pre-defined 3D object templates are widely used in 3D reconstruction of hand-object interactions. However, they often require substantial manual efforts to capture or source, and inherently restrict the adaptability of models to unconstrained interaction scenarios, e.g., heavily-occluded objects. To overcome this bottleneck, we propose a new Text-Instructed Generation and Refinement (TIGeR) framework, harnessing the power of intuitive text-driven priors to steer the object shape refinement and pose estimation. We use a two-stage framework: text-instructed prior generation followed by vision-guided refinement. As the name implies, we first leverage off-the-shelf models to generate shape priors according to the text description without tedious 3D crafting. Considering the geometric gap between the synthesized prototype and the real object interacted with the hand, we further calibrate the synthesized prototype via 2D-3D collaborative attention. TIGeR achieves competitive performance, i.e., 1.979 and 5.468 object Chamfer distance on the widely-used Dex-YCB and Obman datasets, respectively, surpassing existing template-free methods. Notably, the proposed framework shows robustness to occlusion, while maintaining compatibility with heterogeneous prior sources, e.g., retrieved hand-crafted prototypes, in practical deployment scenarios. Our code will be available at https://github.com/huangyiyNUS/TIGeR.
|
| |
| 15:00-16:30, Paper ThI2I.46 | Add to My Program |
| SignBot: Learning Human-To-Humanoid Sign Language Interaction |
|
| Qiao, Guanren | The Chinese University of Hong Kong, Shenzhen |
| Lin, Sixu | Harbin Institute of Technology, Shenzhen |
| Zuo, Ronglai | Imperial College London |
| Wu, Zhizheng | The Chinese University of Hong Kong, Shenzhen |
| Jia, Kui | The Chinese University of Hong Kong, Shenzhen |
| Liu, Guiliang | Chinese University of Hong Kong, Shenzhen |
Keywords: Gesture, Posture and Facial Expressions, Humanoid Robot Systems, Reinforcement Learning
Abstract: Sign language is a natural and visual form of language that uses movements and expressions to convey meaning, serving as a crucial means of communication for individuals who are deaf or hard-of-hearing (DHH). However, the number of people proficient in sign language remains limited, highlighting the need for technological advancements to bridge communication gaps and foster interactions with minorities. Based on recent advancements in embodied humanoid robots, we propose SignBot, a novel framework for human-robot sign language interaction. SignBot integrates a cerebellum-inspired motion control component and a cerebral-oriented module for comprehension and interaction. Specifically, SignBot consists of: 1) Motion Retargeting, which converts human sign language datasets into robot-compatible kinematics; 2) Motion Control, which leverages a learning-based paradigm to develop a robust humanoid control policy for tracking sign language gestures; and 3) Generative Interaction, which incorporates translator, responser, and generator of sign language, thereby enabling natural and effective communication between robots and humans. Simulation and real-world experimental results demonstrate that SignBot can effectively facilitate human-robot interaction and perform sign language motions with diverse robots and datasets. SignBot represents a significant advancement in automatic sign language interaction on embodied humanoid robot platforms, providing a promising solution to improve communication accessibility for the DHH community.
|
| |
| 15:00-16:30, Paper ThI2I.47 | Add to My Program |
| DMRP-Bench: An Integrated, Unified Multi-Robot Motion Planning Benchmark in Dynamic Environments |
|
| Hu, Zhijie | NanKai University |
| Zhang, Xuebo | Nankai University |
| Wang, Runhua | NanKai University |
| Li, Yichen | Nankai University |
| Bi, Qingchen | NanKai University |
|
|
| |
| 15:00-16:30, Paper ThI2I.48 | Add to My Program |
| Deep Reinforcement Learning Based Autonomous Drift System for Abrupt Obstacle Avoidance |
|
| Liu, Yang | The Hong Kong University of Science and Technology |
| Mei, Xiaodong | HKUST |
| Wu, Jin | University of Science and Technology Beijing |
| Xue, Bohuan | South China Normal University |
| |
| 15:00-16:30, Paper ThI2I.49 | Add to My Program |
| Extended Force and Velocity Prediction in Human-Robot Collaborative Transportation through Future Environment Representation Estimation |
|
| Dominguez-Vidal, Jose Enrique | Polytechnic University of Catalonia |
Keywords: Physical Human-Robot Interaction, Intention Recognition, Deep Learning Methods
Abstract: In this work, we address the challenge of predicting human-applied force and velocity during collaborative object transportation over extended distances (5–8 m). We enhance state-of-the-art predictors by refining their input data processing, which significantly improves prediction accuracy. Furthermore, we extend the temporal prediction horizon from 1 s to 2 s without compromising performance, by introducing an extra environmental prediction module that conditions force and velocity estimations on anticipated sensory input. This integration captures the contextual dependency of human behaviour during joint transport. Experimental evaluations, both on a dataset and in real-world settings, validate the effectiveness of our approach. Specifically, our best model achieves success rates on the test set of up to 90.4% in predicting the human's exerted force and up to 93.0% for the velocity of the human-robot pair during the next 2 s, and up to 87.1% and 91.3%, respectively, in real experiments.
|
| |
| 15:00-16:30, Paper ThI2I.50 | Add to My Program |
| Benchmarking Multi-View BEV Object Detection with Mixed Pinhole and Fisheye Cameras |
|
| Liu, Xiangzhong | Technical University of Munich |
| Shen, Hao | Technische Universität München |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Omnidirectional Vision
Abstract: Modern autonomous driving systems increasingly rely on mixed camera configurations with pinhole and fisheye cameras for full-view perception. However, Bird's-Eye View (BEV) 3D object detection models are predominantly designed for pinhole cameras, leading to performance degradation under fisheye distortion. To bridge this gap, we introduce a multi-view BEV detection benchmark with mixed cameras by converting KITTI-360 into nuScenes format. Our study encompasses three adaptations: rectification for zero-shot evaluation and fine-tuning of nuScenes-trained models, distortion-aware view transformation modules (VTMs) via the MEI camera model, and polar coordinate representations to better align with radial distortion. We systematically evaluate three representative BEV architectures, BEVFormer, BEVDet and PETR, across these strategies. We demonstrate that projection-free architectures are inherently more robust and effective against fisheye distortion than other VTMs. This work establishes the first real-data 3D detection benchmark with fisheye and pinhole images and provides systematic adaptation and practical guidelines for designing robust and cost-effective 3D perception systems. The code is available at https://github.com/CesarLiu/FishBEVOD.git.
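For background, a minimal sketch of the MEI (unified) camera model projection that the distortion-aware VTMs build on, with the radial/tangential distortion terms omitted for brevity:

```python
# MEI unified camera model: project onto the unit sphere, shift the optical
# center by xi, then apply a pinhole intrinsic matrix K.
import numpy as np

def mei_project(P, xi, K):
    """P: (N, 3) points in the camera frame; xi: mirror parameter; K: (3, 3)."""
    Ps = P / np.linalg.norm(P, axis=1, keepdims=True)   # onto the unit sphere
    z = Ps[:, 2] + xi                                    # shifted optical axis
    m = np.stack([Ps[:, 0] / z, Ps[:, 1] / z, np.ones(len(P))], axis=1)
    return (K @ m.T).T[:, :2]                            # pixel coordinates

K = np.array([[400.0, 0, 320], [0, 400.0, 240], [0, 0, 1]])
pts = np.array([[0.5, 0.2, 1.0], [2.0, 0.0, 1.0]])       # wide-angle test rays
print(mei_project(pts, xi=0.0, K=K))    # xi = 0 reduces to a plain pinhole
print(mei_project(pts, xi=0.8, K=K))    # xi > 0 compresses peripheral rays
```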
|
| |
| 15:00-16:30, Paper ThI2I.51 | Add to My Program |
| Informative Object-Centric Next Best View for Object-Aware 3D Gaussian Splatting in Cluttered Scenes |
|
| Jeong, Seung Hoon | Seoul National University |
| Lee, Eunho | Seoul National University |
| Kim, Jeongyun | SNU |
| Kim, Ayoung | Seoul National University |
Keywords: Perception for Grasping and Manipulation, Manipulation Planning, Deep Learning in Grasping and Manipulation
Abstract: In cluttered scenes with inevitable occlusions and incomplete observations, selecting informative viewpoints is essential for building a reliable representation. In this context, 3D Gaussian Splatting (3DGS) offers a distinct advantage, as it can explicitly guide the selection of subsequent viewpoints and then refine the representation with new observations. However, existing approaches rely solely on geometric cues, neglect manipulation-relevant semantics, and tend to prioritize exploitation over exploration. To tackle these limitations, we introduce an instance-aware Next Best View (NBV) policy that prioritizes underexplored regions by leveraging object features. Specifically, our object-aware 3DGS distills instance-level information into one-hot object vectors, which are used to compute a confidence-weighted information gain that guides the identification of regions associated with erroneous and uncertain Gaussians. Furthermore, our method can be easily adapted to an object-centric NBV, which focuses view selection on a target object, thereby improving reconstruction robustness to object placement. Experiments demonstrate that our NBV policy reduces depth error by up to 77.14% on the synthetic dataset and 34.10% on the real-world GraspNet dataset compared to baselines. Moreover, compared to targeting the entire scene, performing NBV on a specific object yields an additional reduction of 25.60% in depth error for that object. We further validate the effectiveness of our approach through real-world robotic manipulation tasks.
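As an illustration of the confidence-weighted information gain for view selection described above, the following minimal sketch scores candidate views by the total uncertainty of the Gaussians they observe; all names (`visibility_fn`, `candidate_views`, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch; visibility_fn and all names are illustrative assumptions.
import numpy as np

def select_next_best_view(gaussian_centers, gaussian_confidence,
                          candidate_views, visibility_fn):
    """Pick the candidate view whose visible Gaussians carry the most uncertainty.

    gaussian_centers:    (N, 3) Gaussian means in the world frame
    gaussian_confidence: (N,) values in [0, 1]; low = uncertain/erroneous
    candidate_views:     iterable of camera poses (4x4 world-from-camera)
    visibility_fn:       (centers, pose) -> boolean mask (N,) of visible Gaussians
    """
    best_view, best_gain = None, -np.inf
    for pose in candidate_views:
        visible = visibility_fn(gaussian_centers, pose)
        # Confidence-weighted information gain: uncertain Gaussians count more,
        # steering the policy toward underexplored regions.
        gain = float(np.sum(visible * (1.0 - gaussian_confidence)))
        if gain > best_gain:
            best_view, best_gain = pose, gain
    return best_view, best_gain
```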
|
| |
| 15:00-16:30, Paper ThI2I.52 | Add to My Program |
| LH-DETR: A Lightweight Hybrid Architecture for End-To-End Object Detection in UAV Images |
|
| Xu, Feifei | Shanghai University of Electric Power |
| Sun, Lupeng | Shanghai University of Electric Power |
| Li, Dongyang | Shanghai University of Electric Power |
| Wu, Guoxiang | Shanghai University of Electric Power |
| Lv, Chenchuan | Shanghai University of Electric Power |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Manufacturing, Deep Learning for Visual Perception
Abstract: Object detection in unmanned aerial vehicle (UAV) imagery has become a research focus at the intersection of computer vision and robotics, with increasingly widespread applications in security inspection, agricultural monitoring, disaster relief, and beyond. The key to achieving autonomous perception and decision-making for UAVs lies in precise, real-time object detection. However, objects seen from the UAV perspective are often small and densely distributed, and onboard computing resources are limited, which poses significant challenges to traditional detection algorithms. To address these trade-offs, this paper proposes LH-DETR, a lightweight hybrid architecture for end-to-end object detection built on three specialized innovations. We first introduce the Wavelet-Mamba Hybrid Block (WMHB), a novel backbone component that combines the linear complexity of the Mamba state-space model for capturing long-range dependencies with the multi-scale feature extraction capabilities of wavelet transforms. To better identify small objects, a Frequency-Aware Dynamic FFN (FAD-FFN) is designed to selectively amplify critical high-frequency components—like edges and textures—by analyzing features in the frequency domain. Additionally, AutoSliding Varifocal Loss (ASVLoss), an adaptive loss function that dynamically shifts its focus from medium-quality to high-quality predictions as training progresses, is defined to stabilize the model's optimization. Experiments on public aerial datasets demonstrate that LH-DETR achieves an outstanding balance between accuracy and speed, significantly improving detection performance for small objects while greatly reducing computational complexity.
|
| |
| 15:00-16:30, Paper ThI2I.53 | Add to My Program |
| PCASim: Promptable Closed-Loop Adversarial Simulation for Urban Traffic Environment |
|
| Zhang, Chuancheng | Shandong University |
| Wang, Zhenhao | Shandong University |
| Li, Kaizheng | Shandong University |
| Lin, Yaran | Shandong University |
| Guo, Qiang | Shandong University of Finance and Economics |
| Jiang, Bin | Shandong University |
Keywords: Motion and Path Planning, Intelligent Transportation Systems, Autonomous Agents
Abstract: Real-world autonomous driving, particularly in urban environments with numerous corner cases, requires rigorous testing to ensure product safety and robustness. However, few studies have explored integrating adversarial scenario generation with the training of safety agents in closed-loop testing, enabling efficient co-evolution and mutual enhancement of both. To address this challenge, an adversarial behavior knowledge repository is constructed by applying rule-based filtering to an open-source dataset, combined with knowledge retrieval modules tailored for simulation environments. A large language model (LLM) is employed to integrate knowledge-, data-, and adversarial-driven approaches, generating safety-critical traffic scenarios customized to user needs. Additionally, while evaluating the generated scenarios, we employ reinforcement learning models to train the behaviors of different types of vehicles, thereby enriching scenario diversity beyond existing datasets while preserving realism. Experimental results demonstrate that the proposed framework improves the accuracy of domain-specific language generation by 12%. Moreover, the success rate of newly generated scenario transformations increases by 8%, while obstacle-avoidance capability is enhanced by 30%. For the complete manuscript, please refer to: https://zhenhaooo.github.io/PCASim.github.io/
|
| |
| 15:00-16:30, Paper ThI2I.54 | Add to My Program |
| Distracted Robot: How Visual Clutter Undermines Robotic Manipulation |
|
| Rasouli, Amir | Huawei Technologies Canada |
| Alban, Montgomery Tucker | Huawei Technologies Canada |
| Pakdamansavoji, Sajjad | Huawei |
| Li, Zhiyuan | University of Toronto |
| Zhang, Zhanguang | Huawei Noah's Ark Lab |
| Wu, Yangzheng | Huawei |
| Zhao, Xuan | Huawei |
Keywords: Performance Evaluation and Benchmarking, Manipulation Planning, AI-Based Methods
Abstract: In this work, we propose an evaluation protocol for examining the performance of robotic manipulation policies in cluttered scenes. Contrary to prior works, we approach evaluation from a psychophysical perspective; we therefore use a unified measure of clutter that accounts for environmental factors as well as the quantity, characteristics, and arrangement of distractors. Using this measure, we systematically construct evaluation scenarios in both hyper-realistic simulation and the real world and conduct extensive experimentation on manipulation policies, in particular vision-language-action (VLA) models. Our experiments highlight the significant impact of scene clutter, which lowers the performance of the policies by as much as 34%, and show that despite achieving similar average performance across the tasks, different VLA policies have unique vulnerabilities and relatively low agreement on success scenarios. We further show that our clutter measure is an effective indicator of performance degradation and analyze the impact of distractors in terms of their quantity and occluding influence. Finally, we show that finetuning on enhanced data, although effective, does not equally remedy all negative impacts of clutter on performance.
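To make the idea of a unified clutter measure concrete, here is a hedged sketch that combines distractor density and occlusion into a single scalar; the weighting scheme and the inputs (`distractor_areas`, `occlusion_ratios`) are assumptions for illustration, not the paper's actual measure.

```python
import numpy as np

def clutter_score(distractor_areas, occlusion_ratios, scene_area,
                  w_density=0.5, w_occlusion=0.5):
    """Combine distractor coverage and occlusion into a single clutter scalar.

    distractor_areas: (N,) projected image area of each distractor (pixels)
    occlusion_ratios: (N,) fraction of the target object occluded by each
                      distractor, values in [0, 1]
    scene_area:       total image area in pixels
    """
    density = np.clip(np.sum(distractor_areas) / scene_area, 0.0, 1.0)
    occlusion = float(np.max(occlusion_ratios, initial=0.0))
    return w_density * density + w_occlusion * occlusion
```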
|
| |
| 15:00-16:30, Paper ThI2I.55 | Add to My Program |
| Prompt-To-State Stable Vision-Language MPC for Approximated Neural Network Dynamics: A Case Study on Soft Robot Control |
|
| Emanuele, Nicotra | UNSW Sydney |
| Davies, James J. | University of New South Wales |
| Zhu, Kefan | UNSW Sydney |
| Bibhu, Sharma | UNSW Sydney |
| Ji, Adrienne | University of New South Wales |
| Phan, Phuoc Thien | University of New South Wales |
| La, Hung Manh | University of Nevada |
| Lovell, Nigel Hamilton | University of New South Wales |
| Do, Thanh Nho | University of New South Wales |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications, Machine Learning for Robot Control
Abstract: The integration of large-scale foundation models in control loops has shown strong potential for executing complex tasks directly from natural language inputs. However, achieving stability and real-time performance remains a significant challenge, particularly for systems with hard-to-model dynamics. In this paper, we introduce Prompt-to-State Stability (PSS) and propose the Prompt-to-State Stable Vision-Language Model Predictive Control (PSS-VLMPC) framework, which couples a vision-language model (VLM) with a robust model predictive control (MPC) scheme. The VLM interprets user commands and visual feedback, converting them into control-relevant parameters for the MPC. System dynamics are fully learned by a neural network and then approximated to enable real-time MPC performance. Building on prediction error bounds, we provide rigorous closed-loop stability guarantees and validate the effectiveness of PSS-VLMPC through both simulations and real-world experiments on a soft continuum robot, demonstrating its ability to robustly execute tasks from natural language instructions.
|
| |
| 15:00-16:30, Paper ThI2I.56 | Add to My Program |
| Real-Is-Sim: Bridging the Sim-To-Real Gap with a Dynamic Digital Twin |
|
| Abou-Chakra, Jad | Robotics and AI Institute |
| Sun, Lingfeng | Robotics and AI Institute |
| Rana, Krishan | Queensland University of Technology |
| May, Brandon | Robotics and AI Institute |
| Schmeckpeper, Karl | Robotics and AI Institute |
| Sünderhauf, Niko | Queensland University of Technology |
| Minniti, Maria Vittoria | Robotics and AI Institute |
| Herlant, Laura | Robotics and AI Institute |
Keywords: Software Architecture for Robotic and Automation, Imitation Learning, Hardware-Software Integration in Robotics
Abstract: We introduce real-is-sim, a new approach to integrating simulation into behavior cloning pipelines. In contrast to real-only methods, which lack the ability to safely test policies before deployment, and sim-to-real methods, which require complex adaptation to cross the sim-to-real gap, our framework allows policies to seamlessly switch between running on real hardware and running in parallelized virtual environments. At the center of real-is-sim is a dynamic digital twin, powered by the Embodied Gaussian simulator, that synchronizes with the real world at 60 Hz. This twin acts as a mediator between the behavior cloning policy and the real robot. Policies are trained using representations derived from simulator states and always act on the simulated robot, never the real one. During deployment, the real robot simply follows the simulated robot's joint states, and the simulation is continuously corrected with real-world measurements. This setup, where the simulator drives all policy execution and maintains real-time synchronization with the physical world, shifts the responsibility of crossing the sim-to-real gap to the digital twin's synchronization mechanisms instead of the policy itself. We demonstrate real-is-sim on a long-horizon manipulation task (PushT), showing that virtual evaluations are consistent with real-world results. We further show how real-world data can be augmented with virtual rollouts, and we compare policies trained on different representations derived from the simulator state, including object poses and rendered images from both static and robot-mounted cameras. Our results highlight the flexibility of the real-is-sim framework across training, evaluation, and deployment stages.
|
| |
| 15:00-16:30, Paper ThI2I.57 | Add to My Program |
| V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space |
|
| Aladin, Faiz | University of Southern California |
| Balasubramanian, Ashwin | University of Southern California |
| Lindemann, Lars | University of Southern California |
| Seita, Daniel | University of Southern California |
Keywords: Robot Safety, AI-Based Methods, Visual Learning
Abstract: Reachability analysis has become increasingly important in robotics to distinguish safe from unsafe states. Unfortunately, existing reachability and safety analysis methods often fall short, as they typically require known system dynamics or large datasets to estimate accurate system models, are computationally expensive, and assume full state information. A recent method, called MORALS, aims to address these shortcomings by using topological tools to estimate Regions of Attraction (ROA) in a low-dimensional latent space. However, MORALS still relies on full state knowledge and has not been studied when only sensor measurements are available. This paper presents Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space (V-MORALS). V-MORALS takes in a dataset of image-based trajectories of a system under a given controller, and learns a latent space for reachability analysis. Using this learned latent space, our method is able to generate well-defined Morse Graphs, from which we can compute ROAs for various systems and controllers. V-MORALS provides capabilities similar to the original MORALS architecture without relying on state knowledge, and using only high-level sensor data. Our project website is at: https://v-morals.onrender.com.
|
| |
| 15:00-16:30, Paper ThI2I.58 | Add to My Program |
| DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models |
|
| Song, Jingyu | University of Michigan |
| Li, Zhenxin | Fudan University, NVIDIA |
| Lan, Shiyi | NVIDIA |
| Sun, Xinglong | Stanford & UIUC |
| Chang, Nadine | Nvidia |
| Shen, Maying | Nvidia |
| Chen, Jingde | NVIDIA Corporation |
| Skinner, Katherine | University of Michigan |
| Alvarez, Jose | NVIDIA |
Keywords: Autonomous Vehicle Navigation, AI-Based Methods, Intelligent Transportation Systems
Abstract: Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios in which context is critical for correct judgment, annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM)-based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems. The project page for DriveCritic is https://song-jingyu.github.io/DriveCritic.
|
| |
| 15:00-16:30, Paper ThI2I.59 | Add to My Program |
| D-CAT: Decoupled Cross-Attention Knowledge Transfer between Sensor Modalities for Unimodal Inference |
|
| Daher, Leen | Ecole Polytechnique Fédérale De Lausanne |
| Wang, Zhaobo | École Polytechnique Fédérale De Lausanne |
| Mielle, Malcolm | Schindler |
Keywords: Multi-Modal Perception for HRI, Gesture, Posture and Facial Expressions, Sensor Fusion
Abstract: Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically or technically feasible. To address this, we propose Decoupled Cross-Attention Knowledge Transfer (D-CAT), a framework that aligns modality-specific representations without requiring joint sensor modalities during inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces the alignment of the sensors' feature spaces without coupling the classification pipelines of the two modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to +10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model is not overfitted to the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at [url will be added upon acceptance].
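A minimal PyTorch sketch of a cross-attention alignment loss in the spirit of D-CAT is shown below, assuming both modalities are already embedded to a common dimension; the module and tensor shapes are illustrative assumptions, not the released code.

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionAlignment(nn.Module):
    """Align a source-modality sequence to a target modality's feature space.

    Cross-attention lets each target token gather the most relevant source
    features; the cosine loss then pulls the attended features toward the
    target features without coupling the two classification pipelines.
    """
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, src_feats, tgt_feats):
        # src_feats: (B, S, D) e.g. video tokens; tgt_feats: (B, T, D) e.g. IMU tokens
        attended, _ = self.attn(query=tgt_feats, key=src_feats, value=src_feats)
        return (1.0 - F.cosine_similarity(attended, tgt_feats, dim=-1)).mean()

# Usage sketch: total_loss = cls_loss + lam * align(video_feats, imu_feats)
```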
|
| |
| 15:00-16:30, Paper ThI2I.60 | Add to My Program |
| Fast ECoT: Efficient Embodied Chain-Of-Thought Via Thoughts Reuse |
|
| Duan, Zhekai | University College London |
| Zhang, Yuan | University of Freiburg |
| Geng, Shikai | University College London |
| Liu, Gaowen | Cisco Systems |
| Boedecker, Joschka | University of Freiburg |
| Lu, Chris Xiaoxuan | University College London |
Keywords: Imitation Learning, Engineering for Robotic Systems
Abstract: Embodied Chain-of-Thought (ECoT) reasoning enhances vision-language-action (VLA) models by improving performance and interpretability through intermediate reasoning steps. However, its sequential autoregressive token generation introduces significant inference latency, limiting real-time deployment. We propose Fast ECoT, an inference-time acceleration method that exploits the structured and repetitive nature of ECoT to (1) cache and reuse high-level reasoning across timesteps and (2) parallelise the generation of modular reasoning steps. Additionally, we introduce an asynchronous scheduler that decouples reasoning from action decoding, further boosting responsiveness. Fast ECoT requires no model changes or additional training and easily integrates into existing VLA pipelines. Experiments in both simulation (LIBERO) and real-world robot tasks show up to a 7.5× reduction in latency with comparable or improved task success rate and reasoning faithfulness, bringing ECoT policies closer to practical real-time deployment.
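The caching idea can be sketched as follows: the expensive high-level reasoning call is regenerated only every few control steps and reused in between. The `refresh_every` policy and the `generate_fn` interface are assumptions for illustration, not the paper's scheduler.

```python
class ReasoningCache:
    """Reuse high-level chain-of-thought across control timesteps.

    The expensive autoregressive reasoning call runs only every
    `refresh_every` steps; intermediate steps reuse the cached text.
    """
    def __init__(self, refresh_every=10):
        self.refresh_every = refresh_every
        self._cached = None
        self._age = 0

    def get(self, generate_fn, observation):
        # generate_fn(observation) is the slow reasoning call (assumed interface).
        if self._cached is None or self._age >= self.refresh_every:
            self._cached = generate_fn(observation)
            self._age = 0
        self._age += 1
        return self._cached
```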
|
| |
| 15:00-16:30, Paper ThI2I.61 | Add to My Program |
| Design and Validation of a Novel Quadruple-Disk CYcloidal Compact-Cam Reducer for Robotic Applications: Q-CYC |
|
| Bezzini, Riccardo | Scuola Superiore Sant'Anna |
| Bassani, Giulia | Scuola Superiore Sant'Anna |
| Avizzano, Carlo Alberto | Scuola Superiore Sant'Anna |
| Filippeschi, Alessandro | Scuola Superiore Sant'Anna |
Keywords: Mechanism Design, Actuation and Joint Mechanisms
Abstract: Reduction gearboxes play a key role in robotics actuation. Among the existing designs, cycloidal gears are gaining popularity for their efficiency, torque density, and robustness. The compact-cam architecture is a variant of the classical cycloidal drive that employs two rigidly coupled cycloidal disks to achieve high reduction ratios within a minimized radial profile. However, this design tends to suffer from low regularity, which can degrade performance in robotics applications. Building on the concept of double disks with a phase offset to increase regularity, in this work we present a novel Quadruple-disk CYcloidal Compact-cam (Q-CYC) reducer that applies the phase-offset principle to the compact-cam architecture. By incorporating two additional coupled disks, the proposed design enhances load distribution and motion regularity. Two open-source, 3D-printed prototypes (one implementing the conventional compact-cam transmission and one featuring the presented quadruple-disk architecture) are designed and experimentally evaluated. The analysis focuses on friction, gear play, backdrivability, and speed regularity, demonstrating that the quadruple-disk design offers significant improvements. Therefore, the results validate the effectiveness of the proposed approach in addressing known performance limitations of cycloidal compact-cam reducers, reducing gear play and improving both speed regularity and backdrivability.
|
| |
| 15:00-16:30, Paper ThI2I.62 | Add to My Program |
| Advancing Off-Road Autonomous Driving: The Large-Scale ORAD-3D Dataset and Comprehensive Benchmarks |
|
| Min, Chen | Chinese Academy of Sciences |
| Mei, Jilin | Institute of Computing Technology, Chinese Academy of Sciences |
| Zhai, Heng | Shanghaitech University |
| Wang, Shuai | Institute of Computing Technology, University of Chinese Academy of Sciences |
| Sun, Tong | University of Chinese Academy of Sciences |
| Kong, Fanjie | Xi'an Jiaotong University |
| Li, Haoyang | Nanchang University |
| Mao, Fangyuan | Institute of Computing Technology, Chinese Academy of Sciences |
| Liu, Fuyang | Institute of Computing Technology, Chinese Academy of Sciences |
| Wang, Shuo | Institute of Computing Technology, Chinese Academy of Sciences |
| Chen, Liang | Institute of Computing Technology: Beijing, CN |
| Zhao, Fangzhou | Institute of Computing Technology, Chinese Academy of Sciences |
| Xiao, Zhipeng | Defense Innovation Institute |
| Xue, Hanzhang | National University of Defense Technology |
| Fu, Hao | National University of Defense Technology |
| Nie, Yiming | National Innovation Institute of Defense Technology |
| Zhu, Qi | NIIDT |
| Xiao, Liang | Defense Innovation Institute |
| Zhao, Dawei | DII |
| Hu, Yu | Institute of Computing Technology Chinese Academy of Sciences |
Keywords: Data Sets for Robot Learning, Vision-Based Navigation, Field Robots
Abstract: A major bottleneck in off-road autonomous driving research lies in the scarcity of large-scale, high-quality datasets and benchmarks. To bridge this gap, we present ORAD-3D, which, to the best of our knowledge, is the largest dataset specifically curated for off-road autonomous driving. ORAD-3D covers a wide spectrum of terrains—including woodlands, farmlands, grasslands, riversides, gravel roads, cement roads, and rural areas—while capturing diverse environmental variations across weather conditions (sunny, rainy, foggy, and snowy) and illumination levels (bright daylight, daytime, twilight, and nighttime). Building upon this dataset, we establish a comprehensive suite of benchmark evaluations spanning five fundamental tasks: 2D free-space detection, 3D occupancy prediction, rough GPS-guided path planning, vision–language model–driven autonomous driving, and world model for off-road environments. Together, the dataset and benchmarks provide a unified and robust resource for advancing perception and planning in challenging off-road scenarios. The dataset is publicly available at https://github.com/chaytonmin/ORAD-3D-Dataset-For-Off-Road-AD.
|
| |
| 15:00-16:30, Paper ThI2I.63 | Add to My Program |
| CAPE: Context-Aware Diffusion Policy Via Proximal Mode Expansion for Collision Avoidance |
|
| Yang, Rui Heng | Huawei Technologies Canada |
| Zhao, Xuan | Huawei |
| Brunswic, Léo Maxime | Huawei Technologies Canada |
| Alban, Montgomery Tucker | Huawei Technologies Canada |
| Clémente, Matéo | Huawei |
| Cao, Tongtong | Noah's Ark Lab, Huawei Technologies |
| Jin, Jun | University of Alberta |
| Rasouli, Amir | Huawei Technologies Canada |
Keywords: Imitation Learning, Integrated Planning and Learning, Motion and Path Planning
Abstract: In robotics, diffusion models can capture multi-modal trajectories from demonstrations, making them a transformative approach in imitation learning. However, achieving optimal performance under this regimen requires a large-scale dataset, which is costly to obtain, especially for challenging tasks such as collision avoidance. In such tasks, generalization at test time demands coverage of many obstacle types and their spatial configurations, which is impractical to acquire purely via data. Recent works ease this burden with training-free guidance by injecting environmental context at inference; however, this only works when paired with a sufficiently diverse training dataset that yields a conditional trajectory distribution with rich multimodal coverage. To remedy this problem, we propose Context-Aware diffusion policy via Proximal mode Expansion (CAPE), a framework that expands trajectory distribution modes with a context-aware prior and guidance at inference via a novel prior-seeded iterative guided refinement procedure for motion replanning. The framework generates an initial trajectory plan and executes a short prefix trajectory; the remaining trajectory segment is then perturbed to an intermediate noise level, forming a context-aware trajectory prior that preserves goal consistency and previously expanded modes. Repeating the process with context-aware guided denoising iteratively expands mode support, allowing the policy to find smoother, less collision-prone trajectories. We evaluate CAPE on reaching and pick-and-place tasks in cluttered, unseen simulated and real-world settings and show that our proposed approach achieves up to an 80% higher success rate and a 4x improvement in replanning frequency compared to the state of the art, demonstrating better generalization to unseen environments.
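The core perturbation step, renoising the not-yet-executed trajectory segment to an intermediate diffusion level, can be sketched with the standard DDPM forward formula; the tensor layout and names below are assumptions, not the paper's code.

```python
import torch

def renoise_remaining_segment(traj, prefix_len, alphas_cumprod, t_mid):
    """Perturb the unexecuted part of a plan to an intermediate noise level.

    traj:           (B, T, D) clean trajectory from the previous plan
    prefix_len:     number of already-executed (kept) steps
    alphas_cumprod: (num_steps,) DDPM cumulative alpha schedule
    t_mid:          intermediate diffusion timestep to renoise to
    """
    prior = traj.clone()
    seg = traj[:, prefix_len:]
    noise = torch.randn_like(seg)
    a_bar = alphas_cumprod[t_mid]
    # Standard forward-diffusion formula: x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps
    prior[:, prefix_len:] = a_bar.sqrt() * seg + (1.0 - a_bar).sqrt() * noise
    return prior  # seed for the next round of guided denoising
```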
|
| |
| 15:00-16:30, Paper ThI2I.64 | Add to My Program |
| TI-3DGS: 3D Thermal Reconstruction Via Thermal Imaging-Guided 3D Gaussian Splatting |
|
| Tang, Yong | Harbin Institute of Technology |
| Li, Yunhao | Westlake University |
| Wang, Xiaodong | Westlake University |
| Song, Qiang | Hunan University |
| Qin, Bing | Harbin Institute of Technology |
| Feng, Xiaocheng | Harbin Institute of Technology |
| Yuan, Xin | Westlake University |
Keywords: Computer Vision for Automation, Computer Vision for Transportation, Visual Learning
Abstract: Thermal imaging, with its all-weather capabilities and strong penetration, enables 3D reconstruction in low-light and adverse conditions. In this paper, we investigate RGB-independent pure 3D thermal reconstruction, aiming to overcome the challenges of 3D reconstruction in extreme environments where RGB images are unavailable. However, directly applying visible-light 3D reconstruction methods to thermal images often leads to severe artifacts due to two key challenges: (i) thermal images lack rich textures, hindering detail reconstruction, and (ii) heat conduction causes intensity diffusion, resulting in blurred edges. To address these issues, we propose TI-3DGS, a novel 3D Gaussian Splatting framework guided by thermal imaging. We introduce a Thermal Imaging Field (TIF) to model radiance in thermal domains and a Thermal Attenuation-aware Density Control (TADC) strategy to densify sparse point clouds from low-texture thermal inputs. Additionally, we incorporate an edge-enhancement constraint to mitigate blur from heat diffusion. Extensive experiments on the TI-NSD dataset, covering indoor and outdoor scenarios, show that our TI-3DGS achieves state-of-the-art performance, effectively overcoming texture sparsity and edge degradation in thermal reconstruction.
|
| |
| 15:00-16:30, Paper ThI2I.65 | Add to My Program |
| Towards Efficient Semi-Supervised Semantic Segmentation for Solid-State LiDAR Point Clouds |
|
| Abla, Mardanjan | Xinjiang University |
| Firkat, Eksan | Tsinghua University |
| Xie, Bangquan | Great Bay University |
| Suleyman, Eliyas | School of Computing Science, University of Glasgow |
| Gao, Jiazhan | Xinjiang University |
| Zhu, Bin | Tsinghua University |
| Hamdulla, Askar | Xinjiang University |
Keywords: Semantic Scene Understanding
Abstract: LiDAR-based 3D semantic segmentation is a critical task in autonomous driving, but its scalability is limited by the reliance on large-scale labeled datasets. Semi-supervised learning (SSL) offers a potential solution by leveraging unlabeled data. However, most existing SSL segmentation methods are designed for mechanical spinning LiDAR (MSLR) and fail to generalize well to solid-state LiDAR (SSLR) due to different scanning patterns and point cloud distributions. To address this challenge, we propose SSLiMix, a novel semi-supervised segmentation method with checkerboard mixing for solid-state LiDAR. Unlike prior MSLR-oriented methods, SSLiMix employs 2D grid partitioning with checkerboard mixing to adapt to SSLR's dense and uniform point clouds, thereby preserving spatial consistency even when beam-based augmentations fail. Additionally, we introduce a hierarchical confidence-aware pseudo-labeling mechanism (HCAP), which classifies pseudo-labels by confidence and applies targeted processing to enhance pseudo-label reliability. Experiments on the PandaSet dataset show that SSLiMix improves mIoU by 11.3% over the fully-supervised baseline using only 1% labeled data, demonstrating its effectiveness in low-label regimes and providing a strong benchmark for semi-supervised SSLR segmentation.
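The checkerboard mixing operation can be sketched as a 2D grid parity test over the ground plane, assuming point clouds stored as (N, C) arrays with x and y in the first two columns; the cell size and layout are illustrative, not the paper's settings.

```python
import numpy as np

def checkerboard_mix(pc_a, pc_b, cell_size=5.0):
    """Mix two point clouds with a 2D checkerboard over the ground plane.

    pc_a, pc_b: (N, C) arrays whose first two columns are x, y (meters).
    Cells of even parity keep points from cloud A, odd parity from cloud B,
    preserving local spatial consistency inside each cell.
    """
    def parity(pc):
        ij = np.floor(pc[:, :2] / cell_size).astype(int)
        return (ij[:, 0] + ij[:, 1]) % 2
    return np.concatenate([pc_a[parity(pc_a) == 0],
                           pc_b[parity(pc_b) == 1]], axis=0)
```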
|
| |
| 15:00-16:30, Paper ThI2I.66 | Add to My Program |
| Occlusion-Aware Consistent Model Predictive Control for Robot Navigation in Occluded Obstacle-Dense Environments |
|
| Zheng, Minzhe | The Hong Kong University of Science and Technology (Guangzhou) |
| Zheng, Lei | National University of Singapore |
| Zhu, Lei | The Hong Kong University of Science and Technology (Guangzhou) |
| Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Integrated Planning and Control, Planning under Uncertainty, Constrained Motion Planning
Abstract: Ensuring safety and motion consistency for robot navigation in occluded, obstacle-dense environments is a critical challenge. In this context, this study presents an occlusion-aware Consistent Model Predictive Control (CMPC) strategy. To account for the occluded obstacles, it incorporates adjustable risk regions that represent their potential future locations. Subsequently, dynamic risk boundary constraints are developed online to ensure safety. The CMPC then constructs multiple locally optimal trajectory branches (each tailored to different risk regions) to strike a balance between safety and performance. A shared consensus segment is generated to ensure smooth transitions between branches without significant velocity fluctuations, further preserving motion consistency. To facilitate high computational efficiency and ensure coordination across local trajectories, we use the alternating direction method of multipliers (ADMM) to decompose the CMPC into manageable sub-problems for parallel solving. The proposed strategy is validated through simulations and real-world experiments on an Ackermann-steering robot platform. The results demonstrate the effectiveness of the proposed CMPC strategy through comparisons with baseline approaches in occluded, obstacle-dense environments.
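For readers unfamiliar with the decomposition, a generic consensus-ADMM loop of the kind the CMPC builds on looks as follows; the branch subproblem interface is a simplifying assumption, not the authors' exact formulation.

```python
import numpy as np

def consensus_admm(branch_solvers, x_init, rho=1.0, iters=50):
    """Generic consensus ADMM over K trajectory branches.

    branch_solvers: list of callables; solver(v, rho) returns the branch's
                    subproblem minimizer given the consensus target v.
    x_init:         initial trajectory guess (flat numpy array)
    """
    K = len(branch_solvers)
    x = [x_init.copy() for _ in range(K)]
    z = x_init.copy()                               # consensus variable
    u = [np.zeros_like(x_init) for _ in range(K)]   # scaled dual variables
    for _ in range(iters):
        for k in range(K):                          # branch updates are parallelizable
            x[k] = branch_solvers[k](z - u[k], rho)
        z = np.mean([x[k] + u[k] for k in range(K)], axis=0)  # averaging step
        for k in range(K):
            u[k] += x[k] - z                        # dual update
    return z
```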
|
| |
| 15:00-16:30, Paper ThI2I.67 | Add to My Program |
| Integrated Exploration and Sequential Manipulation on Scene Graph with LLM-Based Situated Replanning |
|
| Yang, Heqing | Beihang University |
| Jiao, Ziyuan | Beihang University |
| Wang, Shu | UCLA |
| Niu, Yida | Peking University |
| Liu, Si | Beihang University |
| Liu, Hangxin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Keywords: Service Robotics, Task Planning, Planning under Uncertainty
Abstract: In partially known environments, robots must combine exploration to gather information with task planning for efficient execution. To address this challenge, we propose EPoG, an Exploration-based sequential manipulation Planning framework on Graph-based representations. EPoG integrates a graph-based global planner with a Large Language Model (LLM)-based situated local planner, continuously updating a belief graph using observations and the LLM predictions to represent known and unknown objects. Action sequences are generated by computing graph edit operations between the goal and belief graphs, ordered by temporal dependencies and movement costs. This approach seamlessly combines exploration and sequential manipulation planning. In ablation studies across 46 realistic household scenes and 5 long-horizon daily object transportation tasks, EPoG achieved a success rate of 91.3%, reducing travel distance by 36.1% on average. Furthermore, a physical mobile manipulator successfully executed complex tasks in unknown and dynamic environments, demonstrating EPoG’s potential for real-world applications.
|
| |
| 15:00-16:30, Paper ThI2I.68 | Add to My Program |
| AI-IO: An Aerodynamics-Inspired Real-Time Inertial Odometry for Quadrotors |
|
| Cui, Jiahao | Shanghai Jiao Tong University |
| Yu, Feng | Shanghai Jiao Tong University |
| Zhang, Linzuo | Shanghai Jiao Tong University |
| Hu, Yu | Shanghai Jiao Tong University |
| Zou, Danping | Shanghai Jiao Tong University |
Keywords: Aerial Systems: Perception and Autonomy, Localization, Deep Learning Methods
Abstract: Inertial Odometry (IO) has gained attention in quadrotor applications due to its sole reliance on inertial measurement units (IMUs), attributed to its lightweight design, low cost, and robust performance across diverse environments. However, most existing learning-based inertial odometry systems for quadrotors either use only IMU data or include additional dynamics-related inputs such as thrust, but still lack a principled formulation of the underlying physical model to be learned. This lack of interpretability hampers the model's ability to generalize and often limits its accuracy. In this work, we approach the inertial odometry learning problem from a different perspective. Inspired by the aerodynamics model and the IMU measurement model, we identify the key physical quantity required for inertial odometry—rotor speed measurements—and design a transformer-based inertial odometry model. By incorporating rotor speed measurements, the proposed model improves velocity prediction accuracy by 36.9%. Furthermore, the transformer architecture more effectively exploits temporal dependencies for denoising and aerodynamic modeling, yielding an additional 22.4% accuracy gain over previous results. To support evaluation, we also provide a real-world quadrotor flight dataset capturing IMU measurements and rotor speeds during high-speed motion. Finally, combined with an uncertainty-aware extended Kalman filter (EKF), our framework is validated across multiple datasets and real-time systems, demonstrating superior accuracy, generalization, and real-time performance. We share the code and data to promote further research (https://github.com/SJTU-ViSYS-team/AI-IO).
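As a sketch of how a learned velocity estimate with predicted uncertainty can be fused in an EKF, the standard measurement update below assumes a state whose velocity occupies components 3 to 5; this is illustrative, not the AI-IO filter.

```python
import numpy as np

def ekf_velocity_update(x, P, v_pred, R_pred):
    """Fuse a network velocity estimate (with predicted covariance) into an EKF.

    x:      (n,) state, assumed layout [position(0:3), velocity(3:6), ...]
    P:      (n, n) state covariance
    v_pred: (3,) velocity predicted by the network
    R_pred: (3, 3) uncertainty predicted alongside it
    """
    n = x.size
    H = np.zeros((3, n)); H[:, 3:6] = np.eye(3)   # measurement picks out velocity
    y = v_pred - H @ x                            # innovation
    S = H @ P @ H.T + R_pred
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    x = x + K @ y
    P = (np.eye(n) - K @ H) @ P
    return x, P
```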
|
| |
| 15:00-16:30, Paper ThI2I.69 | Add to My Program |
| The Better You Learn, the Smarter You Prune: Towards Efficient Vision-Language-Action Models Via Differentiable Token Pruning |
|
| Jiang, Titong | Tsinghua University |
| Jiang, Xuefeng | Institute of Computing Technology, Chinese Academy of Sciences |
| Ma, Yuan | Li Auto |
| Wen, Xin | Li Auto |
| Li, Bailin | Li Auto |
| Zhan, Kun | LiAuto |
| Jia, Peng | Li Auto |
| Liu, Yahui | Tsinghua University |
| Sun, Sheng | Institute of Computing Technology, Chinese Academy of Sciences |
| Lang, Xianpeng | LiAuto |
Keywords: Deep Learning Methods, Machine Learning for Robot Control, Imitation Learning
Abstract: We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: it generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic “magic numbers” and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. We also investigate LightVLA*, a training-aware token pruning variant with additional trainable parameters, which likewise achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, it spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the dual goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems. Project site: https://liauto-research.github.io/LightVLA.
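The differentiable selection step can be sketched with PyTorch's Gumbel-softmax: per-token keep/drop logits are relaxed into (near-)one-hot decisions that still pass gradients. The two-logit construction below is an assumption for illustration, not the LightVLA implementation.

```python
import torch
import torch.nn.functional as F

def gumbel_token_keep_mask(token_scores, tau=1.0, hard=True):
    """Differentiable keep/drop decision per visual token.

    token_scores: (B, N) importance logits (e.g., from dynamic queries).
    Returns a (B, N) mask that is one-hot-like when hard=True but still
    passes gradients through the underlying softmax (straight-through).
    """
    # Pair each score with a zero 'drop' logit so Gumbel-softmax chooses keep vs drop.
    logits = torch.stack([token_scores, torch.zeros_like(token_scores)], dim=-1)
    sel = F.gumbel_softmax(logits, tau=tau, hard=hard)  # (B, N, 2)
    return sel[..., 0]  # 1.0 for kept tokens
```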
|
| |
| 15:00-16:30, Paper ThI2I.70 | Add to My Program |
| Real-Time BEVFormer: Fast Transformer-Based BEV Perception Network on Edge Device |
|
| Yang, Juyoung | Hyundai Motor Company |
| Baek, Seoha | Hyundai Motor Company |
| Seo, Eunbin | Hyundai Motor Company |
| Jeon, Wonseok | Hyundai Motor Company |
| Kim, Doyeon | Hyundai Motors Company |
| Kim, Jongsun | Hyundai Motor Company |
| Nah, Heeyeon | Hyundai Motor Company |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Vision-Based Navigation
Abstract: The development of camera-based real-time 3D perception networks for edge devices is essential for embodied systems such as autonomous vehicles and robots. However, existing methods often demand substantial computational resources and tend to overlook performance on resource-constrained devices. In this paper, we propose RT-BEVFormer, a simple yet effective multi-task 3D perception framework designed for efficiency. Based on BEVFormer, RT-BEVFormer enhances the feature extraction capability of the backbone and redesigns the spatial cross-attention module in the encoder, guided by two key observations: 1) the computational load and total number of parameters are dominated by the backbone, and 2) the sampling process within the deformable attention module is a primary bottleneck. Specifically, we leverage powerful foundation models to distill their rich and comprehensive knowledge, thereby crafting a highly efficient student backbone. This allows RT-BEVFormer to achieve significant performance gains without incurring additional latency. Furthermore, we introduce an efficient static sampling method that replaces the dynamic and deployment-unfriendly sampling of standard spatial cross-attention, allowing the model to focus on salient image features with minimal overhead. On the widely used edge device NVIDIA Jetson Orin, RT-BEVFormer outperforms the previous state-of-the-art model in both accuracy and inference speed. Extensive experiments on the nuScenes dataset show that each component of our framework is effective in both inference speed and overall accuracy. Finally, as RT-BEVFormer is implemented without any model-specific custom plugin, it ensures superior flexibility and ease of deployment.
|
| |
| 15:00-16:30, Paper ThI2I.71 | Add to My Program |
| SafeFlowMPC: Predictive and Safe Trajectory Planning for Robot Manipulators with Learning-Based Policies |
|
| Oelerich, Thies | TU Wien |
| Ebmer, Gerald | TU Wien |
| Hartl-Nesic, Christian | TU Wien |
| Kugi, Andreas | TU Wien |
Keywords: Deep Learning in Grasping and Manipulation, Robot Safety, Learning from Demonstration
Abstract: The emerging integration of robots into everyday life brings several major challenges. Compared to classical industrial applications, more flexibility is needed in combination with real-time reactivity. Learning-based methods can train powerful policies from demonstrated trajectories, such that the robot generalizes a task to similar situations. However, these black-box models lack interpretability and rigorous safety guarantees. Optimization-based methods provide these guarantees but lack the required flexibility and generalization capabilities. This work proposes SafeFlowMPC, a combination of flow matching and online optimization that unites the strengths of learning and optimization. The method guarantees safety at all times and is designed to meet the demands of real-time execution by using a suboptimal model-predictive control formulation. SafeFlowMPC achieves strong performance in three real-world experiments on a KUKA 7-DoF manipulator, namely two grasping experiments and a dynamic human-robot object handover experiment. A video of the experiments is available at https://ww.acin.tuwien.ac.at/en/42d6. The code is available at https://github.com/TU-Wien-ACIN-CDS/SafeFlowMPC.
|
| |
| 15:00-16:30, Paper ThI2I.72 | Add to My Program |
| Deep Visual Odometry for Stereo Event Cameras |
|
| Zhong, Sheng | Hunan University |
| Niu, Junkai | Hunan University |
| Zhou, Yi | Hunan University |
Keywords: SLAM, Localization, Deep Learning for Visual Perception
Abstract: Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle state estimation tasks involving motion blur and high dynamic range (HDR) illumination conditions. However, the versatility of event-based visual odometry (VO) relying on handcrafted data association (either direct or indirect methods) is still unreliable, especially in field robot applications under low-light HDR conditions, where the dynamic range can be enormous and the signal-to-noise ratio is spatially-and-temporally varying. Leveraging deep neural networks offers new possibilities for overcoming these challenges. In this paper, we propose a learning-based stereo event visual odometry. Building upon Deep Event Visual Odometry (DEVO), our system (called Stereo-DEVO) introduces a novel and efficient static-stereo association strategy for sparse depth estimation with almost no additional computational burden. By integrating it into a tightly coupled bundle adjustment (BA) optimization scheme, and benefiting from the recurrent network’s ability to perform accurate optical flow estimation through voxel-based event representations to establish reliable patch associations, our system achieves high-precision pose estimation in metric scale. In contrast to the offline performance of DEVO, our system can process event data of Video Graphics Array (VGA) resolution in real time. Extensive evaluations on multiple public real-world datasets and self-collected data justify our system’s versatility, demonstrating superior performance compared to state-of-the-art event-based VO methods. More importantly, our system achieves stable pose estimation even in large-scale nighttime HDR scenarios.
|
| |
| 15:00-16:30, Paper ThI2I.73 | Add to My Program |
| Robust and Real-Time Surface Normal Estimation from Stereo Disparities Using Affine Transformations |
|
| Faisal, Muhammad Rafi | Eötvös Loránd University |
| Karikó, Csongor Csanád | Eötvös Loránd University |
| Hajder, Levente | Eötvös Loránd University |
Keywords: Computational Geometry, Calibration and Identification, Computer Vision for Automation
Abstract: This work introduces a novel method for surface normal estimation from rectified stereo image pairs, leveraging affine transformations derived from disparity values to achieve fast and accurate results. We demonstrate how the rectification of stereo image pairs simplifies the process of surface normal estimation by reducing computational complexity. To address noise reduction, we develop a custom algorithm inspired by convolutional operations, tailored to process disparity data efficiently. We also introduce adaptive heuristic techniques for efficiently detecting connected surface components within the images, further improving the robustness of the method. By integrating these methods, we construct a surface normal estimator that is both fast and accurate, producing a dense, oriented point cloud as the final output. Our method is validated using both simulated environments and real-world stereo images from the Middlebury and Cityscapes datasets (disparity values are published for the Middlebury datasets), demonstrating significant improvements in real-time performance and accuracy when implemented on a GPU. The source code is available at https://github.com/mrafifaisal/Surface-Normal-Estimation/.
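A simplified version of the depth-then-normal pipeline is sketched below: depth follows from the rectified-stereo relation Z = f·B/d, and normals from local depth gradients. The paper's affine-transformation derivation is more exact; this gradient approximation is only illustrative.

```python
import numpy as np

def normals_from_disparity(disp, focal, baseline):
    """Depth from rectified disparity (Z = f*B/d), normals from depth gradients.

    disp:     (H, W) disparity map in pixels (0 where invalid)
    focal:    focal length in pixels; baseline: stereo baseline in meters
    Returns an (H, W, 3) unit-normal map (NaN where disparity is invalid).
    """
    Z = np.where(disp > 0, focal * baseline / np.maximum(disp, 1e-6), np.nan)
    dzdx = np.gradient(Z, axis=1)   # finite-difference tangents
    dzdy = np.gradient(Z, axis=0)
    n = np.dstack([-dzdx, -dzdy, np.ones_like(Z)])
    return n / np.linalg.norm(n, axis=2, keepdims=True)
```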
|
| |
| 15:00-16:30, Paper ThI2I.74 | Add to My Program |
| WATCHDOG: Autonomous Elderly Assistance Via Attention-Based Fall Detection and Trajectory Prediction |
|
| Longo, Antonello | Polytechnic University of Bari |
| Bono, Annaclaudia | Polytechnic University of Bari, CNR |
| Guaragnella, Giovanna | CNR |
| Boccadoro, Pietro | Polytechnic University of Bari |
| Petitti, Antonio | Consiglio Nazionale Delle Ricerche (CNR) |
| Rana, Arianna | Consiglio Nazionale Delle Ricerche (CNR) |
| D'Orazio, Tiziana | CNR |
Keywords: Human Detection and Tracking, Deep Learning Methods, Surveillance Robotic Systems
Abstract: Service robots designed to assist elderly people are receiving significant attention since they can improve their quality of life, promote their independence, and provide daily support. These mobile platforms can observe people moving around their homes and promptly detect dangerous events. This paper introduces a novel framework to perform fall detection and people following on board an autonomous legged robotic platform. The system operates on the Unitree Go2 robot and comprises two main building blocks. The first component consists of a Body Landmarks extractor and a Transformer-based network that performs binary classification, distinguishing between Fall behaviours and Activities of Daily Living (ADL). The second component is a target-driven path planner that enables the robot to follow and maintain a full-body view of the target in complex environments. Experiments on public datasets and comparisons with state-of-the-art works demonstrate the reliability of both blocks. Real experiments in a cluttered environment illustrate how the mobile platform is able to follow people moving around obstacles, detect falls in occluded areas, and predict people's trajectories to maintain a full-body view.
|
| |
| 15:00-16:30, Paper ThI2I.75 | Add to My Program |
| I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models |
|
| Grislain, Clémence | Sorbonne University |
| Rahimi, Hamed | Sorbonne University |
| Sigaud, Olivier | Sorbonne Université |
| Chetouani, Mohamed | Sorbonne University |
Keywords: Failure Detection and Recovery, Deep Learning in Grasping and Manipulation
Abstract: Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets targeting Semantic Misalignment Failure detection from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on fine-tuning a base VLM, followed by training lightweight classification heads, called FS blocks, attached to different internal layers of the VLM, whose predictions are aggregated using an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and effectively transfers to other simulation environments. These findings highlight that leveraging VLM representations at multiple levels, and training for semantic misalignment failure detection, are key to effective and generalizable robotic failure detection. The datasets and models are publicly released on HuggingFace.
|
| |
| 15:00-16:30, Paper ThI2I.76 | Add to My Program |
| Autonomous Rotating Cameras Boost 3D Wildlife MoCap Yield without Human Operators |
|
| Vally, Amaan | University of Cape Town |
| Joska, Daniel | University of Cape Town |
| Muramatsu, Naoya | University of Cape Town |
| Amayo, Paul | University of Cape Town |
| Patel, Amir | University College London |
Keywords: Robotics and Automation in Life Sciences, Sensor Fusion
Abstract: We present a low-cost, autonomous, rotating-camera system that increases the usable data yield for 3D markerless motion capture of animals in uncontrolled outdoor settings. A lightweight detector (YOLOv4-Tiny) locates the subject at 10 Hz; an Extended Kalman Filter bridges sparse detections to a 50 Hz full-state feedback (FSF) controller, keeping the subject centered without a human operator. The 3D reconstruction backend uses existing markerless 2D keypoints and Full Trajectory Estimation (FTE) with a simple rotation compensation for moving cameras. On field videos of a running human and free-running cheetahs, the rotating cameras captured substantially more usable frames than fixed cameras: +52% for the human sequence (6593 vs. 4333 frames) and +135% across cheetah sequences (2419 vs. 1031 frames). Centering also shifts subject pixel distribution toward the image center, which theoretically lowers 2D keypoint error and thus 3D reprojection error for any pose-estimation backend. We detail the EKF design for sparse/noisy detections, the FSF controller with an integral state, and practical deployment considerations. Results show autonomous centering is a simple, deployable lever to scale outdoor animal mocap without changing downstream reconstruction methods.
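A minimal constant-velocity EKF that bridges sparse 10 Hz detections to a 50 Hz control loop might look like the following; the image-space state layout and the noise values are assumptions for illustration, not the deployed tuning.

```python
import numpy as np

class PixelTargetEKF:
    """Constant-velocity filter bridging sparse detections to a fast controller.

    State x = [u, v, du, dv]: target pixel position and velocity. predict()
    runs at the 50 Hz control rate; update() runs whenever a detection arrives.
    """
    def __init__(self, dt=0.02, q=5.0, r=10.0):
        self.x = np.zeros(4)
        self.P = np.eye(4) * 100.0
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt  # constant velocity
        self.Q = np.eye(4) * q      # process noise (assumed)
        self.H = np.eye(2, 4)       # detections observe [u, v] only
        self.R = np.eye(2) * r      # detection noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]           # predicted pixel position for the controller

    def update(self, detection_uv):
        y = detection_uv - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```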
|
| |
| 15:00-16:30, Paper ThI2I.77 | Add to My Program |
| Point Cloud Segmentation for Autonomous Clip Positioning in Laparoscopic Cholecystectomy on a Phantom |
|
| Gyenes, Balazs | Karlsruhe Institute of Technology |
| Franke, Nikolai | Karlsruhe Institute of Technology |
| Scheikl, Paul Maria | None |
| Henrich, Pit | FAU Erlangen-Nürnberg, Germany |
| Younis, Rayan | University Hospital and Medical Faculty Carl Gustav Carus, TU Dresden |
| Neumann, Gerhard | Karlsruhe Institute of Technology |
| Wagner, Martin | University Hospital and Faculty of Medicine Carl Gustav Carus at TUD Dresden University of Technology |
| Mathis-Ullrich, Franziska | Friedrich-Alexander-University Erlangen-Nurnberg (FAU) |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy, Transfer Learning
Abstract: High-risk applications in robotics, such as robot-assisted surgery, present unique challenges. These systems must be both highly precise and interpretable in order to be deployed in environments with very low tolerance for error or unsafe exploration. We present the first robotic system to demonstrate autonomous clip positioning on a physical phantom in laparoscopic surgery, one of the most common interventions in general surgery. After segmentation of a colorless point cloud from a single camera, target positions for the clips are extracted using spline interpolation, and can then be adjusted by the human operator. The segmentation model is trained on only 60 hand-labeled real point clouds, reflecting data scarcity in the surgical domain. We overcome this with a combination of pre-training on 128,000 synthetic point clouds and two novel data augmentation techniques. The motion of the end-effector to each target is visualized for the operator, satisfying the unique motion constraints of minimally-invasive surgery while ensuring that the robot's actions are verifiable and interpretable. In real robot experiments, our system localizes targets with the required precision of 0.75 mm at a 95% success rate and executes autonomous clip positioning with a 100% success rate. We provide insights that are applicable to many other surgical and non-surgical tasks that require identifying and navigating to a precise target. Our source code is available at https://github.com/balazsgyenes/kirurc
|
| |
| 15:00-16:30, Paper ThI2I.78 | Add to My Program |
| Tuning ROS 2 for Energy-Efficient Navigation: Empirical Insights from Costmap 2D Configurations |
|
| Albonico, Michel | UTFPR, Brazil/SDU, Denmark |
| Wortmann, Andreas | University of Stuttgart |
| Malavolta, Ivano | Vrije Universiteit Amsterdam |
Keywords: Software, Middleware and Programming Environments, Autonomous Vehicle Navigation, Motion and Path Planning
Abstract: Robots are increasingly used in diverse application areas, where autonomous navigation plays a central role. As these systems become more widespread, improving their energy efficiency is critical to extending operational time and reducing environmental impact. The Robot Operating System (ROS) is a widely adopted middleware for robotics, offering a rich set of configurable packages. However, this flexibility can result in suboptimal software configurations in dynamic environments, negatively affecting both performance and energy consumption. This paper investigates the impact of ROS 2 package re-configurations on the energy efficiency of mobile robot navigation. We conduct a controlled experiment in two warehouse-like scenarios (small and large) with varying obstacle layouts and Costmap 2D configurations (essential to the Nav2 stack). Through repeated trials, we measure energy usage, power profile, CPU load, memory consumption, and navigation performance. Results show that configurations must be carefully chosen for the specific robotic environment, and we were able to identify critical settings that lead to good and poor performance and energy consumption.
|
| |
| 15:00-16:30, Paper ThI2I.79 | Add to My Program |
| MAGNIFIED: RL Fine-Tuning of Multimodal Large Language Models for Motion Planning |
|
| Chen, Letian | Waymo |
| Lu, Yiren | Waymo |
| Fu, Justin | Waymo |
| Xie, Yichen | Waymo |
| Xu, Runsheng | Waymo |
| Hwang, Jyh-Jing | Waymo |
| Sapp, Benjamin | Waymo |
| Anguelov, Dragomir | Waymo |
Keywords: Motion and Path Planning, Reinforcement Learning, Autonomous Vehicle Navigation
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token prediction objective merely encourages per-token imitation in text, often irrespective of multi-step consequences and the alignment with crucial planning considerations such as giving space to other road actors. To overcome these limitations, we propose a reinforcement learning fine-tuning (RLFT) approach, MAGNIFIED, that aligns the MLLM-based driving agent with planning objectives by learning from token-level rewards. By mapping a sequence of predicted tokens to corresponding vehicle trajectories and learning from planning rewards, MAGNIFIED optimizes for the true planning objectives rather than focusing solely on token prediction accuracy, enabling the model to refine its understanding of the planning task beyond simple imitation. We validate our approach on the Waymo Open Motion Dataset with a novel setup incorporating rasterized bird's-eye views and tokenized trajectories as inputs and planning-oriented outputs. An initial SFT phase establishes a strong baseline in outputting plan trajectories as sequences of X-Y coordinates in text, while subsequent RL fine-tuning substantially enhances planning performance relative to the SFT baseline (demonstrating over a 10.5% reduction in overlap rate and a 38.9% reduction in off-road rate), underscoring the potential of RLFT on MLLMs to achieve vehicle planning that is better aligned with compliant, comfortable, and efficient driving.
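The token-level policy-gradient idea can be sketched as a REINFORCE-style loss over decoded trajectory tokens; the abstract does not specify the exact RLFT algorithm, so the reward shaping and baseline below are illustrative assumptions.

```python
import torch

def token_level_pg_loss(token_logprobs, token_rewards):
    """REINFORCE-style loss with token-level rewards and a batch baseline.

    token_logprobs: (B, T) log-probabilities of the sampled trajectory tokens
    token_rewards:  (B, T) per-token planning rewards (e.g., overlap/off-road)
    """
    # Reward-to-go: cumulative future reward at each token position.
    returns = torch.flip(torch.cumsum(torch.flip(token_rewards, dims=[1]), dim=1),
                         dims=[1])
    baseline = returns.mean(dim=0, keepdim=True)   # simple variance reduction
    advantage = (returns - baseline).detach()
    return -(token_logprobs * advantage).mean()
```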
|
| |
| 15:00-16:30, Paper ThI2I.80 | Add to My Program |
| DyRef: Dynamic Reflection Framework Via Graph-Based Complexity for Robotic Planning |
|
| Zhang, Jiatao | Zhejiang University |
| Liang, QingMiao | University of Chinese Academy of Sciences |
| Hu, Tuocheng | Zhejiang University |
| Song, Yufan | Zhejiang University |
| Song, Wei | Zhejiang Lab |
| Zhu, Shiqiang | Zhejiang University |
Keywords: Agent-Based Systems, AI-Based Methods, Integrated Planning and Learning
Abstract: Robotic planning tasks often involve diverse complexities, which make adaptive improvement through reflection particularly challenging. Existing LLM-based approaches typically rely on fixed routines, lacking the ability to adjust to task-specific complexity and often leading to redundant reflections. To address this, we propose DyRef, a dynamic reflection framework that models tasks as a Diagnostic Graph, measures task complexity through structural factors, and routes them through a Reflection Toolkit via a learned Routing Policy network. This design enables tailored reflection strategies that reduce redundancy and improve reasoning efficiency. Experiments in AlfWorld and on real-world robotic platforms show that DyRef improves first trial success rates by 16.1%, while reducing redundant reflections by 64.4%.
|
| |
| 15:00-16:30, Paper ThI2I.81 | Add to My Program |
| Symmetry-Aware Fusion of Vision and Tactile Sensing Via Bilateral Force Priors for Robotic Manipulation |
|
| Lee, Wonju | Analog Devices Inc |
| Grimaldi, Matteo | Analog Devices |
| Yu, Tao | Analog Devices Inc |
Keywords: Force and Tactile Sensing, Sensor Fusion, Deep Learning in Grasping and Manipulation
Abstract: Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naive visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naive and gated fusion baselines and closely matching the privileged “wrist + contact force” configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.
|
| |
| 15:00-16:30, Paper ThI2I.82 | Add to My Program |
| Galaxy Open-World Dataset and G0 Dual-System VLA Model |
|
| Jiang, Tao | Tsinghua |
| Yuan, Tianyuan | Tsinghua University |
| Liu, Yicheng | Tsinghua University |
| Lu, Chenhao | Tsinghua University |
| Cui, Jianning | Galaxea AI |
| Liu, Xiao | National University of Singapore |
| Cheng, Shuiqi | The University of Hong Kong |
| Gao, Jiyang | Galaxea.ai |
| Xu, Huazhe | Tsinghua University |
| Zhao, Hang | Tsinghua University |
Keywords: AI-Enabled Robotics, Data Sets for Robot Learning, Learning from Demonstration
Abstract: We present Galaxy Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark—spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation—demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxy Open-World Dataset, plays a critical role in achieving strong performance. Dataset, code and pretrained weights will be made publicly available.
|
| |
| 15:00-16:30, Paper ThI2I.83 | Add to My Program |
| FAR-Dex: Few-Shot Data Augmentation and Adaptive Residual Policy Refinement for Dexterous Manipulation |
|
| Bai, Yushan | Institute of Automation, Chinese Academy of Sciences |
| Chen, Fulin | Shanghai University of Engineering Science |
| Sun, Hongzheng | Institute of Automation, Chinese Academy of Sciences |
| Tong, Yuchuang | The Institute of Automation of the Chinese Academy of Sciences |
| Li, En | Institute of Automation, Chinese Academy of Sciences |
| Zhang, Zhengtao | Institute of Automation, Chinese Academy of Sciences |
Keywords: Dexterous Manipulation, Multifingered Hands, Imitation Learning
Abstract: Achieving human-like dexterous manipulation through the collaboration of multi-fingered hands with robotic arms remains a longstanding challenge in robotics, primarily due to the scarcity of high-quality demonstrations and the complexity of high-dimensional action spaces. To address these challenges, we propose FAR-Dex, a hierarchical framework that integrates few-shot data augmentation with adaptive residual refinement to enable robust and precise arm–hand coordination in dexterous tasks. First, FAR-DexGen leverages the IsaacLab simulator to generate diverse and physically constrained trajectories from a few demonstrations, providing a data foundation for policy training. Second, FAR-DexRes introduces an adaptive residual module that refines policies by combining multi-step trajectory segments with observation features, thereby enhancing accuracy and robustness in manipulation scenarios. Experiments in both simulation and the real world demonstrate that FAR-Dex improves data quality by 13.4% and task success rates by 7% over state-of-the-art methods. It further achieves over 80% success in real-world tasks, enabling fine-grained dexterous manipulation with strong positional generalization.
|
| |
| 15:00-16:30, Paper ThI2I.84 | Add to My Program |
| UnIRe: Unsupervised Instance Decomposition for Dynamic Urban Scene Reconstruction |
|
| Mao, Yunxuan | Zhejiang University |
| Xiong, Rong | Zhejiang University |
| Wang, Yue | Zhejiang University |
| Liao, Yiyi | Zhejiang University |
Keywords: Computer Vision for Automation, Recognition
Abstract: Reconstructing and decomposing dynamic urban scenes is crucial for autonomous driving, urban planning, and scene editing. However, existing methods fail to perform instance-aware decomposition without manual annotations, which is crucial for instance-level scene editing. We propose UnIRe, a 3D Gaussian Splatting (3DGS) based approach that decomposes a scene into a static background and individual dynamic instances using only RGB images and LiDAR point clouds. At its core, we introduce 4D superpoints, a novel representation that clusters multi-frame LiDAR points in 4D space, enabling unsupervised instance separation based on spatiotemporal correlations. These 4D superpoints serve as the foundation for our decomposed 4D initialization, i.e., providing spatial and temporal initialization to train a dynamic 3DGS for arbitrary dynamic classes without requiring bounding boxes or object templates. Furthermore, we introduce a smoothness regularization strategy in both 2D and 3D space, further improving the temporal stability. Experiments on benchmark datasets show that our method outperforms existing methods in decomposed dynamic scene reconstruction while enabling accurate and flexible instance-level editing, making it a practical solution for real-world applications.
|
| |
| 15:00-16:30, Paper ThI2I.85 | Add to My Program |
| Leveraging Geometric Priors for Unaligned Scene Change Detection |
|
| Liu, Ziling | Southern University of Science and Technology |
| Chen, Ziwei | Southern University of Science and Technology |
| Gao, Mingqi | University of Sheffield |
| Yang, Jinyu | Harbin Institute of Technology, Shenzhen |
| Zheng, Feng | SUSTech |
Keywords: Computer Vision for Automation
Abstract: Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we introduce geometric priors for the first time to address the core challenges of unaligned SCD, for reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.
|
| |
| 15:00-16:30, Paper ThI2I.86 | Add to My Program |
| Re^3Sim: Generating High-Fidelity Simulation Data Via 3D-Photorealistic Real-To-Sim for Robotic Manipulation |
|
| Han, Xiaoshen | Shanghai Jiao Tong University, Shanghai AI Laboratory |
| Yu, Junqiu | Shanghai AI Laboratory |
| Liu, Minghuan | Shanghai Jiao Tong University |
| Chen, Yilun | Shanghai AI Laboratory |
| Lyu, Xiaoyang | The University of Hong Kong |
| Tian, Yang | Shanghai AI Laboratory |
| Wang, Bolun | Shanghai AI Laboratory |
| Zhang, Weinan | Shanghai Jiao Tong University |
| Pang, Jiangmiao | Shanghai AI Laboratory |
Keywords: Deep Learning in Grasping and Manipulation
Abstract: Real-world data collection for robotics is costly and resource-intensive, requiring skilled operators and expensive hardware. Simulations offer a scalable alternative but often fail to achieve sim-to-real generalization due to geometric and visual gaps. To address these challenges, we propose Re^3Sim, a 3D-photorealistic real-to-sim system that bridges the geometric and visual sim-to-real gaps. Re^3Sim employs advanced 3D reconstruction and rendering techniques to faithfully recreate real-world scenarios, enabling real-time rendering of simulated cross-view cameras within a physics-based simulator. By utilizing privileged information to collect expert demonstrations efficiently in simulation, and training robot policies with imitation learning, we validate the effectiveness of the real-to-sim-to-real system across various manipulation task scenarios. Notably, with only simulated data, we can achieve zero-shot sim-to-real transfer with an average success rate exceeding 58%. To push the limit of real-to-sim, we further generate a large-scale simulation dataset, demonstrating how a robust policy can be built from simulation data that generalizes across various objects.
|
| |
| 15:00-16:30, Paper ThI2I.87 | Add to My Program |
| Flexible and Foldable: Workspace Analysis and Object Manipulation Using a Soft, Interconnected, Origami-Inspired Actuator Array |
|
| Dacre, Bailey | IT University of Copenhagen |
| Moreno, Rodrigo | IT University of Copenhagen |
| Demirtas, Serhat | EPFL |
| Wang, Ziqiao | EPFL |
| Jiang, Yuhao | Ecole Polytechnique Federale De Lausanne |
| Paik, Jamie | Ecole Polytechnique Federale De Lausanne |
| Stoy, Kasper | IT University of Copenhagen |
| Faiña, Andres | IT University of Copenhagen |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Multi-Robot Systems
Abstract: Object manipulation is a fundamental challenge in robotics, where systems must balance trade-offs among manipulation capabilities, system complexity, and throughput. Distributed manipulator systems (DMS) use the coordinated motion of actuator arrays to perform complex object manipulation tasks, and have seen widespread exploration in the literature and in industry. However, existing DMS designs typically rely on high actuator densities and impose constraints on object-to-actuator scale ratios, limiting their adaptability. We present a novel DMS design utilizing an array of 3-DoF, origami-inspired robotic tiles interconnected by a compliant surface layer. Unlike conventional DMS, our approach enables manipulation not only at the actuator end effectors but also across a flexible surface connecting all actuators, creating a continuous, controllable manipulation surface. We analyse the combined workspace of such a system, derive simple motion primitives, and demonstrate its capability to translate simple geometric objects across an array of tiles. By leveraging the inter-tile connective material, our approach significantly reduces actuator density, increasing the area over which an object can be manipulated by approximately 1.84× without an increase in the number of actuators. This design offers a lower-cost, lower-complexity alternative to traditional high-density arrays and introduces new opportunities for manipulation strategies that leverage the flexibility of the interconnected surface.
|
| |
| 15:00-16:30, Paper ThI2I.88 | Add to My Program |
| Collaborative Quadruped Transportation in 3D Terrain with Constrained Diffusion |
|
| Jose, Williard Joshua | University of Massachusetts Amherst |
| Chen, Li | University of Massachusetts Amherst |
| Zhang, Hao | University of Massachusetts Amherst |
Keywords: Multi-Robot Systems
Abstract: Recently, multi-robot systems have gained significant attention for their promise of scalable efficiency, reliability, and cost savings. A crucial capability is collaborative transportation, where a team of robots works together to transport a payload. However, key challenges remain, such as potential conflicts between team-level decisions and individual-level robot controls, team kinematic constraints imposed by the robot-payload coupling, and diverse obstacles encountered in 3D terrain. We present Collaborative Quadruped Transportation with Constrained Diffusion (CQTD), enabling a team of closely coupled quadruped robots to collaboratively transport a payload across 3D terrain. A diffusion-based upper level learns terrain-aware team-level trajectories satisfying team kinematic constraints due to the payload coupling, while a lower level optimizes velocity controls of individual robots satisfying collision and anisotropic velocity constraints. Experiments in high-fidelity simulations and on real-world quadruped robot teams demonstrate that CQTD outperforms baseline methods in challenging 3D terrain scenarios requiring closely-coupled collaboration between the quadruped robots.
|
| |
| 15:00-16:30, Paper ThI2I.89 | Add to My Program |
| Distribution Estimation for Global Data Association Via Approximate Bayesian Inference |
|
| Jia, Yixuan | Massachusetts Institute of Technology |
| Peterson, Mason B. | Massachusetts Institute of Technology |
| Li, Qingyuan | Massachusetts Institute of Technology |
| Tian, Yulun | University of Michigan |
| How, Jonathan | Massachusetts Institute of Technology |
Keywords: Localization, Probability and Statistical Methods, RGB-D Perception
Abstract: Global data association is an essential prerequisite for robot operation in environments seen at different times or by different robots. Repetitive or symmetric data creates significant challenges for existing methods, which typically rely on maximum likelihood estimation or maximum consensus to produce a single set of associations. However, in these ambiguous scenarios, the distribution of solutions to global data association problems is often highly multimodal, and such single-solution approaches frequently fail. In this work, we introduce a data association framework that leverages approximate Bayesian inference to capture multiple solution modes to the data association problem, thereby avoiding premature commitment to a single solution under ambiguity. Our approach represents hypothetical solutions as particles that evolve via deterministic or randomized updates, naturally parallelizable on GPUs, to cover the modes of the underlying solution distribution. Simulated and real-world experiments with highly ambiguous data show that our method correctly estimates the distribution over transformations when registering point clouds or object maps. Code is available at: https://github.com/mit-acl/mmda.
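A toy illustration of why a particle set captures ambiguity where a single maximum-likelihood estimate fails: registering a 4-fold symmetric point set admits four equally good rotations, and independent randomized particle updates (a simple stand-in for the paper's update rules) settle onto all four modes instead of collapsing to one.

    import numpy as np

    def likelihood(theta, src, dst, sigma=0.1):
        """Score a 2-D rotation hypothesis by nearest-neighbour residuals."""
        c, s = np.cos(theta), np.sin(theta)
        moved = src @ np.array([[c, -s], [s, c]]).T
        d = np.linalg.norm(moved[:, None, :] - dst[None, :, :], axis=-1).min(axis=1)
        return np.exp(-0.5 * (d / sigma) ** 2).mean()

    def rotation_posterior(src, dst, n=500, iters=30, step=0.05):
        thetas = np.random.uniform(-np.pi, np.pi, n)       # particles over SO(2)
        for _ in range(iters):
            prop = thetas + step * np.random.randn(n)      # randomized update
            w_old = np.array([likelihood(t, src, dst) for t in thetas])
            w_new = np.array([likelihood(t, src, dst) for t in prop])
            thetas = np.where(w_new > w_old, prop, thetas) # keep improvements
        return thetas

    # A square has 4-fold symmetry: the particle set concentrates around four
    # rotation modes rather than a single (possibly wrong) estimate.
    square = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], float)
    modes = rotation_posterior(square, square)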
|
| |
| 15:00-16:30, Paper ThI2I.90 | Add to My Program |
| Boosting Zero-Shot VLN Via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-And-VisitInfo-Aware Prompting |
|
| Li, Boqi | University of Michigan, Ann Arbor |
| Li, Siyuan | University of Michigan |
| Wang, Weiyi | University of Michigan - Ann Arbor |
| Li, Anran | University of Michigan |
| Cao, Zhong | University of Michigan |
| Liu, Henry | University of Michigan, Ann Arbor |
Keywords: Vision-Based Navigation, Semantic Scene Understanding
Abstract: With the rapid progress of foundation models and robotics, vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We address VLN in continuous environments, a particularly challenging setting where an agent must jointly interpret natural language instructions, perceive its surroundings, and plan low-level actions. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). The predictor operates on an abstract obstacle map, producing linearly reachable waypoints, which are incorporated into a dynamically updated topological graph with explicit visitation records. The graph and visitation information are encoded into the prompt, enabling reasoning over both spatial structure and exploration history to encourage exploration and equip MLLM with local path planning for error correction. Extensive experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively, outperforming prior state-of-the-art methods. The source code is available at: https://github.com/michigan-traffic-lab/OMAP-VLN.
|
| |
| 15:00-16:30, Paper ThI2I.91 | Add to My Program |
| PLOP: Particle Filtering for Learning Object Physics from Robot Interaction Videos |
|
| Nan, Junyu | Carnegie Mellon University |
| Zakharov, Sergey | Toyota Research Institute |
| Kitani, Kris | Carnegie Mellon University |
Keywords: Computer Vision for Automation, Visual Learning
Abstract: Learning the dynamics of deformable objects, such as dough or a sponge, from RGB-D videos is challenging due to insufficient visual cues and complex deformations. We introduce PLOP (Particle Filtering for Learning Object Physics), a novel framework to learn the dynamics model of deformable objects using a particle filter over 3D Gaussians. Our method learns (1) a dynamics function to predict the object state at the next time step and (2) a resampling function to split and merge Gaussians to handle complex deformations such as cutting. Within PLOP, we propose I2N (Implicit Particle Interaction Network), a dynamics model that leverages a mixed particle-grid representation inspired by the Material Point Method (MPM). By transferring particle features to grid nodes, solving for grid dynamics, and then projecting solutions back to particles, our approach avoids explicit pairwise interaction reasoning between particles and significantly reduces computational cost when the number of particles is large. While PLOP is applicable to general robot-object interactions, we evaluate it on cutting sequences in both simulation and the real world, which induce challenging topological changes and expose previously occluded surfaces. On these benchmarks, PLOP achieves a 53.15% improvement in 3D reconstruction accuracy and a 6.84% improvement in 2D reconstruction accuracy on the simulation benchmark, as well as 28.41% and 24.45% improvements in 3D and 2D reconstruction metrics, respectively, on the real-world dataset.
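A minimal sketch of the particle-grid pattern the abstract describes for I2N: scatter particle features to grid nodes, apply a cheap grid-space operation (a placeholder for the learned grid dynamics), and gather results back to the particles, avoiding O(n²) pairwise particle reasoning. All names and the grid update below are illustrative assumptions.

    import numpy as np

    def particle_grid_step(pos, feat, res=16, smooth=0.5):
        """pos: (n, 2) particle positions in [0, 1); feat: (n, d) features."""
        idx = np.clip((pos * res).astype(int), 0, res - 1)      # nearest node
        grid = np.zeros((res, res, feat.shape[1]))
        cnt = np.zeros((res, res, 1))
        np.add.at(grid, (idx[:, 0], idx[:, 1]), feat)           # P2G scatter
        np.add.at(cnt, (idx[:, 0], idx[:, 1]), 1.0)
        grid /= np.maximum(cnt, 1.0)
        blurred = grid
        for ax in (0, 1):                                       # placeholder for
            blurred = (np.roll(blurred, 1, ax) + blurred        # the learned grid
                       + np.roll(blurred, -1, ax)) / 3.0        # dynamics solve
        grid = (1.0 - smooth) * grid + smooth * blurred
        return grid[idx[:, 0], idx[:, 1]]                       # G2P gather

    feats = particle_grid_step(np.random.rand(1000, 2), np.random.randn(1000, 8))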
|
| |
| 15:00-16:30, Paper ThI2I.92 | Add to My Program |
| Reducing Oracle Feedback with Vision Language Embeddings for Preference Based RL |
|
| Ghosh, Udita | University of California Riverside |
| Raychaudhuri, Dripta | AWS |
| Li, Jiachen | University of California, Riverside |
| Karydis, Konstantinos | University of California, Riverside |
| Roy-Chowdhury, Amit | University of California, Riverside |
Keywords: Deep Learning Methods, Deep Learning for Visual Perception
Abstract: Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce ROVED, a framework that integrates Vision-Language Models (VLMs) with selective human feedback to significantly reduce annotation requirements while maintaining performance. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Additionally, we adapt VLMs using a self-supervised inverse dynamics loss to improve alignment with evolving policies. Experiments on Meta-World manipulation tasks demonstrate that ROVED achieves comparable or superior success rates to state-of-the-art methods while using up to 3x fewer human annotations. Furthermore, we show that adapted VLMs enable efficient knowledge transfer across tasks, further minimizing feedback needs. Our results highlight the potential of combining VLMs with selective human supervision to make preference-based RL more scalable and practical.
|
| |
| 15:00-16:30, Paper ThI2I.93 | Add to My Program |
| Trajectory Planning for UAV-Based Smart Farming Using Imitation-Based Triple Deep Q-Learning |
|
| Mao, Wencan | National Institute of Informatics |
| Zhou, Quanxi | The University of Tokyo |
| Couso Coddou, Tomás | Pontificia Universidad Católica De Chile |
| Tsukada, Manabu | The University of Tokyo |
| Yunling, Liu | China Agricultural University |
| Ji, Yusheng | National Institute of Informatics |
Keywords: Reinforcement Learning, Path Planning for Multiple Mobile Robots or Agents, Aerial Systems: Applications
Abstract: Unmanned aerial vehicles (UAVs) have emerged as a promising auxiliary platform for smart agriculture, capable of simultaneously performing weed detection, recognition, and data collection from wireless sensors. However, trajectory planning for UAV-based smart agriculture is challenging due to the high uncertainty of the environment, partial observations, and limited battery capacity of UAVs. To address these issues, we formulate the trajectory planning problem as a Markov decision process (MDP) and leverage multi-agent reinforcement learning (MARL) to solve it. Furthermore, we propose a novel imitation-based triple deep Q-network (ITDQN) algorithm, which employs an elite imitation mechanism to reduce exploration costs and utilizes a mediator Q-network over a double deep Q-network (DDQN) to accelerate and stabilize training and improve performance. Experimental results in both simulated and real-world environments demonstrate the effectiveness of our solution. Moreover, our proposed ITDQN outperforms DDQN by 4.43% in weed recognition rate and 6.94% in data collection rate.
|
| |
| 15:00-16:30, Paper ThI2I.94 | Add to My Program |
| Conflict-Based Search As a Protocol: A Multi-Agent Motion Planning Protocol for Heterogeneous Agents, Solvers, and Independent Tasks |
|
| Veerapaneni, Rishi | Carnegie Mellon University |
| Tang, Ho Kwan Alvin | Carnegie Mellon University |
| Cen, Yidai | Carnegie Mellon University |
| He, Haodong | Tongji University |
| Zhao, Sophia | Carnegie Mellon University |
| Shah, Viraj | Carnegie Mellon University |
| Ji, Ziteng | University of California, Berkeley |
| Olin, Gabriel | Carnegie Mellon University |
| Arrizabalaga, Jon | Massachusetts Institute of Technology (MIT) |
| Shaoul, Yorai | Carnegie Mellon University |
| Li, Jiaoyang | Carnegie Mellon University |
| Likhachev, Maxim | Carnegie Mellon University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning
Abstract: Imagine the future construction site, hospital, or office with dozens of robots bought from different manufacturers. How can we enable these different systems to effectively move in a shared environment, given that each robot may have its own independent motion planning system? This work shows how to achieve efficient collision-free movement between algorithmically heterogeneous agents by using Conflict-Based Search (Sharon et al. 2015) as a protocol. At its core, the CBS Protocol requires one specific single-agent motion planning API: finding a collision-free path that satisfies certain space-time constraints. Given such an API, CBS uses a central planner to find collision-free paths, independent of how the API is implemented. We demonstrate how this protocol enables multi-agent motion planning for a heterogeneous team of agents completing independent tasks with a variety of single-agent planners, including heuristic search (e.g., A*), sampling-based search (e.g., RRT), optimization (e.g., direct collocation), diffusion, and reinforcement learning.
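The protocol can be summarized as the single-agent API plus the CBS high-level search. The sketch below is a minimal discrete-time variant that handles only vertex conflicts; plan_fn is the opaque per-agent planner, which could wrap A*, an RRT, a trajectory optimizer, a diffusion model, or an RL policy — that opacity is precisely the point of the protocol. Names are illustrative, and the unconstrained root problems are assumed solvable.

    import heapq
    from itertools import count

    def first_conflict(paths):
        """First vertex conflict (t, a1, a2, v) across timed paths, else None."""
        horizon = max(len(p) for p in paths.values())
        for t in range(horizon):
            seen = {}
            for a, p in paths.items():
                v = p[min(t, len(p) - 1)]          # agents wait at their goals
                if v in seen:
                    return t, seen[v], a, v
                seen[v] = a
        return None

    def cbs(plan_fn, agents):
        """plan_fn(agent, constraints) -> timed path or None; constraints is a
        set of forbidden (vertex, t) pairs. Edge conflicts are omitted here."""
        tie = count()
        cons = {a: frozenset() for a in agents}
        paths = {a: plan_fn(a, cons[a]) for a in agents}
        open_list = [(sum(map(len, paths.values())), next(tie), cons, paths)]
        while open_list:
            _, _, cons, paths = heapq.heappop(open_list)
            c = first_conflict(paths)
            if c is None:
                return paths                        # collision-free joint plan
            t, a1, a2, v = c
            for a in (a1, a2):                      # branch: constrain either agent
                nc, npaths = dict(cons), dict(paths)
                nc[a] = cons[a] | {(v, t)}
                npaths[a] = plan_fn(a, nc[a])
                if npaths[a] is not None:
                    heapq.heappush(open_list, (sum(map(len, npaths.values())),
                                               next(tie), nc, npaths))
        return None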
|
| |
| 15:00-16:30, Paper ThI2I.95 | Add to My Program |
| Autonomous Aerial Vehicle Carrier and MAV Collaboration: System Design, Trajectory Optimization, and Real-World Implementation |
|
| Shen, Zhipeng | The Hong Kong Polytechnic University |
| Zhou, Guanzhong | The Hong Kong Polytechnic University |
| Sun, Peng | University of Electronic Science and Technology of China |
| Lan, Bowen | The Hong Kong Polytechnic University |
| Meng, Qingyang | The Hong Kong Polytechnic University |
| Cao, Hao | The Hong Kong Polytechnic University |
| Zhou, Shiyu | City University of Hong Kong |
| Shao, Jinliang | University of Electronic Science and Technology of China, Chengdu |
| Huang, Hailong | The Hong Kong Polytechnic University |
|
|
| |
| 15:00-16:30, Paper ThI2I.96 | Add to My Program |
| A Hybrid Optimization Framework for Grasp Synthesis under Partial Observations |
|
| Zhang, Wenzheng | The University of Sydney |
| Afzal Maken, Fahira | Data61, CSIRO |
| Lai, Tin | University of Sydney |
| Ramos, Fabio | University of Sydney, NVIDIA |
Keywords: Grasping, Deep Learning in Grasping and Manipulation, Probabilistic Inference
Abstract: We propose a hybrid grasp synthesis framework that combines a learning-based Energy-Based Model (EBM) with an analytical Iterative Closest Point (ICP) method to generate robust grasps from partially observed point clouds. The learned energy function acts as a prior within a Stein Variational Gradient Descent (SVGD) framework, guiding iterative refinement of grasp configurations. Evaluated on 67 objects with 5,360 grasp attempts, our method achieves an average success rate of 60.9%, outperforming AnyGrasp (31.1%), Grasp Pose Detection (48.4%), and AS-ICP (56.6%). These results highlight the strong generalization ability of our approach and demonstrate how combining data-driven learning with geometric optimization addresses the limitations of either strategy in isolation.
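The SVGD refinement at the core of the method follows the standard update: particles move along kernel-weighted score directions plus a repulsive term that keeps the grasp set diverse. The sketch below uses a toy quadratic energy in place of the learned EBM and a median-heuristic RBF bandwidth; it is illustrative, not the authors' implementation.

    import torch

    def svgd_step(x, energy, step=1e-2):
        """One SVGD update. x: (n, d) grasp particles; energy: (n, d) -> (n,)."""
        x = x.detach().requires_grad_(True)
        score = -torch.autograd.grad(energy(x).sum(), x)[0]   # grad log p = -grad E
        x = x.detach()

        n = x.shape[0]
        d2 = torch.cdist(x, x) ** 2
        h = d2.median() / torch.log(torch.tensor(n + 1.0))    # median heuristic
        k = torch.exp(-d2 / (h + 1e-8))                       # RBF kernel
        diff = x.unsqueeze(1) - x.unsqueeze(0)                # x_i - x_j
        repulse = 2.0 / (h + 1e-8) * (k.unsqueeze(-1) * diff).sum(dim=1)
        return x + step * (k @ score + repulse) / n           # attract + spread

    # Refine 64 toy grasp parameter vectors under a quadratic stand-in energy.
    grasps = torch.randn(64, 6)
    energy = lambda g: 0.5 * (g ** 2).sum(dim=1)
    for _ in range(200):
        grasps = svgd_step(grasps, energy)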
|
| |
| 15:00-16:30, Paper ThI2I.97 | Add to My Program |
| VEGA: A Geometry-Aware Enveloping Layer-Based Path Planning Strategy for Accurate Robotic 3D Printing |
|
| Choi, Won Bin | Pohang University of Science and Technology |
| Chung, Wan Kyun | POSTECH |
Keywords: Additive Manufacturing, Motion and Path Planning, Computational Geometry
Abstract: Additive manufacturing offers extensive design freedom but remains limited by path planning strategies that rely on planar slicing, which introduce staircase artifacts. Non-planar slicing improves local fidelity yet still produces stacking artifacts due to exposed layer boundaries, leaving a gap in capturing complex geometries. This work proposes a Volumetric Envelope Generation Algorithm (VEGA) that generates geometry-aware enveloping layers through a buffering-erosion process. By introducing a Buffer Restraint Region (BRR), the method enables control over incorporation mode and layer positioning. Printability-based splitting further ensures feasible print paths for fabrication. Experiments were conducted on planar- and non-planar-base geometries, printed with a custom 3D printing robot. Printed models were scanned during evaluation, showing reductions of 68.5% in volumetric error, 69.1% in surface deviation, and 77.9% in chamfer distance relative to planar slicing, achieved without additional computational cost (≈32 s per model) or added print length. These results demonstrate that enveloping-based path planning effectively mitigates artifacts inherent to slicing-based approaches, providing a strategy for high-fidelity, reliable fabrication of complex geometries.
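For intuition only, a loose 2-D analogue of the buffering-erosion idea: dilating (buffering) a part and subtracting an eroded copy yields a thin layer that conforms to the geometry rather than to a slicing plane. This is a morphological toy on a voxel grid, not VEGA's volumetric algorithm.

    import numpy as np
    from scipy import ndimage

    # A part with a protrusion, voxelized on a 64x64 grid (invented geometry).
    part = np.zeros((64, 64), dtype=bool)
    part[20:44, 20:44] = True
    part[10:20, 28:36] = True

    buffered = ndimage.binary_dilation(part, iterations=4)       # "buffering"
    envelope = buffered & ~ndimage.binary_erosion(buffered, iterations=1)
    # 'envelope' is a one-cell-thick conformal layer around the buffered part,
    # following the protrusion instead of a planar slice.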
|
| |
| 15:00-16:30, Paper ThI2I.98 | Add to My Program |
| ViSA-Flow: Accelerating Robot Skill Learning Via Large-Scale Video Semantic Action Flow |
|
| Chen, Changhe | University of Michigan |
| Yang, Quantao | KTH Royal Institute of Technology |
| Xu, Xiaohao | University of Michigan, Ann Arbor |
| Fazeli, Nima | University of Michigan |
| Andersson, Olov | KTH Royal Institute of Technology |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. We demonstrate through extensive experiments on the CALVIN benchmark and real-world tasks that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at https://visaflow-web.github.io/ViSAFLOW.
|
| |
| 15:00-16:30, Paper ThI2I.99 | Add to My Program |
| Selective Actuation for Microrobots Based on Distributed Magnetic Field Design |
|
| Fang, Kaiwen | Chinese University of Hong Kong, Shenzhen |
| Yang, Zhen | Chinese University of Hong Kong, Shenzhen |
| Wang, Lianshuo | Beihang University |
| Liu, Yuezhen | Chinese University of Hong Kong, Shenzhen |
| Chen, Hui | Chinese University of Hong Kong, Shenzhen |
| Liu, Yu | Chinese University of Hong Kong, Shenzhen |
| Yu, Jiangfan | Chinese University of Hong Kong, Shenzhen |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales
Abstract: Mechanical stimulation is essential for regulating cellular processes such as proliferation, differentiation, and apoptosis. Magnetic microrobot swarms offer a promising platform for delivering targeted mechanical stimulation to cells via remote actuation under rotating magnetic fields. However, magnetic fields globally activate swarms in non-target regions, risking undesired biological effects. To overcome this limitation, we propose a spatially selective magnetic actuation strategy that confines mechanical stimulation to user-defined regions. A dual-robotic-arm magnetic actuation system is developed to generate a selective rotating magnetic field. The field ensures swarms have smooth rotation and longer chain formation within the target area, enabling effective mechanical stimulation, while swarms outside exhibit shortened chains and disordered motion. We further demonstrate that the area swept by the rotating chain of microrobots peaks within the targeted region but drops sharply beyond it. This approach provides a foundation for precise mechanostimulation in biomedical applications with minimal off-target effects.
|
| |
| 15:00-16:30, Paper ThI2I.100 | Add to My Program |
| WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments |
|
| Knights, Joshua | University of Sydney |
| Reid, Joseph | CSIRO |
| Roy, Kaushik | CSIRO |
| Hall, David | Commonwealth Scientific and Industrial Research Organisation |
| Cox, Mark | CSIRO |
| Moghadam, Peyman | CSIRO |
Keywords: Data Sets for Robotic Vision, Deep Learning for Visual Perception, RGB-D Perception
Abstract: Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.
|
| |
| 15:00-16:30, Paper ThI2I.101 | Add to My Program |
| Toward Human-Like Assistance: Detecting Help-Seeking in Human–Robot Collaboration Via Implicit Signals |
|
| San Martin, Ane | Tekniker |
| Iriondo Azpiri, Ander | Department of Autonomous and Intelligent Systems, Tekniker - Basque Research and Technology Alliance (BRTA) |
| Hagenow, Michael | University of Wisconsin - Madison |
| Shah, Julie A. | MIT |
| Kildal, Johan | TEKNIKER |
| Lazkano, Elena | University of Basque Country |
|
|
| |
| 15:00-16:30, Paper ThI2I.102 | Add to My Program |
| KINESIS: Motion Imitation for Human Musculoskeletal Locomotion |
|
| Simos, Merkourios | Ecole Polytechnique Fédérale De Lausanne (EPFL) |
| Chiappa, Alberto Silvio | Ecole Polytechnique Fédérale De Lausanne (EPFL) |
| Mathis, Alexander | Ecole Polytechnique Fédérale De Lausanne (EPFL) |
Keywords: Humanoid and Bipedal Locomotion, Sensorimotor Learning, Biologically-Inspired Robots
Abstract: How do humans move? Advances in reinforcement learning (RL) have produced impressive results in capturing human motion using physics-based humanoid control. However, torque-controlled humanoids fail to model key aspects of human motor control, such as biomechanical joint constraints and non-linear, overactuated musculotendon control. We present KINESIS, a model-free motion imitation framework that tackles these challenges. KINESIS is trained on 1.8 hours of locomotion data and achieves strong motion imitation performance on unseen trajectories. Through a negative mining approach, KINESIS learns robust locomotion priors that we leverage to deploy the policy on several downstream tasks such as text-to-control, target point reaching, and football penalty kicks. Importantly, KINESIS learns to generate muscle activity patterns that correlate well with human EMG activity. We show that these results scale seamlessly across biomechanical model complexity, demonstrating control of up to 290 muscles. Overall, the physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control. Code, videos and benchmarks are available at https://github.com/amathislab/Kinesis.
|
| |
| 15:00-16:30, Paper ThI2I.103 | Add to My Program |
| Physics-Informed Machine Learning for Efficient Sim-To-Real Data Augmentation in Micro-Object Pose Estimation |
|
| Zongcai, Tan | Imperial College London |
| Wei, Lan | Imperial College London |
| Zhang, Dandan | Imperial College London |
Keywords: Micro/Nano Robots, Computer Vision for Medical Robotics, Deep Learning for Visual Perception
Abstract: Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging. This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data.
|
| |
| 15:00-16:30, Paper ThI2I.104 | Add to My Program |
| T-CoLoc: Leveraging Tethers for Reliable Co-Localization within an Underwater ROV Chain |
|
| Drupt, Juliette | University of Montpellier |
| Dune, Claire | University of Toulon |
| Comport, Andrew Ian | CNRS-I3S/UNS |
| Hugel, Vincent | University of Toulon |
Keywords: Marine Robotics
Abstract: Underwater Remotely Operated Vehicles (ROVs) exchange data with a control station via a communication cable. One or more intermediate robots can be placed along this tether to manage its shape and minimize the mechanical effects on the ROV. This work deals with the localization of a pair of underwater robots connected by a tether, in a previously unknown environment. While each robot can estimate its trajectory and a model of its surroundings using Simultaneous Localization And Mapping (SLAM) algorithms, aligning these observations in the same reference frame requires inter-robot data association. In this work, we introduce T-CoLoc, a new method that leverages an estimate of the tether shape to align the individual robots' model frames. An experimental validation in a pool demonstrates that T-CoLoc can align the trajectories of the two robots in the same reference frame with an error lower than 20 cm using the noisy shape estimation of a 3 m long tether.
|
| |
| 15:00-16:30, Paper ThI2I.105 | Add to My Program |
| MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-From-Motion in Driving Scenes |
|
| Xuan, Lingfeng | Shanghai Jiao Tong University |
| Nie, Chang | SJTU |
| Xu, Yiqing | China University of Mining and Technology |
| Miao, Yanzi | China University of Mining and Technology |
| Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Localization, Mapping, Autonomous Vehicle Navigation
Abstract: Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset. The code is available at https://github.com/IRMVLab/MRASfM.
|
| |
| 15:00-16:30, Paper ThI2I.106 | Add to My Program |
| TerraSkipper: A Centimeter-Scale Robot for Multi-Terrain Skipping and Crawling |
|
| Singh, Shashwat | Carnegie Mellon University |
| Zhang, Ziyun | Carnegie Mellon University |
| Matonis, Spencer | Edulis Therapeutics |
| Temel, Zeynep | Carnegie Mellon University |
Keywords: Biomimetics, Biologically-Inspired Robots, Mechanism Design
Abstract: Mudskippers are unique amphibious fish capable of locomotion in diverse environments, including terrestrial surfaces, aquatic habitats, and highly viscous substrates such as mud. This versatile locomotion is largely enabled by their powerful tail, which stores and rapidly releases energy to produce impulsive jumps. Inspired by this biological mechanism, we present the design and development of a multi-terrain centimeter-scale skipping and crawling robot. The robot is predominantly 3D printed and features onboard sensing, computation, and power. It is equipped with two side fins for crawling, each integrated with a Hall-effect sensor for gait control, while a rotary springtail driven by a 10 mm planetary gear motor enables continuous impulsive skipping across a range of substrates to achieve multi-terrain locomotion. We modeled and experimentally characterized the tail, identifying an optimal length of 25 mm that maximizes the mean propulsive force (4 N, with peaks up to 6 N) for forward motion. In addition, we evaluated skipping on substrates where fin-based crawling alone fails, and varied the moisture content of uniform sand and bentonite clay powder to compare skipping with crawling. Skipping consistently produced higher mean velocities than crawling, particularly on viscous and granular media. Finally, outdoor tests on grass, loose sand, and hard ground confirmed that combining skipping on entangling and granular terrain with crawling on firm ground extends the operational range of the robot in real-world environments.
|
| |
| 15:00-16:30, Paper ThI2I.107 | Add to My Program |
| Supportive Relationships-Aware Hierarchical Reinforcement Learning for Efficient Ex-Situ Object Rearrangement |
|
| Xiao, Leibing | Shandong University |
| Wang, Xuemei | School of Control Science and Engineering, Shandong University |
| Zhongqiang, Zhao | Shandong University |
| Wang, Yachao | Shandong University |
| Liu, Jin | Shandong University |
| Wang, Chaoqun | Shandong University |
Keywords: Domestic Robotics, Service Robotics, Planning, Scheduling and Coordination
Abstract: In ex-situ object rearrangement tasks within open environments, robots face significant challenges due to the increased cost of moving objects over large workspaces. To address this issue, we propose a hierarchical reinforcement learning-based approach that takes into account the supportive relationships and semantic correlations between objects. The robot groups and stacks objects with compatible supportive capabilities, moving them together to their target locations to optimize task execution. Specifically, we use a large language model to assess the supportive relationships and semantic correlations between objects. In the high-level decision-making process, objects are grouped based on their supportive capabilities, while the low-level process refines these groupings using a graph capsule convolutional network. Experimental results demonstrate that our approach not only reduces the number of movements required but also improves task efficiency and significantly decreases task completion time by approximately 50%, compared to methods that do not consider supportive relationships.
|
| |
| 15:00-16:30, Paper ThI2I.108 | Add to My Program |
| Train Once, Apply Broadly: Low-Frequency Generative Augmentation for Driver Distraction Recognition under Photometric Shifts |
|
| Liu, Dichao | Dalian Maritime University |
| Zhao, Longjiao | Independent Researcher, Japan |
| Gu, Mingkai | Suzhou University of Science and Technology |
| Chen, HaoJiang | JiangSu University |
| Ji, Ying | Wuxi University of Technology |
Keywords: Deep Learning for Visual Perception, Intelligent Transportation Systems, Recognition
Abstract: Driver distraction recognition (DDR) degrades under deployment-time shifts in camera/ISP pipelines and illumination. We frame this as a single-source domain generalization (SSDG) problem: training on one labeled source domain and testing on unseen devices and lighting. Motivated by this, we propose Low-Frequency Generative Augmentation (LFGA), which separates each image into a fixed high-frequency structure and a re-renderable low-frequency base. Multi-stage, feature-conditioned generators perturb only the photometric low-frequency content and recombine it with the original high-frequency structure to yield "hard-but-correct" views to teach the model photometric invariances. Training imposes decision consistency via cross-entropy and logit matching, and promotes stage-wise separation along class-agnostic factors with a feature-dissimilarity loss. Generators are training-only. On two DDR benchmarks with synthetic cross-photometric shifts and a zero-shot real cross-device video test, LFGA improves cross-domain performance over strong SSDG and DDR baselines while preserving in-domain accuracy.
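The frequency separation LFGA builds on can be sketched with a plain FFT low-pass split: only the low-frequency base is perturbed, and the fixed high-frequency structure is added back. The circular cutoff and the recombination coefficients below are assumptions for illustration, not the paper's generators.

    import torch

    def split_low_high(img, cutoff=8):
        """img: (C, H, W). Returns (low, high) with img = low + high."""
        C, H, W = img.shape
        F = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
        yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        mask = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2) <= cutoff ** 2).float()
        low = torch.fft.ifft2(torch.fft.ifftshift(F * mask, dim=(-2, -1))).real
        return low, img - low

    img = torch.rand(3, 64, 64)
    low, high = split_low_high(img)
    augmented = 0.7 * low + 0.1 + high   # photometric-only shift; structure kept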
|
| |
| 15:00-16:30, Paper ThI2I.109 | Add to My Program |
| SonarGAN: A Progressive GAN Framework for Sonar Image Denoising under Multi-Type Noises |
|
| Hu, Zhangrui | Harbin Institute of Technology, Shenzhen |
| Feng, Yunxuan | Harbin Institute of Technology, Shenzhen |
| Nie, Binyu | Harbin Institute of Technology, Shenzhen |
| Yan, Lei | Harbin Institute of Technology, Shenzhen |
| Lu, Wenjie | Harbin Institute of Technology, Shenzhen |
| Hu, Liang | Harbin Institute of Technology, Shenzhen |
Keywords: Marine Robotics, Deep Learning for Visual Perception
Abstract: Forward-looking sonar is essential for underwater perception especially in turbid waters, yet its images are often strongly degraded by various noises, including speckle, sidelobe, and structural noises, which severely hinder downstream tasks such as underwater reconstruction, positioning, and navigation. Most conventional sonar denoising methods reduce the noise at the expense of loss of fine image features or blurred image, while modern supervised learning methods demand large paired datasets that are impractical to obtain in real underwater conditions. In this paper, we propose SonarGAN, a progressive Generative Adversarial Networks (GAN) based framework that denoises sonar images under multi-type noises in one go. Unlike traditional supervised methods, SonarGAN avoids the need for costly paired datasets by combining unpaired real and simulated images, synthetic noisy–clean pairs, and joint refinement for comprehensive denoising. Extensive experiments across multiple types of sonar and underwater environments demonstrate the effectiveness of SonarGAN and its generalization in real-world conditions.
|
| |
| 15:00-16:30, Paper ThI2I.110 | Add to My Program |
| Receding Horizon Reinforcement Learning with Autoregressive Model for Motion Control of Autonomous Vehicles |
|
| Yin, Xin | National University of Defense Technology |
| Cao, Haotian | National University of Defense Technology |
| Zhang, Xinglong | National University of Defense Technology |
| Ma, Qingwen | National University of Defense Technology |
| Xu, Xin | National University of Defense Technology |
| Xie, Haibin | National University of Defense Technology |
Keywords: Reinforcement Learning, Motion Control, Intelligent Transportation Systems
Abstract: This paper presents a model-based reinforcement learning (MBRL) approach with a receding horizon mechanism to optimize the lateral trajectory-tracking performance of autonomous vehicles (AVs). Accurate modeling of complex vehicle dynamics and adaptation to dynamic environments with limited data pose significant challenges for MBRL in AV control. To address these challenges, we propose sample-efficient algorithms that leverage autoregressive modeling to adapt from limited data while managing complex vehicle dynamics. Unlike traditional methods reliant on fixed models, our approach uses the temporal reasoning of autoregressive (AR) models to compensate for the residual dynamics, which effectively approximates the local effects of nonlinearities and disturbances. Integrated with real-time sensor data, the residual generation model is continuously refined via incremental learning in a closed-loop framework, enhancing adaptability. This architecture, combining physical modeling with data-driven residuals, maintains interpretability and improves responsiveness in complex scenarios. CarSim simulations demonstrate superior performance over other state-of-the-art learning-based predictive controllers and classical methods for AV lateral control. Real-world validation on a HongQi electric vehicle (HQEV) confirms the algorithm’s effectiveness, showing significant improvements over classical model predictive control (MPC). This approach holds substantial potential for advanced driver-assistance systems (ADAS) and fully autonomous driving, enabling precise control under diverse conditions.
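The residual-compensation idea can be sketched as an AR(H) model on recent one-step prediction errors of a nominal model, refit online by least squares. This scalar toy, with an invented constant disturbance, illustrates the mechanism only, not the paper's learning architecture.

    import numpy as np

    class ARResidual:
        """Nominal model plus an AR(H) correction fit to recent residuals."""
        def __init__(self, nominal, H=5):
            self.f, self.H, self.errs = nominal, H, []
            self.a = np.zeros(H)

        def predict(self, x, u):
            x_next = self.f(x, u)                      # physics-based prediction
            if len(self.errs) >= self.H:               # AR residual compensation
                x_next += np.dot(self.a, self.errs[-self.H:][::-1])
            return x_next

        def observe(self, x, u, x_next):
            self.errs.append(x_next - self.f(x, u))    # one-step model error
            e = np.array(self.errs)
            if len(e) > 2 * self.H:                    # incremental refit
                X = np.stack([e[t - self.H:t][::-1]
                              for t in range(self.H, len(e))])
                self.a, *_ = np.linalg.lstsq(X, e[self.H:], rcond=None)

    # Toy usage: the nominal model misses a constant disturbance of 0.05,
    # which the AR term learns to predict from the residual history.
    model = ARResidual(lambda x, u: x + 0.1 * u)
    x = 0.0
    for _ in range(50):
        x_next = x + 0.1 * 1.0 + 0.05                  # true (unknown) dynamics
        model.observe(x, 1.0, x_next)
        x = x_next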
|
| |
| 15:00-16:30, Paper ThI2I.111 | Add to My Program |
| Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera |
|
| Tang, Tutian | Shanghai Jiao Tong University |
| Ji, Xingyu | Shanghai Jiao Tong University |
| Li, Yutong | Shanghai Jiao Tong University |
| Liu, MingHao | Shanghai JiaoTong University |
| Xu, Wenqiang | Shanghai Jiaotong University |
| Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Gesture, Posture and Facial Expressions, Deep Learning for Visual Perception, Human Detection and Tracking
Abstract: Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
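The metric-accuracy claim rests on textbook stereo triangulation: with focal length f (pixels) and calibrated baseline b (meters), a disparity of d pixels gives depth Z = f·b/d, with no monocular scale ambiguity. A worked example with invented numbers:

    f, b, d = 700.0, 0.12, 21.0   # focal length (px), baseline (m), disparity (px)
    Z = f * b / d                 # = 4.0 m of metric depth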
|
| |
| 15:00-16:30, Paper ThI2I.112 | Add to My Program |
| DANCeRS: A Distributed Algorithm for Negotiating Consensus in Robot Swarms with Gaussian Belief Propagation |
|
| Patwardhan, Aalok | Imperial College London |
| Davison, Andrew J | Imperial College London |
Keywords: Swarm Robotics, Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: Robot swarms require cohesive collective behaviour to address diverse challenges, including shape formation and decision-making. Existing approaches often treat consensus in discrete and continuous decision spaces as distinct problems. We present DANCeRS, a unified, distributed algorithm leveraging Gaussian Belief Propagation (GBP) to achieve consensus in both domains. By representing a swarm as a factor graph our method ensures scalability and robustness in dynamic environments, relying on purely peer-to-peer message passing. We demonstrate the effectiveness of our general framework through two applications where agents in a swarm must achieve consensus on global behaviour whilst relying on local communication. In the first, robots must perform path planning and collision avoidance to create shape formations. In the second, we show how the same framework can be used by a group of robots to form a consensus over a set of discrete decisions. Experimental results highlight our method's scalability and efficiency compared to recent approaches to these problems making it a promising solution for multi-robot systems requiring distributed consensus. We encourage the reader to see the supplementary video demo.
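A scalar GBP consensus sketch illustrates the mechanism: each robot fuses its Gaussian prior with messages from its neighbours, marginalizes through a pairwise smoothness factor, and passes the result on, using purely local communication. Priors, the factor weight, and the topology below are invented for illustration, not the DANCeRS formulation.

    def gbp_consensus(mu, lam, edges, w=10.0, iters=50):
        """mu[i], lam[i]: mean/precision of robot i's Gaussian prior;
        edges: undirected neighbour pairs; w: pairwise factor precision."""
        n = len(mu)
        msgs = {(i, j): (0.0, 0.0) for i, j in edges}
        msgs |= {(j, i): (0.0, 0.0) for i, j in edges}
        nbrs = {i: [j for (a, j) in msgs if a == i] for i in range(n)}
        for _ in range(iters):
            new = {}
            for (i, j) in msgs:
                # fuse robot i's prior with messages from all neighbours but j
                P = lam[i] + sum(msgs[(k, i)][0] for k in nbrs[i] if k != j)
                eta = lam[i] * mu[i] + sum(msgs[(k, i)][1] for k in nbrs[i] if k != j)
                # marginalize x_i through the factor exp(-w/2 (x_i - x_j)^2)
                new[(i, j)] = (w * P / (w + P), w * eta / (w + P))
            msgs = new
        P = [lam[i] + sum(msgs[(k, i)][0] for k in nbrs[i]) for i in range(n)]
        eta = [lam[i] * mu[i] + sum(msgs[(k, i)][1] for k in nbrs[i]) for i in range(n)]
        return [e / p for e, p in zip(eta, P)]      # posterior means per robot

    # Three robots in a chain negotiate a shared value from priors 0.0, 1.0, 2.0.
    print(gbp_consensus([0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [(0, 1), (1, 2)]))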
|
| |
| 15:00-16:30, Paper ThI2I.113 | Add to My Program |
| Occupancy-Aware Trajectory Planning for Autonomous Valet Parking in Uncertain Dynamic Environments |
|
| Savvas Sadiq Ali, Farhad Nawaz | University of Pennsylvania |
| Tariq, Faizan M. | Honda Research Institute USA, Inc |
| Bae, Sangjae | Honda Research Institute, USA |
| Isele, David | University of Pennsylvania, Honda Research Institute USA |
| Singh, Avinash | Honda Research Institute, USA |
| Figueroa, Nadia | University of Pennsylvania |
| Matni, Nikolai | University of Pennsylvania |
| D'sa, Jovin | Honda Research Institute, USA |
Keywords: Motion and Path Planning, Planning under Uncertainty, Autonomous Vehicle Navigation
Abstract: Autonomous Valet Parking (AVP) requires planning under partial observability, where parking spot availability evolves as dynamic agents enter and exit spots. Existing approaches either rely only on instantaneous spot availability or make static assumptions, thereby limiting foresight and adaptability. We propose an approach that estimates the probability of future spot occupancy by distinguishing initially vacant and occupied spots while leveraging nearby dynamic agent motion. The probabilistic estimator integrates partial, noisy observations from a limited field of view with the evolving uncertainty of unobserved spots. Coupled with the estimator, we design a strategy planner that balances goal-directed parking maneuvers with exploratory navigation based on information gain, and incorporates wait-and-go behaviors at promising spots. Through randomized simulations emulating large parking lots, we demonstrate that our framework significantly improves parking efficiency and trajectory smoothness over existing approaches, while maintaining safety margins. Simulation videos: https://sites.google.com/view/avp-hri/home.
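A minimal per-spot belief of the kind such an estimator maintains: spots inside the field of view get a Bayes measurement update from a noisy detector, while unobserved spots drift toward a stationary occupancy prior under a two-state arrival/departure model. All rates below are invented.

    def step_belief(p_occ, observed, z=None,
                    p_hit=0.9, p_false=0.1, p_arrive=0.02, p_leave=0.03):
        """One belief update for a single spot's occupancy probability."""
        if observed:                            # Bayes measurement update
            if z:                               # detector reports "occupied"
                num = p_hit * p_occ
                return num / (num + p_false * (1.0 - p_occ))
            num = (1.0 - p_hit) * p_occ         # detector reports "vacant"
            return num / (num + (1.0 - p_false) * (1.0 - p_occ))
        # unobserved: drift toward the stationary occupancy of the Markov model
        return p_occ * (1.0 - p_leave) + (1.0 - p_occ) * p_arrive

    # A spot seen vacant once, then out of view for ten steps: confidence decays.
    p = step_belief(0.5, observed=True, z=False)
    for _ in range(10):
        p = step_belief(p, observed=False)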
|
| |
| 15:00-16:30, Paper ThI2I.114 | Add to My Program |
| Actron3D: Learning Actionable Neural Functions from Videos for Transferable Robotic Manipulation |
|
| Zhang, Anran | Technical University of Munich |
| Chen, Hanzhi | Technical University of Munich |
| Burkhardt, Yannick | Technical University of Munich |
| Zhong, Yao | Technical University of Munich |
| Betz, Johannes | Technical University of Munich |
| Oleynikova, Helen | ETH Zurich |
| Leutenegger, Stefan | ETH Zurich |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Learning from Demonstration
Abstract: We present Actron3D, a framework that enables robots to acquire transferable 6-DoF manipulation skills from monocular, uncalibrated, RGB-only human demonstration videos. Our key idea is to represent manipulation knowledge within a video as a continuous neural function over object space. At the core of Actron3D lies the Neural Affordance Function, which distills geometry, visual features, contact priors, and action flows from diverse demonstration videos into a compact 3D neural representation. During deployment, we adopt a hierarchical pipeline that retrieves the matched affordance function and transfers encoded manipulation knowledge to novel objects through coarse-to-fine differentiable optimization. Leveraging the continuous nature of Neural Affordance Function, the framework performs spatial queries over multimodal features to align demonstrations with observations and generates precise 6-DoF manipulation policy. Experiments in both simulation and the real-world demonstrate that Actron3D significantly outperforms prior methods, achieving a 14.9 percentage point improvement in the average success rate across 13 tasks while requiring only 2–3 demonstration videos per task.
|
| |
| 15:00-16:30, Paper ThI2I.115 | Add to My Program |
| A Narwhal-Inspired Sensing-To-Control Framework for Small Fixed-Wing Aircraft |
|
| Xie, Fengze | California Institute of Technology |
| Fan, Xiaozhou | HKUST(GZ) |
| Schuster, Jacob | California Institute of Technology |
| Yue, Yisong | California Institute of Technology |
| Gharib, Morteza | California Institute of Technology |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Aerial Systems: Mechanics and Control
Abstract: Fixed-wing unmanned aerial vehicles (UAVs) offer endurance and efficiency but lack low-speed agility because of their highly coupled dynamics. We present an end-to-end sensing-to-control pipeline that combines bio-inspired hardware instrumentation, physics-informed dynamics learning, and convex control allocation. Measuring the oncoming flow on a small airframe is difficult: near-body aerodynamics, the propeller slipstream, control-surface actuation, and the presence of gusts distort pressures and make sensor signals input-dependent. To raise the signal-to-noise ratio and gain valuable response time, we take inspiration from the narwhal's tusk and protrude our in-house multi-hole probes far upstream of the airframe, complementing them with sparse yet carefully placed wing pressure sensors for local flow measurement under systematically introduced gusts of significant magnitude. A data-driven calibration maps the probes' pressure signals to airspeed and flow angles. We then learn a control-affine model of aerodynamic forces with a soft left/right symmetry regularizer that improves identifiability under partial observability and limits confounding between wing pressures and aileron inputs. Desired wrenches (force outputs) are realized by a regularized least-squares optimizer that yields smooth, trimmed actuation. Wind-tunnel studies across multiple airspeeds and gust conditions show that adding wing pressures reduces force-estimation error by 25%–30%, that the proposed model degrades far less under distribution shift (about 12% versus 44% for an unstructured baseline), and that force tracking improves with smoother inputs, including a 27% reduction in normal-force RMSE relative to a plain affine model and 34% relative to an unstructured baseline.
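The regularized least-squares allocation step mentioned above admits a closed form; the sketch below shows that step on a toy actuator effectiveness matrix. B, lam, and u_prev are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Regularized least-squares control allocation: realize a desired wrench w
# through actuators with effectiveness matrix B, while penalizing deviation
# from the previous command so the actuation stays smooth.

def allocate(B, w, u_prev, lam=0.1):
    """Solve min_u ||B u - w||^2 + lam * ||u - u_prev||^2 in closed form."""
    n = B.shape[1]
    A = B.T @ B + lam * np.eye(n)
    b = B.T @ w + lam * u_prev
    return np.linalg.solve(A, b)

B = np.array([[1.0, 1.0, 0.0],    # e.g. normal force per actuator
              [0.5, -0.5, 1.0]])  # e.g. rolling moment per actuator
u = allocate(B, w=np.array([1.0, 0.2]), u_prev=np.zeros(3))
print(u, B @ u)  # command and the wrench it actually produces
```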
|
| |
| 15:00-16:30, Paper ThI2I.116 | Add to My Program |
| HINT-3D: Human-In-The-Loop Interactive Test-Time Adaptation for 3D Segmentation |
|
| Jamaleddine, Odei | American University of Beirut |
| Elhajj, Imad | American University of Beirut |
| Asmar, Daniel | American University of Beirut |
Keywords: Object Detection, Segmentation and Categorization
Abstract: We present HINT-3D, a human-in-the-loop test-time adaptation framework for 3D semantic segmentation. A few corrective clicks are converted into region masks by a promptable 3D interface (PointSAM). These masks supervise stability-aware updates to a pre-trained backbone at inference. We persist the updates so later scenes start from improved weights, enabling cumulative learning. The wrapper is backbone-agnostic: it requires only logits, a mask-to-index bridge, and access to a small trainable parameter set; we instantiate it on KPConv, RandLA-Net, and Point Transformer v1. On S3DIS Area-5, HINT-3D delivers strong effort-accuracy gains within a scene, consistent zero-click improvements across scenes, and reduced Expected Calibration Error (ECE), while maintaining responsiveness with head-only updates and uncertainty-gated training. We report mIoU versus saved masks, cross-scene transfer, ECE, latency, and class-specific corrections on common indoor failure modes.
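A plausible reading of the head-only, uncertainty-gated update is sketched below on a toy linear segmentation head; the gating threshold, learning rate, and all names are assumptions for illustration, not the HINT-3D code.

```python
import numpy as np

# Head-only test-time adaptation: a corrective click mask supervises a
# gradient step on the classifier head W, applied only to masked points
# whose predictive entropy exceeds a gate (uncertainty-gated training).

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def tta_step(W, feats, mask_idx, mask_label, n_cls, lr=0.1, gate=0.5):
    probs = softmax(feats @ W)
    ent = -(probs * np.log(probs + 1e-9)).sum(-1) / np.log(n_cls)
    sel = mask_idx[ent[mask_idx] > gate]   # only uncertain masked points
    if len(sel) == 0:
        return W
    onehot = np.eye(n_cls)[np.full(len(sel), mask_label)]
    # Cross-entropy gradient w.r.t. the head weights only
    grad = feats[sel].T @ (softmax(feats[sel] @ W) - onehot) / len(sel)
    return W - lr * grad

rng = np.random.default_rng(3)
feats = rng.normal(size=(1000, 16))          # frozen backbone features
W = rng.normal(size=(16, 5)) * 0.01          # trainable head
W = tta_step(W, feats, mask_idx=np.arange(50), mask_label=2, n_cls=5)
print(W.shape)
```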
|
| |
| 15:00-16:30, Paper ThI2I.117 | Add to My Program |
| DYNAMO: Dependency-Aware Deep Learning Framework for Articulated Assembly Motion Prediction |
|
| Patel, Mayank | Purdue University |
| Jain, Rahul | Purdue University |
| Unmesh, Asim | Purdue University |
| Ramani, Karthik | Purdue University |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision, Visual Learning
Abstract: Understanding the motion of articulated mechanical assemblies from static geometry remains a core challenge in 3D perception and design automation. Prior work on everyday articulated objects such as doors and laptops typically assumes simplified kinematic structures or relies on joint annotations. However, in mechanical assemblies like gears, motion arises from geometric coupling, through meshing teeth or aligned axes, making it difficult for existing methods to reason about relational motion from geometry alone. To address this gap, we introduce MechBench, a benchmark dataset of 693 diverse synthetic gear assemblies with part-wise ground-truth motion trajectories. MechBench provides a structured setting to study coupled motion, where part dynamics are induced by contact and transmission rather than predefined joints. Building on this, we propose DYNAMO, a dependency-aware neural model that predicts per-part SE(3) motion trajectories directly from segmented CAD point clouds. Experiments show that DYNAMO outperforms strong baselines, achieving accurate and temporally consistent predictions across varied gear configurations. Together, MechBench and DYNAMO establish a novel systematic framework for data-driven learning of coupled mechanical motion in CAD assemblies.
|
| |
| 15:00-16:30, Paper ThI2I.118 | Add to My Program |
| Right-Side-Out: Learning Zero-Shot Sim-To-Real Garment Reversal |
|
| Yu, Chang | University of California, Los Angeles |
| Ma, Siyu | University of California, Los Angeles |
| Du, Wenxin | University of California, Los Angeles |
| Zong, Zeshun | University of California, Los Angeles |
| Xue, Han | Shanghai Jiao Tong University |
| Chen, Wendi | Shanghai Jiao Tong University |
| Lu, Cewu | ShangHai Jiao Tong University |
| Yang, Yin | University of Utah |
| Han, Xuchen | Toyota Research Institute |
| Masterjohn, Joseph | Toyota Research Institute |
| Castro, Alejandro | Toyota Research Institute |
| Jiang, Chenfanfu | University of California, Los Angeles |
Keywords: Bimanual Manipulation, Deep Learning in Grasping and Manipulation, Simulation and Animation
Abstract: Turning garments right-side out is a challenging manipulation task: it is highly dynamic, entails rapid contact changes, and is subject to severe visual occlusion. We introduce Right-Side-Out, a zero-shot sim-to-real framework that effectively solves this challenge by exploiting task structures. We decompose the task into Drag/Fling to create and stabilize an access opening, followed by Insert&Pull to invert the garment. Each step uses a depth-inferred, keypoint-parameterized bimanual primitive that sharply reduces the action space while preserving robustness. Efficient data generation is enabled by our custom-built, high-fidelity, GPU-parallel Material Point Method (MPM) simulator that models thin-shell deformation and provides robust and efficient contact handling for batched rollouts. Built on the simulator, our fully automated pipeline scales data generation by randomizing garment geometry, material parameters, and viewpoints, producing depth, masks, and per-primitive keypoint labels without any human annotations. With a single depth camera, policies trained entirely in simulation deploy zero-shot on real hardware, achieving up to 81.3% success rate. By employing task decomposition and high fidelity simulation, our framework enables tackling highly dynamic, severely occluded tasks without laborious human demonstrations.
|
| |
| 15:00-16:30, Paper ThI2I.119 | Add to My Program |
| Robust Friction Estimation for an Active Upper-Limb Exoskeleton Via SOSML Observer |
|
| Heidari, Hamidreza | Technische Hochschule Deggendorf |
| De Risi, Paolino | Università Degli Studi Di Napoli Federico II |
| Ficuciello, Fanny | Università Di Napoli Federico II |
Keywords: Prosthetics and Exoskeletons, Rehabilitation Robotics, Actuation and Joint Mechanisms
Abstract: Friction in compact, geared actuators remains a primary barrier to transparency in upper-limb exoskeletons, especially near zero velocity and during frequent reversals. A momentum-based estimation framework is developed and evaluated on a two-DoF active device (modified EDUExo), where joint friction is recovered from on-board joint measurements and fitted to Coulomb–viscous and Stribeck laws. Two estimators are compared under identical conditions: a first-order momentum observer (FO) and a second-order sliding-mode momentum observer (SOSML). Three velocity trajectories are designed to probe complementary behaviors. In simulation, SOSML adheres more closely to the S-shaped friction law, and preserves loop symmetry under encoder noise; parameter variance and robustness under structured model mismatch are likewise improved relative to FO. The results indicate that SOSML delivers lower lag, cleaner noise profiles, and reduced parameter drift without changing the signal set or adding sensors, thereby strengthening friction identification and compensation on compact, gear-reduced actuators.
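For context, the first-order momentum observer used as the baseline can be written in a few lines for a single joint; the sketch below recovers a simulated Coulomb-viscous friction torque as the observer residual. Inertia, gain, and friction parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

# First-order (FO) momentum observer on one rigid joint: the residual r
# converges to the unmodeled joint torque (here, friction).

def fo_momentum_observer(tau_m, qd, inertia=0.05, K=50.0, dt=1e-3):
    """tau_m: applied motor torque trace; qd: measured joint velocity trace."""
    r, integ = 0.0, 0.0
    out = []
    for tau, w in zip(tau_m, qd):
        p = inertia * w            # generalized momentum
        integ += (tau + r) * dt    # integral of modeled torque + residual
        r = K * (p - integ)        # residual -> unmodeled torque
        out.append(r)
    return np.array(out)

# Simulate a joint obeying inertia * qdd = tau_m + tau_friction.
t = np.arange(0.0, 2.0, 1e-3)
qd = np.sin(2 * np.pi * t)
qdd = 2 * np.pi * np.cos(2 * np.pi * t)
tau_f = -(0.3 * np.sign(qd) + 0.05 * qd)   # Coulomb + viscous friction
tau_m = 0.05 * qdd - tau_f                 # torque that produced qd
r = fo_momentum_observer(tau_m, qd)
print(r[-5:], tau_f[-5:])                  # residual tracks the friction
```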
|
| |
| 15:00-16:30, Paper ThI2I.120 | Add to My Program |
| A Self-Supervised Learning Approach with Differentiable Optimization for UAV Trajectory Planning |
|
| Jiang, Yufei | Pennsylvania State University |
| Zhan, Yuanzhu | Pennsylvania State University |
| Gupta, Harsh Vardhan | SAIR Lab |
| Borde, Chinmay Mahendra | SAIR Labs |
| Geng, Junyi | Pennsylvania State University |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Collision Avoidance
Abstract: While Unmanned Aerial Vehicles (UAVs) have gained significant traction across various fields, path planning in 3D environments remains a critical challenge, particularly under size, weight, and power (SWAP) constraints. Traditional modular planning systems often introduce latency and suboptimal performance due to limited information sharing and local minima issues. End-to-end learning approaches streamline the pipeline by mapping sensory observations directly to actions but require large-scale datasets, face significant sim-to-real gaps, or lack dynamical feasibility. In this paper, we propose a self-supervised UAV trajectory planning pipeline that integrates learning-based depth perception with differentiable trajectory optimization. A 3D cost map guides UAV behavior without expert demonstrations or human labels. Additionally, we incorporate a neural network-based time allocation strategy to improve efficiency and optimality. The system thus combines robust learning-based perception with reliable physics-based optimization for improved generalizability and interpretability. Both simulation and real-world experiments validate our approach across various environments, demonstrating its effectiveness and robustness. Our method achieves a 30.90% reduction in control effort while maintaining competitive tracking performance compared with the state of the art.
|
| |
| 15:00-16:30, Paper ThI2I.121 | Add to My Program |
| Improving Robotic Manipulation Robustness Via NICE Scene Surgery |
|
| Pakdamansavoji, Sajjad | Huawei |
| Pourkeshavarz, Mozhgan | Huawei |
| Sigal, Adam | McGill University |
| Li, Zhiyuan | University of Toronto |
| Yang, Rui Heng | Huawei Technologies Canada |
| Rasouli, Amir | Huawei Technologies Canada |
Keywords: Deep Learning in Grasping and Manipulation, Manipulation Planning, Perception for Grasping and Manipulation
Abstract: Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors can significantly degrade performance and safety. In this work, we propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). Our method minimizes the out-of-distribution (OOD) gap in imitation learning by increasing visual diversity through construction of new experiences from existing demonstrations. By utilizing image generative frameworks and large language models, NICE performs three editing operations: object replacement, restyling, and removal of distracting (non-target) objects. These changes preserve spatial relationships without obstructing target objects and maintain action-label consistency. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets. Using real-world scenes, we showcase the capability of our framework in producing photo-realistic scene enhancement. For downstream tasks, we use NICE data to finetune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our evaluations show that NICE successfully minimizes OOD gaps, resulting in over 20% improvement in accuracy for affordance prediction in highly cluttered scenes. For manipulation tasks, the success rate increases on average by 11% when testing in environments populated with distractors in different quantities. Furthermore, we show that our method improves visual robustness, lowering target confusion by 6%, and enhances safety by reducing the collision rate by 7%.
|
| |
| 15:00-16:30, Paper ThI2I.122 | Add to My Program |
| HITL-D: Human in the Loop Diffusion Assisted Shared Control |
|
| Zilka, Riley | University of Alberta |
| Khlynovskiy, Sergey | University of Alberta |
| Wang, Allie | University of Alberta |
| Jagersand, Martin | University of Alberta |
Keywords: Physically Assistive Devices, Human Factors and Human-in-the-Loop, Imitation Learning
Abstract: Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multistep, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.
|
| |
| 15:00-16:30, Paper ThI2I.123 | Add to My Program |
| Learning Neural Control Barrier Functions from Expert Demonstrations Using Inverse Constraint Learning |
|
| Yang, Yuxuan | Washington University in Saint Louis |
| Sibai, Hussein | Washington University in St. Louis |
Keywords: Learning from Demonstration, Collision Avoidance, Machine Learning for Robot Control
Abstract: Safety is a fundamental requirement for autonomous systems operating in critical domains. Control barrier functions (CBFs) have been used to design safety filters that minimally alter nominal controls for such systems to maintain their safety. Learning neural CBFs has been proposed as a data-driven alternative to their computationally expensive optimization-based synthesis. However, it is often the case that the failure set of states that should be avoided is non-obvious or hard to specify formally, e.g., tailgating in autonomous driving, while a set of expert demonstrations that achieve the task and avoid the failure set is easier to generate. We use inverse constraint learning (ICL) to train a constraint function that classifies the states of the system under consideration as safe, i.e., belonging to a controlled forward-invariant set that is disjoint from the unspecified failure set, or unsafe, i.e., belonging to the complement of that set. We then use that function to label a new set of simulated trajectories to train our neural CBF. We empirically evaluate our approach in four different environments, demonstrating that it outperforms existing baselines and achieves comparable performance to a neural CBF trained with the same data but annotated with ground-truth safety labels.
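For readers new to CBF safety filters, the single-constraint filter the abstract builds on reduces to a closed-form projection; the toy below, with an invented disk obstacle and single-integrator dynamics, is a minimal sketch rather than the paper's learned neural CBF.

```python
import numpy as np

# CBF safety filter: given a barrier h(x) with gradient dh and dynamics
# xdot = f(x) + g(x) u, minimally modify u_nom so dh/dt >= -alpha * h(x).
# With one affine constraint, the QP has the closed-form projection below.

def cbf_filter(u_nom, h, dh, f, g, alpha=1.0):
    Lf = dh @ f                 # drift term of hdot
    Lg = dh @ g                 # control-dependent term of hdot
    margin = Lf + Lg @ u_nom + alpha * h
    if margin >= 0.0:           # nominal control is already safe
        return u_nom
    # project onto the half-space Lf + Lg @ u + alpha * h >= 0
    return u_nom - margin * Lg / (Lg @ Lg)

# Toy 2D single integrator avoiding the unit disk at the origin:
# h(x) = ||x||^2 - 1, xdot = u (f = 0, g = I).
x = np.array([1.2, 0.0])
u_nom = np.array([-1.0, 0.0])   # nominal control drives into the obstacle
u = cbf_filter(u_nom, h=x @ x - 1.0, dh=2 * x,
               f=np.zeros(2), g=np.eye(2))
print(u)                        # braked so that hdot = -alpha * h exactly
```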
|
| |
| 15:00-16:30, Paper ThI2I.124 | Add to My Program |
| Differentiable Optimization-Based Modular Planning Framework for Pick-And-Place with Regrasp |
|
| Song, Yejun | Seoul National University |
| An, Seoki | Seoul National University |
| Lee, Somang | Seoul National University |
| Lee, Jeongmin | Seoul National University |
| Lee, Jeongseob | Seoul National University |
| Yoo, Geun Su | Seoul National University |
| Lee, Dongjun | Seoul National University |
Keywords: Manipulation Planning, Grasping, Task Planning
Abstract: Robotic manipulation commonly involves pick-and-place tasks in which regrasp may be necessary for low-dexterity manipulators. Many existing approaches rely on sampling, which becomes inefficient when repeated regrasp is required in high-dimensional configuration spaces. We propose a modular planning framework that comprises differentiable optimization-based modules: grasp generation, stable pose prediction, inverse kinematics solving, and path planning. The modular design yields a systematic pipeline, enabling direct pick-and-place, static or non-static release, and repeated regrasp by solving each module as needed. Each module leverages differentiable geometric features to efficiently solve its corresponding optimization problem. Our framework explicitly accounts for grasp constraints across both task scenes and predicts stable poses for regrasp planning via optimization rather than expensive physics simulations, thereby improving the feasibility and efficiency of planning. We validated the framework in pick-and-place simulations and real-world experiments.
|
| |
| 15:00-16:30, Paper ThI2I.125 | Add to My Program |
| MultiHand: Design and Verification of a Dexterous Hand with Multi-Modal Grasping Capabilities |
|
| Tian, Yaopeng | Tsinghua University |
| Guo, Changqing | Tsinghua University |
| Li, Shoujie | Tsinghua Shenzhen International Graduate School |
| Liang, Chenxin | Tsinghua University |
| Tan, Junbo | Tsinghua University |
| Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School at Shenzhen, Tsinghua University, 518055 Shenzhen, China |
|
|
| |
| 15:00-16:30, Paper ThI2I.126 | Add to My Program |
| The Turkish Ice Cream Robot: Examining Playful Deception in Social Human-Robot Interactions |
|
| Kim, Hyeonseong | Korea University |
| El-Helou, Roy | Queen's University |
| Lee, Seungbeen | Yonsei University |
| Choi, Sungjoon | Korea University |
| Pan, Matthew | Queen's University |
Keywords: Social HRI, Acceptability and Trust, Design and Human Factors
Abstract: Playful deception, a common feature in human social interactions, remains underexplored in Human-Robot Interaction (HRI). Inspired by the Turkish Ice Cream (TIC) vendor routine, we investigate how bounded, culturally familiar forms of deception influence user trust, enjoyment, engagement, and willingness-to-pay during robotic handovers. We design a robotic manipulator equipped with a custom end-effector and implement five TIC-inspired trick policies that deceptively delay the handover of an ice cream-shaped object. Through a mixed-design user study with 91 participants, we evaluate the effects of playful deception and interaction duration on user experience. Results reveal that TIC-inspired deception significantly enhances enjoyment and engagement, though reduces perceived safety and trust, suggesting a structured trade-off across the multi-dimensional aspects. Our findings demonstrate that playful deception can be a valuable design strategy for interactive robots in entertainment and engagement-focused contexts, while underscoring the importance of deliberate consideration of its complex trade-offs. Videos and user study snapshots are available on https://hyeonseong-kim98.github.io/turkish-ice-cream-robot/
|
| |
| 15:00-16:30, Paper ThI2I.127 | Add to My Program |
| Investigation of Multiple Buoyancy Controller Equipped Underwater Glider Robot Modeling for Control System Development and Gliding Simulation |
|
| Canete, Luis | University of San Carlos |
| Paquibot, Jun Niel | University of San Carlos |
| Matsumoto, Masaharu | Fukushima University |
| Takahashi, Takayuki | Fukushima University |
Keywords: Environment Monitoring and Management, Marine Robotics
Abstract: This paper discusses the development of a model for an underwater glider robot equipped with multiple buoyancy controllers aimed at environmental surveying of lakes. The primary target of the model is 2D simulation of actual gliding, which will lead to control system development. This proves to be a challenge, as gliding requires calculation of hydrodynamic forces in the medium, in this case water, which typically involves Computational Fluid Dynamics (CFD). Although CFD is a well-established technique, it is expensive in terms of both computational resources and time. Instead, an approach that combines CFD with Euler-Lagrange equations is proposed and undertaken. The proposed underwater glider, the derivation of the model, the architecture of the simulation, and preliminary simulation results compared against actual gliding results are presented.
|
| |
| 15:00-16:30, Paper ThI2I.128 | Add to My Program |
| Do What You Say: Steering Vision-Language-Action Models Via Runtime Reasoning-Action Alignment Verification |
|
| Wu, Yilin | Carnegie Mellon University |
| Li, Anqi | NVIDIA |
| Hermans, Tucker | University of Utah |
| Ramos, Fabio | University of Sydney, NVIDIA |
| Bajcsy, Andrea | Carnegie Mellon University |
| Pérez-D'Arpino, Claudia | NVIDIA |
Keywords: Imitation Learning, Deep Learning Methods, Big Data in Robotics and Automation
Abstract: Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions, an approach inspired by Chain-of-Thought (CoT) reasoning in language models. Yet even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution (OOD) scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment. Given a reasoning VLA’s intermediate textual plan, our framework samples multiple candidate action sequences from the same model, predicts their outcomes via simulation, and uses a pre-trained Vision-Language Model (VLM) to select the sequence whose outcome best aligns with the VLA’s own textual plan. Only executing action sequences that align with the textual reasoning turns our base VLA’s natural action diversity from a source of error into a strength, boosting robustness to semantic and visual OOD perturbations and enabling novel behavior composition without costly re-training. We also contribute a reasoning-annotated extension of LIBERO-100, environment variations tailored for OOD evaluation, and demonstrate up to 15% performance gain over prior work on behavior composition tasks. The overall framework scales with compute (347ms at K = 10 samples) and data diversity. Project Website at: https://yilin-wu98.github.io/steering-reasoning-vla
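The runtime verification loop can be summarized as best-of-K selection against the model's own plan. In the sketch below, sample_actions, simulate_outcome, and alignment_score are invented stand-ins for the VLA sampler, the outcome simulator, and the VLM scorer; none of these names come from the paper.

```python
import random

# Best-of-K reasoning-action alignment: sample K candidate action
# sequences, predict each outcome, and execute only the candidate whose
# predicted outcome best matches the textual plan.

def steer(plan_text, obs, sample_actions, simulate_outcome,
          alignment_score, k=10):
    best_score, best_actions = float("-inf"), None
    for _ in range(k):
        actions = sample_actions(obs)             # diverse candidates
        outcome = simulate_outcome(obs, actions)  # predicted end state
        score = alignment_score(plan_text, outcome)
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions, best_score

# Toy stand-ins so the loop runs end to end.
random.seed(0)
acts, score = steer(
    "place the cube on the plate", obs=None,
    sample_actions=lambda obs: [random.uniform(-1, 1) for _ in range(4)],
    simulate_outcome=lambda obs, a: sum(a),       # fake outcome summary
    alignment_score=lambda text, out: -abs(out),  # fake VLM score
)
print(acts, score)
```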
|
| |
| 15:00-16:30, Paper ThI2I.129 | Add to My Program |
| IRPFuzz: Fuzzing Industrial Robot Protocol Via LLM-Driven Traffic Semantic Analysis |
|
| Xi, Laile | Institute of Information Engineering, CAS; School of Cyber Security, University of Chinese Academy of Sciences |
| Lin, Weicheng | Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences |
| Yang, Zhang | Beijing University of Posts and Telecommunications |
| Lin, Shenghao | Institute of Information Engineering, Chinese Academy of Sciences |
| Sun, Hao | UCAS |
| Ren, Yimo | Institute of Information Engineering |
| Zhu, Hongsong | Institute of Information Engineering, CAS; School of Cyber Security, University of Chinese Academy of Sciences |
|
|
| |
| 15:00-16:30, Paper ThI2I.130 | Add to My Program |
| TriCoSphere: A High-Dexterity, Large-Volume, 3-Finger Coaxial Spherical Manipulator |
|
| Hu, Runze | ShanghaiTech University |
| Zhu, Zhengying | ShanghaiTech University |
| Li, Jinyu | Shanghai Fourier Intelligent Technology Co., Ltd |
| Leng, Yatao | ShanghaiTech University |
| Liu, Jingshuai | Shanghai Fourier Intelligent Technology Co., Ltd |
| Xiao, Chenxi | ShanghaiTech University |
Keywords: Grippers and Other End-Effectors, Parallel Robots
Abstract: Designing robotic manipulators often requires balancing dexterity, speed, and payload capacity. While traditional serial-link and cable-driven manipulators offer high dexterity, they struggle to concurrently achieve high speed and often lack the strength and stiffness required for many applications. To address these limitations, we present TriCoSphere, a novel 12-degree-of-freedom, three-fingered manipulator designed to optimize all three attributes. Each 4-DoF finger employs a Coaxial Spherical Parallel Mechanism (CSPM), which positions all actuators at the base. This parallel architecture minimizes finger inertia for high-speed motion and distributes loads across multiple linkages, enhancing payload capacity. We provide a complete kinematic analysis and develop an efficient inverse kinematics solver for precise fingertip control. Experiments demonstrate that each finger can support a 4.1 kg payload and achieve a motion bandwidth of 6.5 Hz. The manipulator’s grasp range and dexterity are showcased by handling objects from a 20 mm sphere to a 300 mm acrylic ball, as well as performing complex in-hand manipulation tasks. TriCoSphere is cost-effective, robust, and open-sourced to support future research.
|
| |
| 15:00-16:30, Paper ThI2I.131 | Add to My Program |
| Dfrnet-H: Dynamic Feature Refinement Network with Heterogeneous Kernels and Weighted Fusion for Highway Monitoring |
|
| Liu, Xu | Inner Mongolia University |
| Han, Wei | Inner Mongolia University |
| Ma, Ming | Inner Mongolia University |
| Batu, Siren | Inner Mongolia University |
|
|
| |
| 15:00-16:30, Paper ThI2I.132 | Add to My Program |
| Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization |
|
| Yang, Yanting | Rutgers University |
| Gao, Shenyuan | Hong Kong University of Science and Technology |
| Bu, Qingwen | The University of Hong Kong |
| Chen, Li | The University of Hong Kong |
| Metaxas, Dimitris N. | Rutgers University |
Keywords: Manipulation Planning, Deep Learning in Grasping and Manipulation, Autonomous Agents
Abstract: Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate it. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while significantly reducing inference time by 56.5%.
|
| |
| 15:00-16:30, Paper ThI2I.133 | Add to My Program |
| MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose |
|
| Bhouri, Sirine | Imperial College London |
| Wei, Lan | Imperial College London |
| Zheng, Jian-Qing | University of Oxford |
| Zhang, Dandan | Imperial College London |
Keywords: Force and Tactile Sensing, Data Sets for Robot Learning, Sensor Fusion
Abstract: Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
|
| |
| 15:00-16:30, Paper ThI2I.134 | Add to My Program |
| Learning to Control the Whole-Body Shape of a Soft Robotic Arm in Unknown Situations |
|
| Tang, Zhiqiang | Southeast University |
| Wang, Qianqian | Southeast University |
| Rus, Daniela | MIT |
| Laschi, Cecilia | National University of Singapore |
Keywords: Modeling, Control, and Learning for Soft Robots, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: Control of soft robots is considered one of the key elements in achieving their intelligence. However, it faces challenging problems such as nonlinear dynamics, highly deformable structures, and operation in unpredictable situations. Numerous methods have been proposed to overcome these challenges, but most of them focus on controlling only a small part of the soft robot's body, such as the end effector. Whole-body shape control is a problem that has not yet been fully explored, but it is critical for tasks that require whole-body path planning to navigate in confined or crowded spaces. In this study, we developed a convolutional neural network (CNN)-based approach for controlling the robot's whole-body shape. The key novelty of our approach is that it learns a purely image-driven CNN control policy with online adaptive capability. Our approach has three main components: (1) training an offline shape policy to offer basic actions, (2) building a shape model and updating it online to maintain accuracy, (3) conducting Bayesian optimization based on the basic action and shape model to obtain optimal performance. The presented approach is validated on a soft robotic arm and experimental results demonstrate that the soft arm can be controlled to achieve target shapes and adapt to different previously unknown situations. Meanwhile, our approach achieved better shape control performance than the state-of-the-art method. Overall, this work presents a feasible learning-based approach to the whole-body shape control problem and contributes to the development of soft robot intelligence from the control perspective.
|
| |
| 15:00-16:30, Paper ThI2I.135 | Add to My Program |
| EgoFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for End-To-End Self-Driving |
|
| Su, Haisheng | Shanghai Jiao Tong University |
| Wu, Wei | SenseAuto |
| Yang, Zhenjie | Shanghai Jiao Tong University |
| Guan, Isabel | The Hong Kong University of Science and Technology |
Keywords: Autonomous Agents, Autonomous Vehicle Navigation, Motion and Path Planning
Abstract: Current End-to-End Autonomous Driving (E2E-AD) methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized with a fully differentiable framework in a planning-oriented manner, existing end-to-end driving systems lacking ego-centric designs still suffer from unsatisfactory performance and inferior efficiency, due to rasterized scene representation learning and redundant information transmission. In this paper, we propose an ego-centric fully sparse paradigm, named EgoFSD, for end-to-end self-driving. Specifically, EgoFSD consists of sparse perception, hierarchical interaction and iterative motion planner modules. The sparse perception module performs detection and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In-Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and the ego-vehicle are considered for joint motion prediction, where the output multi-modal ego-trajectories are optimized in an iterative fashion. In addition, position-level motion diffusion and trajectory-level planning denoising are introduced for uncertainty modeling, thereby enhancing training stability and convergence speed. Extensive experiments conducted on the nuScenes and Bench2Drive datasets show that EgoFSD reduces the average L2 error by 59% and the collision rate by 92% compared with UniAD, while achieving 6.9X faster running efficiency.
|
| |
| 15:00-16:30, Paper ThI2I.136 | Add to My Program |
| ST-DiffPlanner: A Safety-Enhanced Topology-Aware Diffusion Planner for Global Path Planning |
|
| Yan, Jiaquan | Beijing University of Posts and Telecommunications |
| Zhao, Fang | Beijing University of Posts and Telecommunications |
| Yuan, Huiyu | Beijing University of Posts and Telecommunications |
| Chen, Yushi | Beijing University of Posts and Telecommunications |
| Wang, Long | Beijing University of Posts and Telecommunications |
| Luo, Dan | Beijing Forestry University |
| Luo, Haiyong | Institute of Computing Technology, Chinese Academy of Sciences |
Keywords: Motion and Path Planning, Collision Avoidance
Abstract: In complex environments, traditional path planning methods rely on manually defined models, requiring tedious adjustments under varying scenarios or constraints. They also suffer from unstable time overhead and exponentially increasing computational costs as environmental complexity grows. Deep learning-enhanced methods, while optimizing decisions via neural networks, remain constrained by explicit search/sampling frameworks, which leads to unstable real-time performance and failure to capture real-world trajectory distributions. In contrast, diffusion-based planning directly learns trajectory distributions from data, offering predictable inference latency via fixed inversion steps and inherent support for multimodal solutions. However, its lack of explicit safety constraints often leads to trajectory safety issues, resulting in planning failures. To address these limitations, this paper proposes ST-DiffPlanner, a global path planner following the pipeline of "topology cognition, then direction focusing, then trajectory generation". It introduces three targeted optimizations: (1) leveraging topological awareness to constrain the diffusion model to focus on collision-free regions; (2) optimizing inference-phase projection to ensure trajectory continuity and safe distances from obstacles; (3) designing a topology anchor-based safety loss to enhance model safety and training stability. Experimental results demonstrate that ST-DiffPlanner exhibits strong generalization across multiple scenarios and modalities, accurately capturing environmental features and learning task-compliant trajectory characteristics. Our method achieves an average trajectory generation success rate of 96.9%, significantly outperforming baseline methods. Moreover, validation on both simulated and real-world robot platforms confirms its applicability across different systems.
|
| |
| 15:00-16:30, Paper ThI2I.137 | Add to My Program |
| T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation |
|
| Feng, Jingkun | Delft University of Technology |
| Sabzevari, Reza | Technical University of Delft |
Keywords: Semantic Scene Understanding, Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization
Abstract: Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.
|
| |
| 15:00-16:30, Paper ThI2I.138 | Add to My Program |
| Enhancing Vision-Based Policies with Omni-View and Cross-Modality Knowledge Distillation for Mobile Robots |
|
| Li, Kai | Zhejiang University, Westlake University |
| Zhao, Shiyu | Westlake University |
Keywords: RGB-D Perception, Visual Learning, Representation Learning
Abstract: Vision-based policies are widely applied in robotics for tasks such as manipulation and locomotion. On lightweight mobile robots, however, they face a trilemma of limited scene transferability, restricted onboard computation resources, and sensor hardware cost. To address these issues, we propose a knowledge distillation approach that transfers knowledge from an information-rich, appearance-invariant omni-view depth policy to a lightweight monocular policy. The key idea is to train the student not only to mimic the expert’s actions but also to align with the latent embeddings of the omni-view depth teacher. Experiments demonstrate that omni-view and depth inputs improve the scene transfer and navigation performance, and that the proposed distillation method enhances the performance of a single-view monocular policy, compared with policies solely imitating actions. Real-world experiments further validate the effectiveness and practicality of our approach. Code will be released publicly.
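The two-term objective implied by the abstract, imitate the teacher's actions while aligning with its latent embedding, might look like the following; shapes and the weighting beta are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np

# Cross-modality distillation loss sketch: the monocular student imitates
# the omni-view depth teacher's actions and is also pulled toward the
# teacher's latent embedding.

def distill_loss(student_action, teacher_action,
                 student_latent, teacher_latent, beta=0.5):
    action_term = np.mean((student_action - teacher_action) ** 2)
    latent_term = np.mean((student_latent - teacher_latent) ** 2)
    return action_term + beta * latent_term

rng = np.random.default_rng(0)
print(distill_loss(rng.normal(size=4), rng.normal(size=4),
                   rng.normal(size=64), rng.normal(size=64)))
```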
|
| |
| 15:00-16:30, Paper ThI2I.139 | Add to My Program |
| A Novel Paddling Propulsion Gait for a Wheel-Legged Robot on Sand Terrain |
|
| Tsui, En-Chieh | National Taiwan University |
| Chen, Wei-Ting | National Taiwan University |
| Yu, Wei-Shun | National Taiwan University |
| Chen, Hung-Hsin | Delta Electronics Inc |
| Chien, Shaoyu | Delta Electronics Inc |
| Lin, Pei-Chun | National Taiwan University |
Keywords: Wheeled Robots, Legged Robots, Field Robots
Abstract: Autonomous Mobile Robots are typically limited to structured environments, as conventional wheeled propulsion often fails on deformable terrains like sand due to excessive wheel slip and sinkage. To address this mobility challenge, this paper introduces a novel locomotion strategy for a high-degree-of-freedom wheel-legged robot. The proposed method is a gait based on an asymmetric "dynamic compact-and-push" cycle, where the robot's limbs perform a paddling-like motion to actively remodel the granular media. This active terrain remodeling allows the robot to generate net forward thrust where conventional wheeled locomotion is ineffective. We systematically designed and experimentally validated four distinct gaits founded on this principle. The results demonstrate that this approach enables sustained forward motion in an environment where wheeled propulsion is verified to fail, with the asynchronous paddling gait proving most effective. This work contributes a new, validated locomotion mechanism for sand terrains and provides a quantitative comparison of different limb coordination strategies.
|
| |
| 15:00-16:30, Paper ThI2I.140 | Add to My Program |
| DB-TSDF: Directional Bitmask-Based Truncated Signed Distance Fields for Efficient Volumetric Mapping |
|
| Maese, Jose Enrique | Universidad Pablo De Olavide |
| Merino, Luis | Universidad Pablo De Olavide |
| Caballero, Fernando | Universidad Pablo De Olavide |
Keywords: Mapping
Abstract: This paper presents a high-efficiency, CPU-only volumetric mapping framework based on a Truncated Signed Distance Field (TSDF). The system incrementally fuses raw LiDAR point-cloud data into a voxel grid using a directional bitmask-based integration scheme, producing dense and consistent TSDF representations suitable for real-time 3D reconstruction. A key feature of the approach is that the processing time per point-cloud remains constant, regardless of the voxel grid resolution, enabling high resolution mapping without sacrificing runtime performance. In contrast to most recent TSDF/ESDF methods that rely on GPU acceleration, our method operates entirely on CPU, achieving competitive results in speed. Experiments on real-world open datasets demonstrate that the generated maps attain accuracy on par with contemporary mapping techniques.
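As background for readers, a generic TSDF point integration step is sketched below with a dense array standing in for the paper's directional bitmask structure; the voxel size, truncation distance, and averaging scheme are illustrative assumptions, not the DB-TSDF design.

```python
import numpy as np

# Generic TSDF fusion: each LiDAR point updates voxels near the surface
# along its ray with a truncated signed distance and a running weighted
# average (a dense-array stand-in, not the directional bitmask structure).

def integrate_point(tsdf, weight, origin, point, voxel=0.1, trunc=0.3):
    d = point - origin
    depth = np.linalg.norm(d)
    ray = d / depth
    # sample the ray inside the truncation band around the hit point
    for t in np.arange(depth - trunc, depth + trunc, voxel / 2):
        if t <= 0:
            continue
        idx = tuple(((origin + t * ray) / voxel).astype(int))
        if not all(0 <= i < s for i, s in zip(idx, tsdf.shape)):
            continue
        sdf = np.clip(depth - t, -trunc, trunc)  # signed distance to surface
        w = weight[idx]
        tsdf[idx] = (tsdf[idx] * w + sdf) / (w + 1.0)
        weight[idx] += 1.0

tsdf = np.zeros((64, 64, 64))
weight = np.zeros_like(tsdf)
integrate_point(tsdf, weight, origin=np.zeros(3),
                point=np.array([3.0, 1.0, 0.5]))
print(np.count_nonzero(weight), "voxels touched")
```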
|
| |
| 15:00-16:30, Paper ThI2I.141 | Add to My Program |
| M²G-Net: Multimodal Mutual-Guidance Network for LiDAR Depth and Intensity Completion |
|
| Choi, Donghyun | Korea Advanced Institute of Science and Technology |
| Lee, Sangmin | Korea Advanced Institute of Science and Technology |
| Ryu, Jee-Hwan | Korea Advanced Institute of Science and Technology |
Keywords: Deep Learning for Visual Perception
Abstract: Autonomous driving has rapidly advanced with diverse sensors, especially Light Detection and Ranging (LiDAR), which provides precise geometry for tasks like simultaneous localization and mapping (SLAM). Recently, the performance of LiDAR-based SLAM has improved through studies leveraging intensity as a complementary cue to depth. However, in urban environments, dynamic objects occlude static scenes, degrading the stability and accuracy of LiDAR-based SLAM. While previous studies have focused mainly on completing occluded depth, they often disregard intensity, assuming it to be less critical or difficult to estimate due to inherent noise. This overlooks the strong complementary relationship between the two modalities, which can be exploited for effective multimodal completion. Furthermore, completing intensity alongside depth enables broader applicability to intensity-aware perception tasks. To address this issue, a Multimodal Mutual-Guidance (M²G) module is proposed for the joint completion of occluded depth and intensity in LiDAR data. M²G is integrated into a deep learning-based network that takes projected range and intensity images as input, enabling progressive cross-modal feature interaction. Leveraging the shared origin of LiDAR depth and intensity, M²G balances noisy intensity and smooth depth via attention and structure-aware guidance. Experimental results demonstrate that the proposed method outperforms existing inpainting and depth completion approaches, validating its effectiveness for LiDAR completion.
|
| |
| 15:00-16:30, Paper ThI2I.142 | Add to My Program |
| KiRAS: Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning in Quadruped Robots |
|
| Wei, Xiaoyi | Fudan University |
| Zhai, Peng | Fudan University |
| Tu, Jiaxin | Fudan University |
| Zhang, Yueqi | Fudan University |
| Li, Yuqi | Fudan University |
| Zhang, Zonghao | Fudan University |
| Zhou, Hu | Power China Huadong Engineering Corporation Limited |
| Zhang, Lihua | Fudan University |
Keywords: Legged Robots, Imitation Learning
Abstract: With advances in reinforcement learning and imitation learning, quadruped robots can acquire diverse skills within a single policy by imitating multiple skill-specific datasets. However, the lack of datasets on complex terrains limits the ability of such multi-skill policies to generalize effectively in unstructured environments. Inspired by animation, we adopt keyframes as minimal and universal skill representations, relaxing dataset constraints and enabling the integration of terrain adaptability with skill diversity. We propose Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning (KiRAS), an end-to-end framework for acquiring and transitioning between diverse skill primitives on complex terrains. KiRAS first learns diverse skills on flat terrain through keyframe-guided self-imitation, eliminating the need for expert datasets; then continues training the same policy network on rough terrains to enhance robustness. To eliminate catastrophic forgetting, a proficiency-based Skill Initialization Technique is introduced. Experiments on Solo-8 and Unitree Go1 robots show that KiRAS enables robust skill acquisition and smooth transitions across challenging terrains. This framework demonstrates its potential as a lightweight platform for multi-skill generation and dataset collection. It further enables flexible skill transitions that enhance locomotion on challenging terrains.
|
| |
| 15:00-16:30, Paper ThI2I.143 | Add to My Program |
| SEM: Enhancing Spatial Understanding for Robust Robot Manipulation |
|
| Lin, Xuewu | Horizon Robotics |
| Lin, Tianwei | Horizon Robotics |
| Du, Yun | Horizon Robotics |
| Li, Jitao | Horizon Robotics |
| Xie, Hongyu | Horizon Robotics |
| Jin, Yiwei | Horizon Robotics |
| Huang, Lichao | Horizon Robotics Inc |
| Su, Zhizhong | Horizon Robotics |
Keywords: Deep Learning in Grasping and Manipulation, Dual Arm Manipulation, Dexterous Manipulation
Abstract: A key challenge in robot manipulation lies in developing policy models with consistent spatial understanding—the ability to reason about 3D geometry, object relations, and robot state. Existing mainstream models take 2D images as input, without performing explicit 3D modeling, and thus lack spatial understanding capabilities as well as 3D and embodiment generalization. To address this, we propose SEM (Spatial Enhanced Manipulation), a diffusion-based policy framework that constructs a unified spatial representation by projecting multi-view image features and joint-centric robot states into a unified 3D space. This spatially aligned representation provides a consistent feature space for the diffusion policy to condition on during action generation. Extensive experiments demonstrate that SEM significantly improves spatial understanding, leading to robust and generalizable manipulation across diverse tasks that outperform existing baselines.
|
| |
| 15:00-16:30, Paper ThI2I.144 | Add to My Program |
| SURF-Loco: Mastering Complex Industrial Terrains with 3D Surfel-Based Reinforcement Learning for Legged Robots |
|
| He, Bailin | Fudan University |
| Zhao, Xiting | Fudan University, Lenovo CTO Organization |
| Sun, Qiao | Lenovo CTO Organization |
| Hu, Xiaoyi | Lenovo CTO Organization |
| Liu, Haojie | Lenovo CTO Organization |
| Zhong, Jiangwei | Lenovo CTO Organization |
| Zhang, Wenqiang | Fudan University |
Keywords: Legged Robots, Reinforcement Learning, Sensor-based Control
Abstract: Legged robots offer significant potential for navigating complex industrial terrains, but their capabilities are often constrained by perception systems struggling to interpret intricate 3D geometry. Conventional 2D/2.5D representations like depth or elevation maps fail to capture complex 3D geometry, leading to unsafe locomotion. This paper presents SURF-Loco, a novel framework that enables robust legged locomotion by learning directly from a 3D surfel-based model. Our approach uses surfels to create an omnidirectional representation that explicitly encodes the geometric properties necessary for stable locomotion. We integrate this structured 3D representation into an end-to-end Mixture-of-Experts (MoE) reinforcement learning policy. A variational autoencoder (VAE) distills the complex 3D surroundings into a compact latent context. This geometric context enables a gating network to dynamically select expert sub-policies for agile, context-aware actions. We validate our method on hexapod robots, achieving robust zero-shot sim-to-real transfer on a variety of challenging industrial obstacles.
|
| |
| 15:00-16:30, Paper ThI2I.145 | Add to My Program |
| Finding an Initial Probe Pose in Teleoperated Robotic Echocardiography Via 2D LiDAR-Based 3D Reconstruction |
|
| Roshan, Mariadas Capsran | Swinburne University of Technology |
| Hidalgo Florez, Edgar Mauricio | Swinburne University of Technology |
| Isaksson, Mats | Swinburne University of Technology |
| Dunn, Michelle | Swinburne University of Technology |
| Pyaraka, Jagannatha Charjee | Swinburne University of Technology |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics, Robotics in Under-Resourced Settings
Abstract: Echocardiography is a key imaging modality for cardiac assessment but remains highly operator-dependent, and access to trained sonographers is limited in underserved settings. Teleoperated robotic echocardiography has been proposed as a solution. However, clinical studies report longer examination times than manual procedures, increasing diagnostic delays and operator workload. Automating non-expert tasks, such as automatically moving the probe to an ideal starting pose, offers a pathway to reduce this burden. Prior vision- and depth-based approaches to estimate an initial probe pose are sensitive to lighting, texture, and anatomical variability. We propose a robot-mounted 2D LiDAR-based approach that reconstructs the chest surface in 3D and estimates the initial probe pose automatically. To the best of our knowledge, this is the first demonstration of robot-mounted 2D LiDAR used for 3D reconstruction of a human body surface. Through plane-based extrinsic calibration, the transformation between the LiDAR and robot base frames was estimated with an overall root mean square (RMS) residual of 1.82 mm and rotational uncertainty below 0.2°. The chest front surface, reconstructed from two linear LiDAR sweeps, was aligned with scale-augmented rigid registration to identify an initial probe pose. A mannequin-based study assessing reconstruction accuracy showed mean surface errors of 2.78 ± 0.21 mm. Human trials (N=5) evaluating the proposed approach found initial probe points typically 20–30 mm from the clinically defined initial point, while the variation across repeated trials on the same subject was less than 4 mm.
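The scale-augmented rigid registration used to align the reconstructed chest surface has a standard closed-form (Umeyama-style) solution, sketched below on synthetic points; this is a generic illustration under assumed noise-free correspondences, not the authors' pipeline.

```python
import numpy as np

# Umeyama-style similarity registration: recover scale s, rotation R, and
# translation t such that dst ~ s * R @ src + t, from paired point sets.

def umeyama(src, dst):
    mu_s, mu_d = src.mean(0), dst.mean(0)
    A, B = src - mu_s, dst - mu_d
    cov = B.T @ A / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:       # avoid reflections
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (A ** 2).sum() * len(src)
    t = mu_d - s * R @ mu_s
    return s, R, t

rng = np.random.default_rng(1)
src = rng.normal(size=(100, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true *= np.linalg.det(R_true)         # force a proper rotation (det = +1)
dst = 1.3 * src @ R_true.T + np.array([0.1, -0.2, 0.5])
s, R, t = umeyama(src, dst)
print(round(s, 3), np.allclose(R, R_true, atol=1e-6))
```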
|
| |
| 15:00-16:30, Paper ThI2I.146 | Add to My Program |
| From Language to Action: Can LLM-Based Agents Be Used for Embodied Robot Cognition? |
|
| Shaji, Shinas | Fraunhofer IAIS |
| Huppertz, Fabian | Hochschule Bonn-Rhein-Sieg |
| Mitrevski, Alex | Chalmers University of Technology |
| Houben, Sebastian | Hochschule Bonn-Rhein-Sieg |
Keywords: Cognitive Control Architectures, Agent-Based Systems, AI-Enabled Robotics
Abstract: In order to flexibly act in an everyday environment, a robotic agent needs a variety of cognitive capabilities that enable it to reason about plans and perform execution recovery. Large language models (LLMs) have been shown to demonstrate emergent cognitive aspects, such as reasoning and language understanding; however, the ability to control embodied robotic agents requires reliably bridging high-level language to low-level functionalities for perception and control. In this paper, we investigate the extent to which an LLM can serve as a core component for planning and execution reasoning in a cognitive robot architecture. For this purpose, we propose a cognitive architecture in which an agentic LLM serves as the core component for planning and reasoning, while components for working and episodic memories support learning from experience and adaptation. An instance of the architecture is then used to control a mobile manipulator in a simulated household environment, where environment interaction is done through a set of high-level tools for perception, reasoning, navigation, grasping, and placement, all of which are made available to the LLM-based agent. We evaluate our proposed system on two household tasks (object placement and object swapping), which evaluate the agent's reasoning, planning, and memory utilisation. The results demonstrate that the LLM-driven agent can complete structured tasks and exhibits emergent adaptation and memory-guided planning, but also reveal significant limitations, such as hallucinations about task success and poor instruction following, for instance refusing to acknowledge and complete sequential tasks. These findings highlight both the potential and challenges of employing LLMs as embodied cognitive controllers for autonomous robots.
|
| |
| 15:00-16:30, Paper ThI2I.147 | Add to My Program |
| Open-Vocabulary Object-Goal Navigation by Generalizing Semantic Mapping with Dense CLIP |
|
| Wei, Meng | The University of Hong Kong |
| Wan, Chenyang | Zhejiang University |
| Wang, Tai | Shanghai AI Laboratory |
| Yang, Yuqiang | South China University of Technology |
| Cai, Wenzhe | Southeast University |
| Chen, Yilun | The Chinese University of Hong Kong |
| Wang, Hanqing | Beijing Institute of Technology |
| Pang, Jiangmiao | Shanghai AI Laboratory |
| Liu, Xihui | The University of Hong Kong |
Keywords: Vision-Based Navigation, RGB-D Perception, Deep Learning for Visual Perception
Abstract: Object-oriented embodied navigation tasks require agents to locate specific objects, defined either by category or by images, in unseen environments. While recent methods have made progress in extending closed-set models to open-vocabulary scenarios with foundation models, they typically rely on training-free large language models (LLMs) or finetuning with end-to-end reinforcement learning (RL). However, they face challenges in efficiency (e.g., the overhead and cost of LLM inference) and limited generalization from intensive RL training. In this paper, we propose OVExp, a training-efficient framework for open-vocabulary exploration. We make the first effort to demonstrate the generalization capabilities of semantic map-based goal prediction networks using Dense CLIP models. A major challenge is that preserving both precise point-wise object locations and generalizable visual representations in the semantic map leads to unaffordable training costs. To address this, we design a Cross-Modal Transfer on Semantic Mapping strategy that trains on text only and transfers to multi-modal semantic mapping and goals at test time. Despite relying on text-based spatial layouts with limited objects, OVExp demonstrates robust generalization to unseen targets on established ObjectNav benchmarks.
|
| |
| 15:00-16:30, Paper ThI2I.148 | Add to My Program |
| Learning Visuomotor Policy for Multi-Robot Laser Tag Game |
|
| Li, Kai | Zhejiang University, Westlake University |
| Zhao, Shiyu | Westlake University |
Keywords: Sensor-based Control, Cooperating Robots, Sensorimotor Learning
Abstract: In this paper, we study multi-robot laser tag, a simplified yet practical shooting-game-style task. Classic modular approaches to such tasks face challenges such as limited observability and reliance on depth mapping and inter-robot communication. To overcome these issues, we present an end-to-end visuomotor policy that maps images directly to robot actions. We train a high-performing teacher policy with multi-agent reinforcement learning and distill its knowledge into a vision-based student policy. Technical designs, including a permutation-invariant feature extractor and depth–heatmap input, improve performance over standard architectures. Our policy outperforms classic methods by 16.7% in hitting accuracy and 6% in collision avoidance, and is successfully deployed on real robots. Code will be released publicly.
|
| |
| 15:00-16:30, Paper ThI2I.149 | Add to My Program |
| GMM-LIO: Adaptive and Robust LiDAR-Inertial Odometry with Gaussian Mixture Model Voxel Map |
|
| Deng, Zishun | Nankai University |
| Li, Can | Nankai University |
| Lin, Wanbiao | Nankai University |
| Sun, Lei | Nankai University |
Keywords: Localization, SLAM, Mapping
Abstract: Tightly coupled LiDAR–inertial odometry (LIO) systems are critical for autonomous navigation, yet their performance often degrades due to insufficient adaptability to diverse environments and limitations in map representation. To address these limitations, this paper presents GMM-LIO, a robust and adaptive LIO framework that integrates a novel information-theoretic scan processing module and a high-fidelity Gaussian Mixture Model (GMM) voxel map structure. At its core, GMM-LIO features a two-level adaptive front-end that dynamically modulates voxel resolution based on state uncertainty and adjusts surface covariance estimation according to local point density on a standard voxel grid. Furthermore, GMM-LIO employs a dynamic Gaussian Mixture Model voxel map to accurately model intersecting surfaces. The entire system is formulated as a robust Maximum a Posteriori (MAP)-based estimator, which employs an Iteratively Reweighted Least Squares (IRLS) solver together with a principled anisotropic information matrix to handle measurement outliers. Extensive evaluations on diverse public and self-collected datasets demonstrate that GMM-LIO achieves state-of-the-art accuracy and robustness, with a 36% relative improvement over leading LIO baselines.
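The IRLS pattern at the heart of such robust MAP back-ends is compact enough to sketch. The following is a generic Huber-weighted IRLS loop on a linear problem; GMM-LIO's anisotropic information matrices and voxel-map residuals are not modeled here, so treat this as a minimal illustration of the solver pattern only.

```python
import numpy as np

def irls(A, b, delta=1.0, iters=20):
    """Iteratively Reweighted Least Squares with Huber weights, the robust
    solver pattern used in MAP estimators such as GMM-LIO (generic sketch)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        r = A @ x - b
        # Huber weights: quadratic inside |r| <= delta, linear outside
        w = np.where(np.abs(r) <= delta, 1.0,
                     delta / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return x

# line fit with one gross outlier
A = np.column_stack([np.arange(10.0), np.ones(10)])
b = 2.0 * np.arange(10.0) + 1.0
b[7] += 50.0                       # outlier
print(irls(A, b))                  # ~ [2, 1], barely moved by the outlier
```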
|
| |
| 15:00-16:30, Paper ThI2I.150 | Add to My Program |
| A Hierarchical Adaptive Controller with Configurable Joint Torque Constraints for Flexible LIMS Elbow Cable-Driven Mechanism |
|
| Liang, Bin | Tsinghua University |
| Gao, Yibo | Qiyuan Lab |
| Deng, Yang | Tsinghua University |
| Gong, Kai | Qiyuan Lab |
| Zheng, Xudong | Qiyuan Lab |
| Hou, Zhili | QiYuan Lab |
| Lu, Weining | Tsinghua University |
| Jiao, Chunting | Chongqing University |
|
|
| |
| 15:00-16:30, Paper ThI2I.151 | Add to My Program |
| Swarm-ReID: Decentralized Self-Adaptive Gallery Construction for Multi-Robot Open-World Person Re-Identification |
|
| Kaplanis, Marios | Toyota Motor Europe |
| Kegeleirs, Miquel | Université Libre De Bruxelles |
| Garattoni, Lorenzo | Toyota Motor Europe |
| Birattari, Mauro | Université Libre De Bruxelles |
| Francesca, Gianpiero | Toyota Motor Europe |
Keywords: Swarm Robotics
Abstract: Swarm perception enables a robot swarm to collectively sense and understand the environment by integrating sensory inputs from individual robots. We explore its application to person re-identification (re-id), the task of recognizing previously observed individuals. Traditional re-id systems rely on static offline galleries, which restricts their use in open-world scenarios where new identities appear over time. In robotics, most methods address single-robot re-id in person-following tasks, limiting scalability to multi-person settings, while swarm perception studies largely overlook the role of re-id algorithms. To address these gaps, we propose Swarm-ReID, an unsupervised method for decentralized swarm re-identification. Our method introduces mechanisms for robot-to-robot communication and informed movement strategies, enabling the swarm to collaboratively construct adaptive galleries online without centralized control. Simulations across diverse environments, numbers of people, swarm sizes, communication protocols, and exploration behaviors show that Swarm-ReID consistently outperforms existing swarm perception methods. Our results highlight how communication and informed movement improve recognition performance, establishing Swarm-ReID as a state-of-the-art method for open-world multi-robot person re-identification.
|
| |
| 15:00-16:30, Paper ThI2I.152 | Add to My Program |
| Comparative Analysis of Energy Transfers and Performance in Safety-Critical Control Using Control Barrier Functions |
|
| Maiani, Arturo | Sapienza University of Rome, Bambino Gesù Children's Hospital |
| Califano, Federico | University of Twente |
| Govoni, Lorenzo | Sapienza University of Rome |
| Pietrabissa, Antonio | Sapienza University of Rome |
Keywords: Performance Evaluation and Benchmarking, Collision Avoidance, Dynamics
Abstract: Control barrier functions (CBFs) are used in safety-critical control strategies, implementing a modification of a nominal control action to achieve invariance of a subset of the state space representing safe operating conditions. In this paper we perform a comparative study involving existing safety-critical CBF designs, including energy-based CBFs and Exponential CBFs. The analysis, performed both theoretically and on a benchmark obstacle avoidance task, provides insights into how these CBFs affect energy transfers and the overall performance of the closed-loop system, highlighting benefits and limitations of each approach. To validate our analysis, we conduct software simulations on a 3R planar robot and a 7-DoF robotic manipulator, complemented by experimental evaluations on a physical robotic platform.
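As a concrete reference point for the designs compared above, the canonical CBF safety filter solves a small QP that minimally modifies the nominal input; for a single affine constraint the QP has a closed-form solution. The sketch below is that generic filter, not the paper's energy-based or exponential variants.

```python
import numpy as np

def cbf_filter(u_nom, Lf_h, Lg_h, h, alpha=1.0):
    """Minimal CBF-QP safety filter (single constraint, closed form).

    Solves  min_u ||u - u_nom||^2  s.t.  Lf_h + Lg_h @ u + alpha * h >= 0.
    """
    u_nom = np.asarray(u_nom, dtype=float)
    Lg_h = np.asarray(Lg_h, dtype=float)
    slack = Lf_h + Lg_h @ u_nom + alpha * h
    if slack >= 0.0:                       # nominal input already safe
        return u_nom
    # project the nominal input onto the constraint boundary
    return u_nom - slack * Lg_h / (Lg_h @ Lg_h)

# toy example: 1-D single integrator x' = u with safe set h(x) = x >= 0
x, u_nom = 0.1, -2.0
u_safe = cbf_filter(np.array([u_nom]), Lf_h=0.0, Lg_h=np.array([1.0]), h=x)
print(u_safe)   # braked so that u >= -alpha * x, here -0.1
```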
|
| |
| 15:00-16:30, Paper ThI2I.153 | Add to My Program |
| Quality Over Quantity: Demonstration Curation Via Influence Functions for Data-Centric Robot Learning |
|
| Lee, Haeone | KAIST |
| Min, Taywon | KAIST |
| Kim, Junsu | KAIST |
| Kang, Sinjae | KAIST |
| Liu, Fangchen | University of California, Berkeley |
| Pinto, Lerrel | New York University |
| Lee, Kimin | KAIST |
Keywords: Data Sets for Robot Learning, Learning from Demonstration
Abstract: Learning from demonstrations has emerged as a promising paradigm for end-to-end robot control, particularly when scaled to diverse and large datasets. However, the quality of demonstration data, often collected through human teleoperation, remains a critical bottleneck for effective data-driven robot learning. Human errors, operational constraints, and teleoperator variability introduce noise and suboptimal behaviors, making data curation essential yet largely manual and heuristic-driven. In this work, we propose Quality over Quantity (QoQ), a grounded and systematic approach to identifying high-quality data by defining data quality as the contribution of each training sample to reducing loss on validation demonstrations. To efficiently estimate this contribution, we leverage influence functions, which quantify the impact of individual training samples on model performance. We further introduce two key techniques to adapt influence functions for robot demonstrations: (i) using maximum influence across validation samples to capture the most relevant state-action pairs, and (ii) aggregating influence scores of state-action pairs within the same trajectory to reduce noise and improve data coverage. Experiments in both simulated and real-world settings show that QoQ consistently improves policy performance over prior data selection methods.
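To make the influence-function recipe concrete, here is a hypothetical first-order sketch: influence is approximated by gradient inner products, maximized over validation samples, then aggregated per trajectory. The gradient inputs and names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def trajectory_influence(train_grads, val_grads, traj_ids):
    """First-order influence scoring in the spirit of QoQ (assumed reading).

    train_grads: (N, d) per-sample loss gradients of training state-action pairs
    val_grads:   (M, d) per-sample loss gradients of validation demonstrations
    traj_ids:    (N,)   trajectory index of each training sample
    """
    pair_influence = train_grads @ val_grads.T        # (N, M) inner products
    sample_score = pair_influence.max(axis=1)         # (i) max over validation
    traj_score = {}
    for tid in np.unique(traj_ids):                   # (ii) per-trajectory sum
        traj_score[int(tid)] = float(sample_score[traj_ids == tid].sum())
    return traj_score

rng = np.random.default_rng(0)
scores = trajectory_influence(rng.normal(size=(6, 4)),
                              rng.normal(size=(3, 4)),
                              traj_ids=np.array([0, 0, 0, 1, 1, 1]))
print(scores)   # keep the highest-scoring trajectories for training
```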
|
| |
| 15:00-16:30, Paper ThI2I.154 | Add to My Program |
| FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation |
|
| Zhao, Ruiteng | National University of Singapore |
| Wang, Wenshuo | National University of Singapore |
| Ma, Yicheng | Nanyang Technological University |
| Li, Xiaocong | Eastern Institute of Technology, Ningbo |
| Tay, Francis | NUS |
| Ang Jr, Marcelo H | National University of Singapore |
| Zhu, Haiyue | Agency for Science, Technology and Research (A*STAR) |
Keywords: Learning from Demonstration, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Force sensing is a crucial modality for Vision-Language-Action (VLA) frameworks, as it enables fine-grained perception and dexterous manipulation in contact-rich tasks. We present Force-Distilled VLA (FD-VLA), a novel framework that integrates force awareness into contact-rich manipulation without relying on physical force sensors. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token, conditioned on visual observations and robot states, into a predicted force token aligned with the latent representation of actual force signals. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning while preserving the integrity of its vision-language semantics. This design provides two key benefits: first, it allows practical deployment across a wide range of robots that lack expensive or fragile force-torque sensors, thereby reducing hardware cost and complexity; second, the FDM introduces an additional force-vision-state fusion prior to the VLM, which improves cross-modal alignment and enhances perception-action robustness in contact-rich scenarios. Surprisingly, our physical experiments show that the distilled force token outperforms direct sensor force measurements as well as other baselines, which highlights the effectiveness of this force-distilled VLA approach.
|
| |
| 15:00-16:30, Paper ThI2I.155 | Add to My Program |
| Unveiling the Impact of Data and Model Scaling on High-Level Control for Humanoid Robots |
|
| Wei, Yuxi | Shanghai Jiao Tong University |
| Wong, Ziseoi | Zhejiang University |
| Yin, Kangning | Shanghai Jiao Tong University |
| Hu, Yue | University of Michigan |
| Wang, Jingbo | Shanghai AI Lab |
| Chen, Siheng | Shanghai Jiao Tong University |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Datasets for Human Motion, Deep Learning Methods
Abstract: Data scaling has long remained a critical bottleneck in robot learning. For humanoid robots, human videos and motion data are abundant and widely available, offering a free and large-scale data source. Moreover, the semantics associated with the motions enable modality alignment and high-level robot control learning. However, how to effectively mine raw video, extract robot-learnable representations, and leverage them for scalable learning remains an open problem. To address this, we introduce Humanoid-Union, a large-scale dataset generated through an autonomous pipeline, comprising over 260 hours of diverse, high-quality humanoid robot motion data with semantic annotations derived from human motion videos. The dataset can be further expanded via the same pipeline. Building on this data resource, we propose SCHUR, a scalable learning framework designed to explore the impact of large-scale data on high-level control in humanoid robots. Experimental results demonstrate that SCHUR achieves high robot motion generation quality and strong text-motion alignment under data and model scaling, with 37% reconstruction improvement under MPJPE and 25% alignment improvement under FID compared with previous methods. Its effectiveness is further validated through deployment on a real-world humanoid robot.
|
| |
| 15:00-16:30, Paper ThI2I.156 | Add to My Program |
| Onion-LO++: An Adaptive and Degradation Resistant Continuous-Time LiDAR Odometry |
|
| Cheng, Xiaolong | Southeast University |
| Sun, Ye | Southeast University |
| Geng, Keke | Southeast University |
| Ma, Tianxiao | Southeast University |
| Liu, Zhichao | Southeast University |
Keywords: SLAM, Field Robots
Abstract: In an era dominated by multi-sensor fusion, this paper explores the operational limits of LiDAR-only odometry. We introduce Onion-LO++, which is designed to overcome two practical limitations of Onion-LO: poor performance in geometrically degenerate environments and instability under high-motion conditions. In order to mitigate point cloud degradation, we propose a coarse-to-fine point cloud segmentation approach that extracts intensity and weak corner features from planar regions, while dynamically adjusting the downsampling rate based on the proportion of planar points to maximize geometric constraints. To handle high-motion scenarios, we integrate a continuous-time trajectory model into the backend optimization and introduce an adaptive onion factor that adjusts optimization parameters in real time. Extensive experiments on five challenging public datasets demonstrate that Onion-LO++ outperforms state-of-the-art methods and operates reliably across narrow spaces, degenerate scenes, high-speed motion, and high-altitude aerial mapping. We open-source the code on GitHub.
|
| |
| 15:00-16:30, Paper ThI2I.157 | Add to My Program |
| M2GRPO: Mamba-Based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit |
|
| Feng, Yukai | University of Chinese Academy of Sciences |
| Wu, Zhiheng | Baidu Inc., Beijing 100085, China |
| Wu, Zhengxing | Chinese Academy of Sciences |
| Gu, Junwen | Institute of Automation, Chinese Academy of Sciences |
| Yu, Junzhi | Peking University |
| Tan, Min | Institute of Automation, Chinese Academy of Sciences |
Keywords: Biologically-Inspired Robots, Reinforcement Learning, Cooperating Robots
Abstract: Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M2GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M2GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.
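The group-relative advantage computation described above reduces, in its simplest reading, to normalizing per-agent returns within a rollout group instead of learning a critic. A minimal sketch, assuming returns have already been summed per episode:

```python
import numpy as np

def group_relative_advantages(returns):
    """GRPO-style advantages: normalize episode returns across the agent
    group rather than estimating a value function (assumed simplification
    of M2GRPO's per-episode, cross-agent normalization).

    returns: (num_agents,) per-agent episode returns from one rollout group.
    """
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + 1e-8)

print(group_relative_advantages([1.0, 3.0, 2.0]))  # zero-mean, unit-scale
```

The appeal, as the abstract notes, is resource efficiency: no critic network has to be trained, so the policy update only needs the group's relative ranking.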
|
| |
| 15:00-16:30, Paper ThI2I.158 | Add to My Program |
| ABPolicy: Asynchronous B‑Spline Flow Policy for Real‑Time and Smooth Robotic Manipulation |
|
| Yang, Fan | Tianjin University |
| Jing, Peiguang | Tianjin University |
| Qu, Kaihua | Tianjin University |
| Zhao, Ningyuan | Tianjin University |
| Su, Yuting | Tianjin University |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: Robotic manipulation requires policies that are smooth and responsive to evolving observations. However, synchronous inference in the raw action space introduces several challenges, including intra-chunk jitter, inter-chunk discontinuities, and stop-and-go execution. These issues undermine a policy's smoothness and its responsiveness to environmental changes. We propose ABPolicy, an asynchronous flow-matching policy that operates in a B-spline control-point action space. First, the B-spline representation ensures intra-chunk smoothness. Second, we introduce bidirectional action prediction coupled with refitting optimization to enforce inter-chunk continuity. Finally, by leveraging asynchronous inference, ABPolicy delivers real-time, continuous updates. We evaluate ABPolicy across seven tasks encompassing both static settings and dynamic settings with moving objects. Empirical results indicate that ABPolicy reduces trajectory jerk, leading to smoother motion and improved performance. Project website: https://teee000.github.io/ABPolicy/.
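A small example of why a B-spline control-point action space gives intra-chunk smoothness: a clamped B-spline decoded from a handful of control points is smooth by construction, so the policy only has to predict the control points. The decoding below uses SciPy; ABPolicy's exact knot layout, degree, and chunk length are assumptions here.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_chunk(control_points, degree=3, n_steps=50):
    """Decode an action chunk from B-spline control points (generic sketch).

    A clamped knot vector makes the trajectory start and end exactly at the
    first and last control points; the curve is C^(degree-1) smooth inside.
    """
    cps = np.asarray(control_points, dtype=float)
    n = len(cps)
    # clamped uniform knot vector of length n + degree + 1
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0.0, 1.0, n - degree + 1),
                            np.ones(degree)])
    spline = BSpline(knots, cps, degree)
    t = np.linspace(0.0, 1.0, n_steps)
    return spline(t)                      # (n_steps, action_dim) trajectory

traj = bspline_chunk(np.array([[0.0, 0.0], [0.2, 0.5], [0.8, 0.4], [1.0, 1.0]]))
print(traj[0], traj[-1])   # endpoints match the first/last control points
```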
|
| |
| 15:00-16:30, Paper ThI2I.159 | Add to My Program |
| Sample-Efficient Learning with Online Expert Correction for Autonomous Catheter Steering in Endovascular Bifurcation Navigation |
|
| Wang, Hao | Tongji University |
| Yao, Tianliang | The Chinese University of Hong Kong (CUHK) |
| Lu, Bo | Soochow University |
| Pei, Zhiqiang | University of Shanghai for Science and Technology |
| Liu, Dong | Shanghai Aopeng Medical Technology Co., Ltd |
| Ma, Lei | Tongji University |
| Qi, Peng | Tongji University |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems, Modeling, Control, and Learning for Soft Robots
Abstract: Robot-assisted endovascular intervention offers a safe and effective solution for remote catheter manipulation, reducing radiation exposure while enabling precise navigation. Reinforcement learning (RL) has recently emerged as a promising approach for autonomous catheter steering; however, conventional methods suffer from sparse reward design and reliance on static vascular models, limiting their sample efficiency and generalization to intraoperative variations. To overcome these challenges, this paper introduces a sample-efficient RL framework with online expert correction for autonomous catheter steering in endovascular bifurcation navigation. The proposed framework integrates three key components: (1) A segmentation-based pose estimation module for accurate real-time state feedback, (2) A fuzzy controller for bifurcation-aware orientation adjustment, and (3) A structured reward generator incorporating expert priors to guide policy learning. By leveraging online expert correction, the framework reduces exploration inefficiency and enhances policy robustness in complex vascular structures. Experimental validation on a robotic platform using a transparent vascular phantom demonstrates that the proposed approach achieves convergence in 123 training episodes—a 25.9% reduction compared to the baseline Soft Actor-Critic (SAC) algorithm—while reducing average positional error to 83.8% of the baseline. These results indicate that combining sample-efficient RL with online expert correction enables reliable and accurate catheter steering, particularly in anatomically challenging bifurcation scenarios critical for endovascular navigation.
|
| |
| 15:00-16:30, Paper ThI2I.160 | Add to My Program |
| Real-Time Robotic Needle Insertion in Deformable and Moving Structure Using Learning-By-Example Method |
|
| Ha, Thuc Long | RDH-ICUBE - Université De Strasbourg |
| Bert, Julien | LaTIM, CHRU De Brest |
| Courtecuisse, Hadrien | RDH-ICUBE - Université De Strasbourg |
Keywords: Medical Robots and Systems, Simulation and Animation, Learning from Demonstration
Abstract: This paper presents an innovative and practical method for robotic needle steering in radio-frequency ablation (RFA) to treat cancer. One of the main challenges in this process is that tissue shifts and deforms during needle insertion, making it difficult to accurately predict the needle's path in real time. Inverse finite element (iFE) simulations have been used to address this problem. While these methods are accurate, they often require further refinement for effective time performance in real-world robotic systems, because when the method is incorporated into a real robot, there can be a delay in command execution. To address this challenge, we propose a machine learning-based solution that learns from offline simulations, shifting the intensive calculations required by iFE methods to an offline training stage and enabling online prediction of tissue deformation with reduced computational time. Our network was trained on data from numerous simulated needle insertions to capture interactions among insertion forces, tissue properties, and resulting motion. Once trained, the model produces predictions almost instantaneously, making it suitable for real-time applications. We validated the approach by steering the needle in a simulated deformable, moving gel to compare it with numerical methods, and then by performing needle steering within a reconstructed human body that involves multiple structures and integrates the robot's dynamics. The results demonstrated that the developed networks achieved slightly better accuracy in the first scenario while also running faster, resulting in improved performance under the robot's dynamics. These findings show that our method is a promising advancement toward real-time guidance systems for needle-based medical procedures.
|
| |
| 15:00-16:30, Paper ThI2I.161 | Add to My Program |
| DA-MMP: Learning Coordinated and Accurate Throwing with Dynamics-Aware Motion Manifold Primitives |
|
| Chu, Chi | Shanghai Qi Zhi Institute |
| Xu, Huazhe | Tsinghua University; Shanghai Qi Zhi Institute |
Keywords: Representation Learning, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Dynamic manipulation is a key capability for advancing robot performance, enabling skills such as tossing. While recent learning-based approaches have pushed the field forward, most methods still rely on manually designed action parameterizations, limiting their ability to produce the highly coordinated motions required in complex tasks. Motion planning can generate feasible trajectories, but the dynamics gap—stemming from control inaccuracies, contact uncertainties, and aerodynamic effects—often causes large deviations between planned and executed trajectories. In this work, we propose Dynamics-Aware Motion Manifold Primitives (DA-MMP), a motion generation framework for goal-conditioned dynamic manipulation, and instantiate it on a challenging real-world ring-tossing task. Our approach extends motion manifold primitives to variable-length trajectories through a compact parameterization and learns a high-quality manifold from a large-scale dataset of planned motions. Building on this manifold, a conditional flow matching model is trained in the latent space with a small set of real-world trials, enabling the generation of throwing trajectories that account for execution dynamics. Experiments show that our method can generate coordinated and smooth motion trajectories for the ring-tossing task. In real-world evaluations, it achieves high success rates and even surpasses the performance of trained human experts. Moreover, it generalizes to novel targets beyond the training range, indicating that it successfully learns the underlying trajectory–dynamics mapping.
|
| |
| 15:00-16:30, Paper ThI2I.162 | Add to My Program |
| The Curse of Precision: A Data Scaling Law for High-Precision Robotic Manipulation |
|
| Xu, Cuijie | Tsinghua University |
| Xu, Yuanfan | Tsinghua University |
| Xue, Min | Taiyuan University of Technology |
| Lin, Jianjie | Technische Universität München |
| Zhang, Xudong | Tsinghua Univ |
| Wang, Jian | Tsinghua Univ |
| Wang, Yu | Tsinghua University |
| Yu, Jincheng | Tsinghua University |
Keywords: Imitation Learning, AI-Based Methods, Deep Learning in Grasping and Manipulation
Abstract: While scaling laws for imitation learning have primarily focused on generalization in open-world settings, the relationship between data and precision in closed-world tasks like robotic assembly remains largely unexplored. This paper systematically investigates this relationship and introduces a novel scaling law. We find that to achieve a fixed success rate, the required number of demonstrations, N, grows super-exponentially as the target precision, P, approaches a limit, c. This relationship is accurately captured by the model log(N) ∝ 1/(P − c). Crucially, we reveal that the limit precision c is not a static physical constant of the task but an emergent property of the entire agent system, including its sensors and expert policy. Through experiments on canonical manipulation tasks, we validate this law and demonstrate that improving system components—such as adding a wrist camera or using a more effective expert—measurably lowers c, thus expanding the system's achievable precision. Our work provides a new theoretical framework for precision in robotics and a quantitative metric to evaluate system capabilities. Furthermore, these findings provide a practical methodology for guiding the development and debugging of high-precision manipulation systems.
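Given (precision, demonstration-count) pairs, the stated law can be fit by scanning candidate values of the limit precision c and solving a linear least-squares problem for the remaining parameters at each candidate. A minimal sketch on synthetic data, not the authors' estimator:

```python
import numpy as np

def fit_precision_scaling(P, N, c_grid=None):
    """Fit  log(N) = a / (P - c) + b  by scanning c (must satisfy c < min(P))
    and solving a linear least-squares problem for (a, b) at each candidate."""
    P, logN = np.asarray(P, float), np.log(np.asarray(N, float))
    if c_grid is None:
        c_grid = np.linspace(P.min() - 1.0, P.min() - 1e-3, 200)
    best = None
    for c in c_grid:
        x = 1.0 / (P - c)
        A = np.stack([x, np.ones_like(x)], axis=1)
        coef, *_ = np.linalg.lstsq(A, logN, rcond=None)
        err = np.sum((A @ coef - logN) ** 2)
        if best is None or err < best[0]:
            best = (err, c, coef[0], coef[1])
    return best[1:]                      # (c, a, b)

# synthetic data generated from the law itself with c = 0.5
P = np.array([1.5, 1.2, 1.0, 0.9, 0.8])
N = np.exp(2.0 / (P - 0.5) + 1.0)
print(fit_precision_scaling(P, N))       # recovers c close to 0.5
```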
|
| |
| 15:00-16:30, Paper ThI2I.163 | Add to My Program |
| 4DRaL: Bridging 4D Radar with LiDAR for Place Recognition Using Knowledge Distillation |
|
| Huang, Ningyuan | Northeastern University |
| Li, Zhiheng | Northeastern University |
| Fang, Zheng | Northeastern University |
Keywords: Localization, Deep Learning for Visual Perception, Visual Learning
Abstract: Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results demonstrate that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks, in normal and adverse weather alike.
|
| |
| 15:00-16:30, Paper ThI2I.164 | Add to My Program |
| Offline Reinforced Finetuning for Chunk-Based VLA Via Real-World RL Policy Distillation with Vision-Guided Copilot |
|
| Wu, Yihao | Tsinghua University |
| Yu, Zhenjun | Shanghai Jiao Tong University |
| Yin, Shun | South China Normal University |
| Tan, Junbo | Tsinghua University |
| Wang, Zhihao | Harbin Institute of Technology, Shenzhen |
| Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School at Shenzhen, Tsinghua University, 518055 Shenzhen, China |
|
|
| |
| 15:00-16:30, Paper ThI2I.165 | Add to My Program |
| Edged USLAM: Edge-Aware Event-Based SLAM with Learning-Based Depth Priors |
|
| Sarıözkan, Şebnem | Paderborn University |
| Şahin, Hürkan | Universität Paderborn |
| Álvarez-Tuñón, Olaya | Aarhus University |
| Kayacan, Erdal | Paderborn University |
Keywords: Visual-Inertial SLAM, SLAM, Data Sets for SLAM
Abstract: Conventional visual simultaneous localization and mapping (SLAM) algorithms often fail under rapid motion, low illumination, or abrupt lighting transitions due to motion blur and limited dynamic range. Event cameras mitigate these issues with high temporal resolution and high dynamic range (HDR), but their sparse, asynchronous outputs complicate feature extraction and integration with other sensors, e.g., inertial measurement units (IMUs) and standard cameras. We present Edged USLAM, a hybrid visual–inertial system that extends Ultimate SLAM (USLAM) with an edge-aware front-end and a lightweight depth module. The front-end enhances event frames for robust feature tracking and nonlinear motion compensation, while the depth module provides coarse, region-of-interest (ROI)-based scene depth to improve motion compensation and scale consistency. Evaluations across public benchmarks and real-world unmanned air vehicle (UAV) flights demonstrate that performance varies significantly by scenario. For instance, event-only methods like point–line event-based visual–inertial odometry (PL-EVIO) or learning-based pipelines such as deep event-based visual odometry (DEVO) excel in highly aggressive or extreme HDR conditions. In contrast, Edged USLAM provides superior stability and minimal drift in slow or structured trajectories, ensuring consistently accurate localization on real flights under challenging illumination. These findings highlight the complementary strengths of event-only, learning-based, and hybrid approaches, while positioning Edged USLAM as a robust solution for diverse aerial navigation tasks.
|
| |
| 15:00-16:30, Paper ThI2I.166 | Add to My Program |
| LIMBERO: A Limbed Climbing Exploration Robot Toward Traveling on Rocky Cliffs |
|
| Uno, Kentaro | Tohoku University |
| Imai, Masazumi | Tohoku University |
| Takada, Kazuki | Tohoku University |
| Kataonami, Teruhiro | Tohoku University |
| Matsuura, Yudai | Tohoku University |
| Ringeval-Meusnier, Antonin | INSA Toulouse |
| Nagaoka, Keita | Tohoku University |
| Eguchi, Mikio | Tohoku University |
| Nishibe, Ryo | Tohoku University |
| Yoshida, Kazuya | Tohoku University |
Keywords: Field Robots, Grippers and Other End-Effectors, Space Robotics and Automation
Abstract: In lunar and planetary exploration, legged robots have attracted significant attention as an alternative to conventional wheeled robots, which struggle to traverse rough and uneven terrain. To enable locomotion over highly irregular and steeply inclined surfaces, limbed climbing robots equipped with grippers on their feet have emerged as a promising solution. In this paper, we present LIMBERO, a 10 kg-class quadrupedal climbing robot that employs spine-type grippers for stable locomotion and climbing on rugged and steep terrain. We first introduce a novel gripper design featuring coupled finger-closing and spine-hooking motions actuated by a single motor, which achieves exceptional grasping performance (>150 N) despite its lightweight design (525 g). Furthermore, we develop an efficient algorithm to visualize a geometry-based graspability index on continuous rough terrain. Finally, we integrate these components into LIMBERO and demonstrate its ability to ascend steep rocky surfaces under a 1 G gravity condition, a performance not previously achieved by limbed climbing robots of this scale.
|
| |
| 15:00-16:30, Paper ThI2I.167 | Add to My Program |
| NaviTrace: Evaluating Embodied Navigation of Vision-Language Models |
|
| Windecker, Tim | Karlsruhe Institute of Technology |
| Patel, Manthan | ETH Zurich |
| Reuss, Moritz | Karlsruher Institut of Technology |
| Schwarzkopf, Richard | Karlsruhe Institute of Technology |
| Cadena, Cesar | ETH Zurich |
| Lioutikov, Rudolf | Karlsruhe Institute of Technology |
| Hutter, Marco | ETH Zurich |
| Frey, Jonas | Stanford University |
Keywords: Vision-Based Navigation, Performance Evaluation and Benchmarking, Data Sets for Robot Learning
Abstract: Vision–language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models’ navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation.
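A toy version of such a trace score can be assembled from its stated ingredients: a DTW term between predicted and expert traces plus a goal endpoint error, with the semantic penalty left as an input. The weighting and the per-pixel semantic penalties used in NaviTrace are assumptions here.

```python
import numpy as np

def dtw_distance(a, b):
    """Plain O(len(a) * len(b)) dynamic time warping between two 2-D traces."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)          # length-normalized alignment cost

def trace_score(pred, expert, w_dtw=1.0, w_goal=1.0, semantic_penalty=0.0):
    """Toy semantic-aware trace score: DTW term + goal endpoint error + an
    externally supplied embodiment/semantic penalty. Lower is better."""
    pred, expert = np.asarray(pred, float), np.asarray(expert, float)
    goal_err = np.linalg.norm(pred[-1] - expert[-1])
    return w_dtw * dtw_distance(pred, expert) + w_goal * goal_err + semantic_penalty

expert = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
pred = np.array([[0, 0], [1, 0.8], [2.1, 2.0], [2.9, 3.2]])
print(trace_score(pred, expert))
```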
|
| |
| 15:00-16:30, Paper ThI2I.168 | Add to My Program |
| Quartic Oscillator Model with Applications in Locomotion |
|
| Carvalho, André | IdMEC/IST |
| Botto, Miguel | DEM/IST |
| Martins, Jorge | Technical University of Lisbon - Instituto Superior Técnico |
Keywords: Humanoid and Bipedal Locomotion, Modeling and Simulating Humans, Dynamics
Abstract: This work introduces a novel compliant model for running gaits. The model consists of a linear leg stiffness paired with a nonlinear energy regulation term. This new model, termed the quartic model, is shown to reproduce the external dynamics of a running gait. The characteristics of the gait are imposed through parametric conditions which are derived through linearization of the model. The nonlinear nature of the model ensures convergence towards a limit cycle, which makes the model a useful template for the control of legged systems.
|
| |
| 15:00-16:30, Paper ThI2I.169 | Add to My Program |
| Towards Real-World Identification of Fatigued Muscle Groups Via Musculoskeletal Simulation |
|
| Chauhan, Jenishkumar | Indian Institute of Technology Gandhinagar |
| Brahmbhatt, Samarth | Independent Researcher |
| Vashista, Vineet | Indian Institute of Technology Gandhinagar |
Keywords: Rehabilitation Robotics, Modeling and Simulating Humans, Human and Humanoid Motion Analysis and Synthesis
Abstract: Contactless diagnosis of musculoskeletal disorders can potentially improve population health as well as robot behaviours in collaborative settings. However, current diagnosis methods require an in-person physical examination in which a trained physician senses, through contact, the force applied by various muscles. Simulation tools exist, but their use for diagnosis with real data is under-explored. In this paper, we propose an algorithm for identifying which upper-limb muscle group is fatigued. Our algorithm compares the real-world free-space motion of the subject with that of a simulated musculoskeletal model, and is therefore contactless, removing the need for invasive sensing or in-person assessment. Our algorithm simulates various fatigue conditions using a physics-based musculoskeletal model and extracts diagnostic motion features from both real and simulated data, which are compared for diagnosis. Experimental results on real data demonstrate that the proposed method can reliably identify which of multiple muscle groups is fatigued. Additionally, through comprehensive performance comparisons, we show how recent advanced musculoskeletal simulators can be properly configured to address the sim-to-real gap in the context of the fatigue diagnosis task. Our approach can potentially spur further research in remote and automated diagnosis, significantly lowering the barrier to large-scale and early detection.
|
| |
| 15:00-16:30, Paper ThI2I.170 | Add to My Program |
| A Computationally Efficient Nonparametric Approach for Robot Imitation Learning |
|
| Wang, Yijin | University of Leeds |
| Wu, Shaokang | University of Leeds |
| Liu, Chen | University of Leeds |
| Zhang, Chuankai | University of Leeds |
| Silvério, João | German Aerospace Center (DLR) |
| Huang, Yanlong | University of Leeds |
Keywords: Learning from Demonstration, Imitation Learning
Abstract: Transferring human skills to robots through learning from demonstrations has been an important topic in the robotics community, and many models have been developed for learning and adapting such skills. Among them, nonparametric representations are an appealing choice, since nonparametric solutions avoid the explicit definition of basis functions, require fewer hyperparameters, and facilitate straightforward generalization for tasks involving high-dimensional inputs (e.g., human-robot collaboration and dual-arm manipulation). However, a commonly raised concern for nonparametric models is their computational complexity. In this paper, we propose a computationally efficient solution for nonparametric skill learning, whose computation time grows quadratically with the length of demonstrations, as opposed to the cubic growth in a standard nonparametric model. The solution is further improved by exploiting local models and fusing their predictions. We evaluate our approach in a 2-D writing task with time input, a 3-D human-guided obstacle avoidance task, and a dual-arm transportation task associated with 7-D input. The results show that our solution achieves comparable performance to the parametric method and enables instant adaptations in tasks associated with time or multi-dimensional inputs.
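For context, the standard nonparametric predictor that such methods accelerate takes the kernel ridge form below, whose training cost is dominated by an (N, N) linear solve, hence the cubic growth mentioned in the abstract. A minimal sketch on the 2-D writing setting; the paper's quadratic-time construction and local-model fusion are not reproduced here.

```python
import numpy as np

def kernel_predict(X_train, Y_train, X_query, lengthscale=0.1, reg=1e-6):
    """Standard nonparametric (kernel ridge) prediction: the cubic-cost
    baseline, with an RBF kernel over possibly multi-dimensional inputs."""
    X, Y, Xq = map(np.atleast_2d, (X_train, Y_train, X_query))
    sqdist = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sqdist(X, X) / (2 * lengthscale ** 2))    # (N, N) Gram matrix
    k = np.exp(-sqdist(Xq, X) / (2 * lengthscale ** 2))   # query-train kernel
    # the (N, N) solve below is the O(N^3) bottleneck
    return k @ np.linalg.solve(K + reg * np.eye(len(X)), Y)

# 2-D writing demo: time input -> (x, y) pen position
t = np.linspace(0, 1, 50)[:, None]
Y = np.column_stack([np.sin(2 * np.pi * t[:, 0]), np.cos(2 * np.pi * t[:, 0])])
print(kernel_predict(t, Y, np.array([[0.25]])))   # close to (1.0, 0.0)
```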
|
| |
| 15:00-16:30, Paper ThI2I.171 | Add to My Program |
| STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation |
|
| Zhu, Haokun | Carnegie Mellon University |
| Li, Zongtai | Carnegie Mellon University |
| Liu, Zhixuan | Carnegie Mellon University |
| Wang, Wenshan | Carnegie Mellon University |
| Zhang, Ji | Carnegie Mellon University |
| Francis, Jonathan | Carnegie Mellon University, Bosch Center for AI |
| Oh, Jean | Carnegie Mellon University, Lavoro AI Research |
Keywords: Semantic Scene Understanding, AI-Enabled Robotics, AI-Based Methods
Abstract: Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation presents two key challenges: effectively parsing and structuring complex environment information and determining when and how to query VLMs. Insufficient environment understanding and over-reliance on VLMs (e.g. querying at every step) can easily lead to unnecessary backtracking and reduced navigation efficiency, especially in large continuous environments. To address these challenges, we propose a novel framework that incrementally constructs a multi-layer environment representation consisting of viewpoints, object nodes, and room nodes during navigation. Viewpoints and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this structured representation, we propose a novel two-stage navigation policy, integrating high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently and reliably locate a goal object. We evaluated our approach on four simulated benchmarks (HM3D v1&v2, RoboTHOR, and MP3D), and achieved state-of-the-art performance on both success rate (SR, +13.1%) and navigation efficiency (SPL, +6.2%). We further validate our method on a real robot platform, demonstrating strong robustness across 120 episodes in 10 different indoor environments. Project page is available at: https://zwandering.github.io/STRIVE.github.io/.
|
| |
| 15:00-16:30, Paper ThI2I.172 | Add to My Program |
| A Counterfactual Reasoning Framework for Fault Diagnosis in Robot Perception Systems |
|
| Han, Haeyoon | California Institute of Technology |
| Taheri, Mahdi | California Institute of Technology (Caltech) |
| Chung, Soon-Jo | Caltech |
| Hadaegh, Fred | Jet Propulsion Laboratory |
Keywords: Failure Detection and Recovery, Sensor-based Control, Task and Motion Planning
Abstract: Perception systems provide a rich understanding of the environment for autonomous systems, shaping decisions in all downstream modules. Hence, accurate detection and isolation of faults in perception systems is important. Faults in perception systems pose particular challenges: faults are often tied to the perceptual context of the environment, and errors in their multi-stage pipelines can propagate across modules. To address this, we adopt a counterfactual reasoning approach to propose a framework for fault detection and isolation (FDI) in perception systems. As opposed to relying on physical redundancy (i.e., having extra sensors), our approach utilizes analytical redundancy with counterfactual reasoning to construct perception reliability tests as causal outcomes influenced by system states and fault scenarios. Counterfactual reasoning generates reliability test results under hypothesized faults to update the belief over fault hypotheses. We derive both passive and active FDI methods. While the passive FDI can be achieved by belief updates, the active FDI approach is defined as a causal bandit problem, where we utilize Monte Carlo Tree Search (MCTS) with upper confidence bound (UCB) to find control inputs that maximize a detection and isolation metric, designated as Effective Information (EI), which quantifies the informativeness of control inputs for FDI. We demonstrate the approach in a robot exploration scenario, where a space robot performing vision-based navigation actively adjusts its attitude to increase EI and correctly isolate faults caused by sensor damage, dynamic scenes, and perceptual degradation.
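The passive FDI route described above is, at its core, a Bayes update of the belief over fault hypotheses given the outcome of a perception reliability test. A schematic sketch, where the likelihood model is an illustrative assumption rather than the paper's:

```python
import numpy as np

def update_belief(belief, pass_likelihoods, test_passed):
    """Passive FDI via a Bayes update over fault hypotheses.

    belief:           (H,) prior probability of each fault hypothesis
    pass_likelihoods: (H,) P(test passes | hypothesis h)
    test_passed:      True if the perception reliability test passed
    """
    belief = np.asarray(belief, float)
    like = np.asarray(pass_likelihoods, float)
    post = belief * (like if test_passed else 1.0 - like)
    return post / post.sum()

# hypotheses: [no fault, sensor damage, dynamic scene]
b = np.array([0.8, 0.1, 0.1])
b = update_belief(b, pass_likelihoods=np.array([0.95, 0.2, 0.4]),
                  test_passed=False)
print(b)   # a failed test shifts probability mass toward the fault hypotheses
```

Active FDI then chooses control inputs that make the next test outcome maximally informative about which hypothesis is true, which is where the MCTS/UCB search enters.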
|
| |
| 15:00-16:30, Paper ThI2I.173 | Add to My Program |
| Robust LPV Modeling of Precision Motion Systems Via Edge-Theorem Verification |
|
| Al-Rawashdeh, Yazan | Princess Sumaya University for Technology |
| Al Janaideh, Mohammad | University of Guelph |
Keywords: Motion Control
Abstract: This work proposes a systematic workflow for constructing grid-based Linear Parameter-Varying (LPV) models from frequency response data. Transfer functions are estimated at multiple scheduling-parameter grid points, fitted with a fixed model order, and transformed into controllable canonical realizations to ensure structural consistency. These vertex models are interpolated into an LPV state-space representation, while robust stability is verified using the Edge Theorem, which reduces the problem to checking edge polynomials of the convex hull. The novelty of the approach lies in integrating frequency-domain identification, canonical-form embedding, and polytope-based robust stability analysis into a unified LPV framework. Unlike conventional methods that rely on time-domain experiments or subspace techniques, the proposed method exploits experimentally accessible frequency-response data and avoids coordinate mismatches during interpolation. Validation on a precision motion system demonstrates both theoretical soundness and practical applicability, confirming the workflow as a reliable pathway from frequency-domain data to robust LPV control design.
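The Edge Theorem reduces robust stability of the polytope to its edges, and each edge can be checked numerically by sweeping the convex combination of two vertex polynomials and inspecting the roots. A minimal sketch of one edge check (a dense parameter sweep, not a formal certificate):

```python
import numpy as np

def edge_stable(p0, p1, n_points=101):
    """Check Hurwitz stability along one polytope edge, i.e. for
    p(s) = (1 - t) * p0(s) + t * p1(s) with t in [0, 1].

    p0, p1: polynomial coefficients in descending powers of s.
    """
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    for t in np.linspace(0.0, 1.0, n_points):
        roots = np.roots((1.0 - t) * p0 + t * p1)
        if np.any(roots.real >= 0.0):     # a root in the closed RHP: unstable
            return False
    return True

# two stable vertex polynomials: s^2 + 3s + 2 and s^2 + 5s + 6
print(edge_stable([1, 3, 2], [1, 5, 6]))  # True: the whole edge is Hurwitz
```

The full Edge Theorem test repeats this check over every edge of the coefficient polytope spanned by the vertex models.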
|
| |
| 15:00-16:30, Paper ThI2I.174 | Add to My Program |
| Knowledge Optical to Sonar (KnOTS): Towards the Transfer of Knowledge of Underwater Object Detection from Optical to Forward-Looking Sonar Imagery |
|
| Keenan, Caroline | MIT Lincoln Laboratory |
| Wawrzynek, Ella R. | MIT Lincoln Laboratory |
| Whelihan, David | MIT Lincoln Laboratory |
| Mahncke, Ivy | Franklin W. Olin College of Engineering |
| Leonard, John J. | MIT |
| Miller, Madeline D. | MIT Lincoln Laboratory |
Keywords: Marine Robotics, Sensor Fusion, Deep Learning for Visual Perception
Abstract: We develop an approach to detect objects in forward-looking sonar (FLS) images using corresponding optical images and without the need for expert manual labeling of sonar images. Sonar sensing is more robust to disadvantageous underwater environmental conditions than optical sensing, but the scarcity of labeled sonar data leads to decreased performance of methods which rely on an abundance of training data. We aim to transfer insights from data-rich applications such as object detection in optical imaging to the data-scarce area of object detection in sonar images. Our approach involves recording contemporaneous images from commercially available sensors viable for use aboard unmanned underwater vehicles. We collect new optical and sonar data in a shallow, clear-water environment and employ existing object detection techniques for optical images. We leverage the commonality of the sensors’ fields of view and our algorithmic processing of the sonar image to transfer knowledge of object bounding boxes to sonar images to create a dataset. Through this transfer, we enable training of a model that detects objects in unseen sonar images and does not require optical images as input at test time.
|
| |
| 15:00-16:30, Paper ThI2I.175 | Add to My Program |
| The iMETRO Dynamic Simulation: An Open-Source Simulator for Intravehicular Space Robotics Research |
|
| Hart, Nikki | Rice University |
| Dunkelberger, Nathan | Nasa Jsc, Caci |
| Holum, Erik | NASA, METECS, Astropup Engineering |
| Kavraki, Lydia | Rice University |
| Zemler, Emma | NASA |
| Azimi, Shaun | NASA |
Keywords: Space Robotics and Automation, Simulation and Animation, Manipulation Planning
Abstract: We present the iMETRO Dynamic Simulation, the first open-source dynamic simulation environment for research in the use of robot manipulators inside space vehicles for maintenance and logistics tasks, or intravehicular robotics (IVR). IVR has great potential to facilitate science and exploration on the Moon by saving crew time, but there are limited open-source resources that would enable researchers to identify the next set of challenges in manipulation for IVR. We provide a full-featured, high-fidelity dynamic simulation of the real-world iMETRO IVR test facility, which includes mockups representative of the interior of a future space vehicle as well as an 8-DoF manipulator that serves as an example robot platform for this research. Our modular simulator enables new software, hardware, and operational paradigms to be tested in a reconfigurable mockup environment. To improve the accessibility and extensibility of this simulation environment, we also provide ROS 2 hardware control interfaces to MuJoCo as well as a model conversion tool such that the same models may be used with ROS 2 and MuJoCo. To evaluate the sim-to-real transfer capabilities of this simulation, we present an open-source example application demonstration developed in the simulation and transfer it to the real-world iMETRO facility in less than a day. Finally, we identify the challenges and opportunities in modeling a real-world facility to aid future simulation efforts. The open-source simulation and application can be found at https://github.com/NASA-JSC-Robotics. The MuJoCo and ROS 2 integration tools have migrated to the ros-controls organization and can be found at https://github.com/ros-controls/mujoco_ros2_control.
|
| |
| 15:00-16:30, Paper ThI2I.176 | Add to My Program |
| One-Shot View Planning and Online Optimization-Based Replanning for Unknown Object Reconstruction |
|
| Patiño Miñán, José Johil | Cardiff University |
| Kingston, Zachary | Purdue University |
| Romero-Cano, Victor | Cardiff University |
| Lai, Yu-Kun | Cardiff University |
| Hernández, Juan D. | Cardiff University |
Keywords: Reactive and Sensor-Based Planning, Constrained Motion Planning, Motion and Path Planning
Abstract: Robotic inspection tasks often require constructing high-quality 3D models of objects from a minimal number of views. Traditional next-best view planning (NBVP) approaches incrementally select view poses but fail to account for global optimality of the inspection trajectory, thus leading to inefficient inspection paths. Recent one-shot view planning (OSVP) methods address this challenge by predicting informative view poses from an initial observation. While subsequent refinements of the pioneering OSVP approach attempt to improve prediction accuracy, they can still fail when faced with out-of-distribution (OoD) examples. With recent advances in generative modeling, OSVP methods can infer a plausible object shape from one observation and then derive the corresponding solution set of view poses. However, because the predicted shape may deviate from the true geometry, these methods can still generate infeasible views. To overcome these limitations, we propose a novel OSVP framework that leverages RGB-D data to generate geometric priors and incorporates online video-based reconstruction. Our method formulates viewpoint selection and path optimization jointly, so that both the calculated poses and the connecting trajectories satisfy visibility constraints, maintain smoothness, and can be locally replanned to compensate for discrepancies between predicted and real object geometries. We validate our OSVP approach through simulation benchmarks against state-of-the-art OSVP techniques and demonstrate its effectiveness on a real Franka Emika manipulator.
|
| |
| 15:00-16:30, Paper ThI2I.177 | Add to My Program |
| Endoscopic Spine Surgical View Enhancement Via Diffusion-Prior Contrastive and Physics-Informed Constraints for Robotic Navigation |
|
| Han, Haojie | Tsinghua University |
| Ma, Longfei | Tsinghua University |
| Xu, Kai | Beijing Tsinghua Changgung Hospital |
| Gu, Suxi | Orthopedic Department, Beijing Tsinghua Changgung Hospital |
| Zhang, Shipeng | Tsinghua University |
| Ning, Guochen | Tsinghua University |
| Chen, Fang | Shanghai Jiao Tong University |
| Liao, Hongen | Tsinghua University |
Keywords: Computer Vision for Medical Robotics, Data Sets for Robotic Vision, Vision-Based Navigation
Abstract: In robot-assisted spinal endoscopy, intraoperative imaging is frequently degraded by bleeding, irrigation fluids, bubbles, smoke, and uneven illumination, which can severely compromise surgical precision, safety, and decision-making. Accurate identification of anatomical structures is particularly critical in spinal procedures, yet acquiring paired clean and degraded images in real clinical settings is infeasible. To address this challenge, we propose DCP-Net, an unpaired endoscopic image restoration framework tailored for robotic spinal surgery. DCP-Net integrates Diffusion-Prior Contrastive Learning (DPCL) to leverage generative priors and contrastive objectives for robust latent representations, and Physics-Informed Constraints (PIC) to ensure anatomically consistent restoration. Furthermore, we introduce Diffusion-Prior Uncertainty Estimation (DPUE), providing pixel-wise confidence maps that quantify restoration reliability and guide risk-aware robotic perception. We further constructed a dataset comprising 21,845 paired/unpaired samples of intraoperative visual degradations in spinal endoscopy, primarily involving bleeding, bubbles, and other artifacts. Extensive experiments show that DCP-Net outperforms existing methods in both quantitative metrics and perceptual quality, significantly improving visual clarity and supporting various robotic navigation tasks. Among these tasks, accurate bleeding point detection plays a particularly critical role in ensuring safe and precise navigation in clinical practice.
|
| |
| 15:00-16:30, Paper ThI2I.178 | Add to My Program |
| The Trajectory Bundle Method: Unifying Sequential-Convex Programming and Sampling-Based Trajectory Optimization |
|
| Tracy, Kevin | Carnegie Mellon University |
| Zhang, John | Massachusetts Institute of Technology |
| Arrizabalaga, Jon | Massachusetts Institute of Technology (MIT) |
| Schaal, Stefan | Google X |
| Erez, Tom | Google |
| Tassa, Yuval | University of Washington |
| Manchester, Zachary | Massachusetts Institute of Technology |
Keywords: Optimization and Optimal Control, Motion and Path Planning, Integrated Planning and Control
Abstract: We present a unified framework for solving trajectory optimization problems in a derivative-free manner through the use of sequential convex programming. Traditionally, nonconvex optimization problems are solved by forming and solving a sequence of convex optimization problems, where the cost and constraint functions are approximated locally through Taylor series expansions. This presents a challenge for functions where differentiation is expensive or unavailable. In this work, we present a derivative-free approach to form these convex approximations by computing samples of the dynamics, cost, and constraint functions and letting the solver interpolate between them. Our framework includes sample-based trajectory optimization techniques like model-predictive path integral (MPPI) control as a special case and generalizes them to enable features like multiple shooting and general equality and inequality constraints that are traditionally associated with derivative-based sequential convex programming methods. The resulting framework is simple, flexible, and capable of solving a wide variety of practical motion planning and control problems.
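The MPPI special case mentioned above is easy to state: perturb a nominal control sequence, score the samples, and combine them with exponentiated-cost weights. The sketch below shows only that special case; the bundle method's convex interpolation and general constraint handling go beyond it.

```python
import numpy as np

def mppi_update(u_mean, cost_fn, rng, n_samples=256, sigma=0.3, temperature=1.0):
    """One MPPI-style update: sample perturbed control sequences, weight them
    by exponentiated cost, and return the weighted mean sequence."""
    U = u_mean + sigma * rng.normal(size=(n_samples,) + u_mean.shape)
    costs = np.array([cost_fn(u) for u in U])
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return np.tensordot(w, U, axes=1)      # cost-weighted mean sequence

# toy problem: drive a 1-D integrator from 0 to 1 in 10 steps
def cost_fn(u):
    x = np.cumsum(u)                       # integrator rollout
    return (x[-1] - 1.0) ** 2 + 1e-2 * np.sum(u ** 2)

rng = np.random.default_rng(0)
u = np.zeros(10)
for _ in range(20):
    u = mppi_update(u, cost_fn, rng)
print(np.cumsum(u)[-1])                    # final state close to 1.0
```

Seen through the paper's lens, the exponentiated-cost averaging is one particular way of interpolating among sampled trajectories; replacing it with a convex program over the samples is what admits multiple shooting and explicit constraints.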
|
| |
| 15:00-16:30, Paper ThI2I.179 | Add to My Program |
| Learning Neural Observer-Predictor Models for Limb-Level Sampling-Based Locomotion Planning |
|
| Kulkarni, Abhijeet Mangesh | University of Delaware |
| Poulakakis, Ioannis | University of Delaware |
| Huang, Guoquan (Paul) | University of Delaware |
Keywords: Motion and Path Planning, Collision Avoidance, Machine Learning for Robot Control
Abstract: Accurate full-body motion prediction is essential for the safe, autonomous navigation of legged robots, enabling critical capabilities like limb-level collision checking in cluttered environments. Simplified kinematic models often fail to capture the complex, closed-loop dynamics of the robot and its low-level controller, limiting their predictions to simple planar motion. To address this, we present a learning-based observer-predictor framework that accurately predicts this motion. Our method features a neural observer with provable Uniformly Ultimately Bounded (UUB) guarantees that provides a reliable latent state estimate from a history of proprioceptive measurements. This stable estimate initializes a computationally efficient predictor, designed for the rapid, parallel evaluation of thousands of potential trajectories required by modern sampling-based planners. We validated the system by integrating our neural predictor into a Model Predictive Path Integral (MPPI)-based planner on a Vision 60 quadruped. Hardware experiments successfully demonstrated effective, limb-aware motion planning in a challenging, narrow passage and over small objects, highlighting our system’s ability to provide a robust foundation for high-performance, collision-aware planning on dynamic robotic platforms.
|
| |
| 15:00-16:30, Paper ThI2I.180 | Add to My Program |
| Direct In-Petri Dish Liquid Droplet Manipulation Based on Microscope and Ultrasonic Phased Transducer Array |
|
| Li, Huamin | ShanghaiTech University |
| Liu, Song | ShanghaiTech University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Nanomanufacturing
Abstract: High-precision liquid droplet manipulation is widely used in life science, biomedical engineering, and industry. However, a robotic approach is still lacking that supports the direct in-petri dish manipulation of liquid droplets without the need for additives (such as magnetic beads) or other setup requirements (such as hydrophobic or conductive surfaces). From the robotics perspective, this challenge concerns designing an adequate robot end-effector capable of firmly holding and effectively moving droplets on hydrophilic surfaces. In this paper, we propose an automated robotic system for direct in-petri dish liquid droplet manipulation based on an ultrasonic phased transducer array (UPTA) and a microscope, which can interact with users to selectively grasp droplets, follow user-designated trajectories to transport droplets, and position droplets with high precision. The core working mechanism of the proposed system is to precisely generate an inclined single focal point, acting as a non-contact end-effector, inside the droplet under the guidance of the microscope; this focal point induces sufficient hydrodynamic actuation forces on the peripheral contact line of the droplet to keep the droplet moving with the ultrasonic end-effector. Since it is additive-free, our system is inherently compatible with medical, chemical, and industrial protocols. Details regarding the system design and implementation, the ultrasound focusing strategy, and the visual servo control scheme are elaborated in this paper. Experiments validated the effectiveness of the proposed system.
|
| |
| 15:00-16:30, Paper ThI2I.181 | Add to My Program |
| RoTri-Diff: A Spatial Robot–Object Triadic Interaction-Guided Diffusion Model for Bimanual Manipulation |
|
| Chen, Zixuan | Nanjing University |
| Chan, Nga Teng | Hong Kong University of Science and Technology |
| Hou, Yiwen | National University of Singapore |
| Tie, Chenrui | National University of Singapore |
| Liu, Zixuan | National University of Singapore |
| Chen, Haonan | National University of Singapore |
| Chen, Junting | National University of Singapore |
| Shi, Jieqi | Nanjing University |
| Gao, Yang | Nanjing University |
| Huo, Jing | Nanjing University |
| Shao, Lin | National University of Singapore |
Keywords: Bimanual Manipulation, Imitation Learning, RGB-D Perception
Abstract: Bimanual manipulation is a fundamental robotic skill that requires continuous and precise coordination between two arms. While imitation learning (IL) is the dominant paradigm for acquiring this capability, existing approaches, whether robot-centric or object-centric, often overlook the dynamic geometric relationship among the two arms and the manipulated object. This limitation frequently leads to inter-arm collisions, unstable grasps, and degraded performance in complex tasks. To address this, we explicitly model the Robot–Object Triadic Interaction (RoTri) representation in bimanual systems by encoding the relative 6D poses between the two arms and the object to capture their spatial triadic relationship and establish continuous triangular geometric constraints. Building on this, we further introduce RoTri-Diff, a diffusion-based imitation learning framework that combines RoTri constraints with robot keyposes and object motion in a hierarchical diffusion process. This enables the generation of stable, coordinated trajectories and robust execution across different modes of bimanual manipulation. Extensive experiments show that our approach outperforms state-of-the-art baselines by 10.2% on 11 representative RLBench2 tasks and achieves stable performance on 4 challenging real-world bimanual tasks. Project website: https://rotri-diff.github.io/.
|
| |
| 15:00-16:30, Paper ThI2I.182 | Add to My Program |
| Instrumentation for Imitation Learning: Enhancing Training Datasets for Clothes Hanger Insertion |
|
| Proesmans, Remko | Ghent University |
| Lips, Thomas | Ghent University |
| Wyffels, Francis | Ghent University |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Perception for Grasping and Manipulation
Abstract: Large behaviour models have transformed the field of robotic manipulation, but prohibitive data requirements have thus far prevented a revolution similar to that of vision language models. We believe that instrumentation, i.e. sensor integration in objects, can provide invaluable state information and enable efficient learning for robotic manipulation. In this paper, we present instrumented imitation learning of clothes hanger insertion. Using 180 teleoperated demonstrations, we train diffusion policies with and without access to instrumentation data. Results show that policies leveraging instrumentation outperform vision-only counterparts by 14–25 percentage points and exhibit greater task awareness. Crucially, a black-box imitation learning policy learns to prioritise instrumentation signals without explicit guidance. In addition, enhancing the teleoperation dataset with rollouts from an instrumented “expert” policy enables a vision-only “student” policy to achieve performance comparable to the instrumented expert, thereby surpassing the original vision-only policy. These findings establish instrumentation as a promising strategy to enhance imitation learning for robotic manipulation. Datasets are available on Zenodo [10.5281/zenodo.17122216].
|
| |
| 15:00-16:30, Paper ThI2I.183 | Add to My Program |
| Accelerated Multi-Modal Motion Planning Using Context-Conditioned Diffusion Models |
|
| Sandra, Edward | KU Leuven |
| Vanroye, Lander | KU Leuven |
| Dirckx, Dries | KU Leuven |
| Cartuyvels, Ruben | KU Leuven |
| Swevers, Jan | KU Leuven |
| Decré, Wilm | KU Leuven |
Keywords: Imitation Learning, Task and Motion Planning
Abstract: Classical methods in robot motion planning, such as sampling-based and optimization-based methods, often struggle with scalability towards higher-dimensional state spaces and complex environments. Diffusion models, known for their capability to learn complex, high-dimensional and multi-modal data distributions, provide a promising alternative when applied to motion planning problems and have already shown interesting results. However, most of the current approaches train their model for a single environment, limiting their generalization to environments not seen during training. The techniques that do train a model for multiple environments rely on a specific camera to provide the model with the necessary environmental information and therefore always require that sensor. To effectively adapt to diverse scenarios without the need for retraining, this research proposes Context-Aware Motion Planning Diffusion (CAMPD). CAMPD leverages a classifier-free denoising probabilistic diffusion model, conditioned on sensor-agnostic contextual information. An attention mechanism, integrated in the well-known U-Net architecture, conditions the model on an arbitrary number of contextual parameters. CAMPD is evaluated on a 7-DoF robot manipulator and benchmarked against state-of-the-art approaches on real-world tasks, showing its ability to generalize to unseen environments and generate high-quality, multi-modal trajectories, at a fraction of the time required by existing methods.
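The classifier-free conditioning that CAMPD builds on can be summarized in a single denoising step; the model interface and guidance scale below are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of one classifier-free guided denoising step, the
# mechanism CAMPD builds on (names and scale are illustrative).
import torch

@torch.no_grad()
def cfg_denoise_step(eps_model, x_t, t, context, guidance_scale=2.0):
    """Combine conditional and unconditional noise predictions.

    eps_model(x, t, ctx) -> predicted noise; ctx=None means the context
    is dropped, as in classifier-free training.
    """
    eps_cond = eps_model(x_t, t, context)   # conditioned on environment context
    eps_uncond = eps_model(x_t, t, None)    # context dropped
    # Guided prediction: push the sample toward the conditional distribution.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```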
|
| |
| 15:00-16:30, Paper ThI2I.184 | Add to My Program |
| Markerless Robot Detection and 6D Pose Estimation for Multi-Agent SLAM |
|
| Rueggeberg, Markus | German Aerospace Center (DLR) |
| Ulmer, Maximilian | Deutsches Zentrum Für Luft Und Raumfahrt |
| Durner, Maximilian | German Aerospace Center DLR |
| Boerdijk, Wout | German Aerospace Center (DLR) |
| Müller, Marcus Gerhard | German Aerospace Center |
| Triebel, Rudolph | German Aerospace Center (DLR) |
| Giubilato, Riccardo | German Aerospace Center (DLR) |
Keywords: Multi-Robot SLAM, Multi-Robot Systems, Field Robots
Abstract: The capability of multi-robot SLAM approaches to merge localization history and maps from different observers is often challenged by the difficulty in establishing data association. Loop closure detection between perceptual inputs of different robotic agents is easily compromised in the context of perceptual aliasing, or when perspectives differ significantly. For this reason, direct mutual observation among robots is a powerful way to connect partial SLAM graphs, but it often relies on the presence of calibrated arrays of fiducial markers (e.g., AprilTag arrays), which severely limits the range of observations and frequently fails under sharp lighting conditions, e.g., reflections or overexposure. In this work, we propose a novel solution to this problem leveraging recent advances in deep-learning-based 6D pose estimation. We integrate markerless pose estimation into a decentralized multi-robot SLAM system and demonstrate its benefit to the relative localization accuracy of the robotic team. The solution is validated experimentally on data recorded in a field test campaign in a planetary-analogue environment.
|
| |
| 15:00-16:30, Paper ThI2I.185 | Add to My Program |
| Programmable Deformation Design of Porous Soft Actuator through Volumetric-Pattern-Induced Anisotropy |
|
| Meng, Canqi | ShanghaiTech University |
| Bai, Weibang | ShanghaiTech University |
Keywords: Soft Robot Materials and Design, Hydraulic/Pneumatic Actuators, Soft Sensors and Actuators
Abstract: Conventional soft pneumatic actuators, typically based on hollow elastomeric chambers, often suffer from limited structural support and require costly geometry-specific redesigns for multimodal functionality. Porous materials, such as foam, filled into the chambers can provide structural stability to the actuators. However, methods to achieve programmable deformation by tailoring the porous body itself remain underexplored. In this paper, a novel design method is presented to realize soft porous actuators with programmable deformation by incising specific patterns into the porous foam body. This approach introduces localized structural anisotropy in the foam, guiding the material's deformation under a global vacuum input. Furthermore, three fundamental patterns on a cylindrical foam substrate are discussed: transverse for bending, longitudinal for tilting, and diagonal for twisting. A computational model is built with Finite Element Analysis (FEA) to investigate the mechanism of the incision-patterning method. Experiments demonstrate that with a suitable design of the pattern array number N, actuators can achieve bending of up to 80° (N=2), tilting of 18° (N=1), and twisting of 115° (N=8). The versatility of our approach is demonstrated via pattern transferability, scalability, and mold-less rapid prototyping of complex designs. As a comprehensive application, we translate the human hand crease map into a functional incision pattern, creating a bio-inspired soft gripper capable of human-like adaptive grasping. Our work provides a new, efficient, and scalable paradigm for the design of multi-functional soft porous robots.
|
| |
| 15:00-16:30, Paper ThI2I.186 | Add to My Program |
| Uncovering Communication Bottlenecks in Scalable ROS 2 Deployments on Kubernetes for Cloud/Edge Robotics |
|
| Zhang, Yongzhou | Karlsruhe University of Applied Sciences |
| Waldhorst, Oliver | Karlsruhe University of Applied Sciences |
| Hein, Björn | Karlsruhe University of Applied Sciences |
Keywords: Networked Robots, Software, Middleware and Programming Environments, Engineering for Robotic Systems
Abstract: Containerization and orchestration using cloud-native technologies enable scalable deployment of robotic software. Integrating ROS 2 with Kubernetes offers a flexible infrastructure, but also introduces a complex, multi-layered communication stack, from DDS middleware to container networks and the physical layer. Each layer adds overhead and variability that impact application-level performance. This paper presents a comprehensive analysis of communication performance across the cloud–edge–robot continuum, focusing on throughput and one-way latency in scalable ROS 2 deployments. We evaluate communication across intra-robot, edge, and cloud segments using wired and wireless connections, including emerging technologies like Wi-Fi 7 and high-speed LAN. Using a Kubernetes-based testbed, we investigate various ROS 2 middlewares, CNI plugins, QoS configurations, and encryption options. Our experiments reveal the impact of network overlays, routing paths, and middleware choices on latency and bandwidth. Despite the inherent complexity, the results confirm the feasibility of deploying ROS 2 in orchestrated, scalable environments. We summarize key insights as practical takeaways, many of which apply beyond Kubernetes, to guide the design of robust cloud/edge robotic systems.
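As a flavor of the kind of measurement involved, the following is a minimal rclpy latency probe; it is a sketch, not the authors' benchmarking tool: the topic name and 10 Hz rate are arbitrary, and cross-host use assumes synchronized clocks (e.g., via PTP).

```python
# Minimal rclpy one-way latency probe (illustrative sketch).
import time
import rclpy
from rclpy.node import Node
from std_msgs.msg import Int64

class LatencyProbe(Node):
    def __init__(self):
        super().__init__('latency_probe')
        self.pub = self.create_publisher(Int64, 'latency_probe', 10)
        self.create_subscription(Int64, 'latency_probe', self.on_msg, 10)
        self.create_timer(0.1, self.tick)           # publish at 10 Hz

    def tick(self):
        msg = Int64()
        msg.data = time.time_ns()                   # send-side timestamp
        self.pub.publish(msg)

    def on_msg(self, msg):
        one_way_us = (time.time_ns() - msg.data) / 1e3
        self.get_logger().info(f'one-way latency: {one_way_us:.1f} us')

rclpy.init()
rclpy.spin(LatencyProbe())
```

As written, the node measures its own DDS loopback; running the subscriber half on another host or pod turns it into the cross-segment one-way measurement studied in the paper.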
|
| |
| 15:00-16:30, Paper ThI2I.187 | Add to My Program |
| Wirelessly Powered Zero Net Magnetic Torque Motor for Tissue Regenerating Robotic Implant |
|
| Davies, Jack | The University of Sheffield |
| Liu, Jialun | The University of Sheffield |
| Duffield, Cameron | The University of Sheffield |
| Zhao, Zihan | The University of Sheffield |
| Damian, Dana | The University of Sheffield |
| Miyashita, Shuhei | The University of Sheffield |
Keywords: Medical Robots and Systems, Mechanism Design, Micro/Nano Robots
Abstract: In biomedical engineering, robotic implants have introduced new ways to restore and improve bodily function and to regenerate tissue. A significant challenge in the design of these devices is to safely actuate them for weeks or months while they reside in a patient's body. The application of a rotating magnetic field offers a solution to remotely transfer torque. However, this method exerts a net torque on any body within the field, which causes rotational motion of the implant. Here we present a wirelessly driven magnetic motor that can be actuated by an external magnetic field, generated by one or more electromagnetic coils, to control a robotic implant. Due to its magnetic torque-canceling mechanism, this wireless motor is actuatable with a single coil and produces no net torque on the entire body. When physically tested, the motor produced around 0.5 mNm of torque, which is comparable to conventional ungeared motors of the same size. The motor was demonstrated in a robotic implant and successfully applied force to stretch a porcine esophagus.
|
| |
| 15:00-16:30, Paper ThI2I.188 | Add to My Program |
| UltraHiT: A Hierarchical Transformer Architecture for Generalizable Internal Carotid Artery Robotic Ultrasonography |
|
| Wang, Teng | Tsinghua University |
| Jiang, Haojun | Tsinghua University |
| Wang, Yuxuan | Xidian University |
| Sun, Zhenguo | Beijing Academy of Artificial Intelligence |
| Yan, Xiangjie | Tsinghua University |
| Li, Xiang | Tsinghua University |
| Huang, Gao | Tsinghua University |
Keywords: Medical Robots and Systems, Imitation Learning, Computer Vision for Medical Robotics
Abstract: Carotid ultrasound is crucial for the assessment of cerebrovascular health, particularly the internal carotid artery (ICA). While previous research has explored automating carotid ultrasound, none has tackled the challenging ICA. This is primarily due to its deep location, tortuous course, and significant individual variations, which greatly increase scanning complexity. To address this, we propose a Hierarchical Transformer-based decision architecture, namely UltraHiT, that integrates high-level variation assessment with low-level action decision. Our motivation stems from conceptualizing individual vascular structures as morphological variations derived from a standard vascular model. The high-level module identifies variation and switches between two low-level modules: an adaptive corrector for variations, or a standard executor for normal cases. Specifically, both the high-level module and the adaptive corrector are implemented as causal transformers that generate predictions based on the historical scanning sequence. To ensure generalizability, we collected the first large-scale ICA scanning dataset comprising 164 trajectories and 72K samples from 28 subjects of both genders. Based on the above innovations, our approach achieves a 95% success rate in locating the ICA on unseen individuals, outperforming baselines and demonstrating its effectiveness. Project website: https://ultrahit-thu.github.io/UltraHiT/.
|
| |
| 15:00-16:30, Paper ThI2I.189 | Add to My Program |
| Impact-Robust Posture Optimization for Aerial Manipulation |
|
| Afifi, Amr | University of Twente |
| Gazar, Ahmad | Max-Planck Institute for Intelligent Systems |
| Alonso-Mora, Javier | Delft University of Technology |
| Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
| Franchi, Antonio | University of Twente / Sapienza University of Rome |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Multi-Contact Whole-Body Motion Planning and Control
Abstract: We present a novel method for optimizing the posture of kinematically redundant torque-controlled robots to improve robustness during impacts. A rigid impact model is used as the basis for a configuration-dependent metric that quantifies the variation between pre- and post-impact velocities. By finding configurations (postures) that minimize this metric, spikes in the robot's state and input commands can be significantly reduced during impacts, improving safety and robustness. The problem of identifying impact-robust postures is posed as a min–max optimization of the metric. To overcome the real-time intractability of the problem, we reformulate it as a gradient-based motion task that iteratively guides the robot towards configurations that minimize the proposed metric. This task is embedded within a task-space inverse dynamics (TSID) whole-body controller, enabling seamless integration with other control objectives. The method is applied to a kinematically redundant aerial manipulator performing repeated point contact tasks. We test our method inside a realistic physics simulator and compare it with the nominal TSID. Our method leads to a reduction (up to 51% w.r.t. standard TSID) of post-impact spikes in the robot's configuration and successfully avoids actuator saturation. Moreover, we demonstrate the importance of kinematic redundancy for impact robustness using additional numerical simulations on a quadruped and a humanoid robot, resulting in up to a 45% reduction of post-impact spikes in the robot's state w.r.t. the nominal TSID.
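For readers unfamiliar with rigid impact maps, the following is one standard inelastic formulation consistent with the abstract's description; it is a plausible basis for such a metric, and the authors' exact formulation may differ.

```latex
% One standard rigid, inelastic impact map (illustrative; the paper's
% exact metric is not reproduced here). With joint-space inertia M(q)
% and contact Jacobian J(q), enforcing J \dot{q}^{+} = 0 at impact gives:
\begin{aligned}
  \dot{q}^{+} &= \left(I - M^{-1} J^{\top}\left(J M^{-1} J^{\top}\right)^{-1} J\right)\dot{q}^{-},\\
  \Delta(q)   &= \bigl\lVert \dot{q}^{+} - \dot{q}^{-} \bigr\rVert,
\end{aligned}
% and impact-robust postures then solve the min-max problem
%   q^{*} = \arg\min_{q} \; \max_{\lVert \dot{q}^{-} \rVert = 1} \Delta(q).
```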
|
| |
| 15:00-16:30, Paper ThI2I.190 | Add to My Program |
| Koopman Representation of Nonlinear Virtual Environments in Kinesthetic Haptic Systems |
|
| Zhou, Yanting | McGill University |
| Kovecses, Jozsef | McGill University |
| Forbes, James Richard | McGill University |
Keywords: Haptics and Haptic Interfaces
Abstract: Rendering haptic feedback with nonlinear virtual environments (VEs) is important in many applications that require highly accurate force feedback. This paper considers the use of the Koopman operator to represent a nonlinear VE interacting with a haptic system. Simulation and experimental results demonstrate that the proposed method provides an effective representation of the nonlinear dynamics of a Duffing-oscillator VE. A multi-user study further confirmed this conclusion. In addition, a closed-loop (CL) stability analysis is performed leveraging the Koopman representation of the nonlinear VE to assess the stability of the overall haptic system. This alternative way of representing nonlinear VEs enables a convenient CL stability analysis that is less conservative than traditional passivity-based methods. Since a linear combination of all lifted states is used to represent the nonlinearity, such a representation is also more robust to uncertainties in the modelling of the haptic device than a traditional nonlinear model.
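A common way to obtain such a finite-dimensional Koopman representation is extended dynamic mode decomposition (EDMD); the sketch below fits a lifted linear model to Duffing-oscillator snapshots, with illustrative monomial observables rather than the paper's chosen lifting functions.

```python
# EDMD sketch: fit a linear Koopman matrix in a lifted observable space.
import numpy as np

def lift(X):
    """Lift states [x, xdot] with monomials suited to a Duffing-type VE."""
    x, xd = X
    return np.vstack([x, xd, x**2, x**3, x * xd, np.ones_like(x)])

def fit_koopman(X, X_next):
    """Least-squares K with lift(x_next) ~= K @ lift(x)."""
    Phi, Phi_next = lift(X), lift(X_next)
    return Phi_next @ np.linalg.pinv(Phi)

# Snapshots of a damped Duffing oscillator xdd = -x - x^3 - 0.1*xd, dt = 1 ms.
dt, n = 1e-3, 5000
x, xd = np.empty(n), np.empty(n)
x[0], xd[0] = 0.5, 0.0
for k in range(n - 1):
    xdd = -x[k] - x[k]**3 - 0.1 * xd[k]
    xd[k + 1] = xd[k] + dt * xdd
    x[k + 1] = x[k] + dt * xd[k + 1]
K = fit_koopman(np.array([x[:-1], xd[:-1]]), np.array([x[1:], xd[1:]]))
print(np.abs(np.linalg.eigvals(K)))   # lifted linear dynamics (stable: <= ~1)
```

The payoff is that closed-loop stability can then be studied on the linear lifted system, which is what makes the analysis less conservative than passivity arguments.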
|
| |
| 15:00-16:30, Paper ThI2I.191 | Add to My Program |
| Vision-Based Reasoning with Topology-Encoded Graphs for Anatomical Path Disambiguation in Robot-Assisted Endovascular Navigation |
|
| Zhao, Jiyuan | Tongji University |
| Shi, Zhengyu | Tongji University |
| Tian, Wentong | Tongji University |
| Yao, Tianliang | The Chinese University of Hong Kong (CUHK) |
| Liu, Dong | Shanghai Aopeng Medical Technology Co., Ltd |
| Liu, Tao | aopeng |
| Wu, Yizhe | Department of Cardiology, Zhongshan Hospital, Fudan University, Shanghai Institute of Cardiovascular Diseases, National Clinical |
| Qi, Peng | Tongji University |
|
|
| |
| 15:00-16:30, Paper ThI2I.192 | Add to My Program |
| Tactile Hide and Seek: Bimanual Object Blind Search and Retrieval Via Tactile-Only Feedback |
|
| Fu, Xiangyu | Technical University of Munich |
| Xing, Hao | Technical University of Munich (TUM) |
| Armleder, Simon | Technische Universität München |
| Shen, Wenlan | Technical University of Munich |
| Wang, Fengyi | Technical University of Munich |
| Guadarrama Olvera, Julio Rogelio | Technical University of Munich |
| Cheng, Gordon | Technical University of Munich |
Keywords: Force and Tactile Sensing, Dual Arm Manipulation, Sensor Fusion
Abstract: Locating and identifying objects in vision-denied environments is a critical challenge for intelligent robot systems. To address the limitations of vision, we present a tactile-only method for object search and recognition using custom tactile skin sensors on robot hands. The method involves searching for an object in a vision-denied environment with a tactile hide-and-seek strategy. Upon contact, the system employs a novel two-phase classification process: an initial single-handed classification by pushing the object, followed by a two-handed verification stage that incorporates size measurement to confirm the object's identity and reduce critical errors. To support this approach, we introduce the HAS (Hide-and-Seek) dataset, a large-scale, multimodal tactile dataset of 1.1 million samples collected with custom sensor hardware. Our system achieves an object classification accuracy of 91.1% and a weight classification accuracy of 83.1% on the HAS dataset, with a strict joint accuracy of 79.6%. The full online pipeline attains a 61.4% success rate in real-world identification, with the bimanual verification stage further correcting up to 17.6% of single-hand errors. Comprehensive ablation studies validate the contribution of individual sensor modalities and demonstrate the effectiveness of our tactile-only method for autonomous operation in a non-vision environment. Our project page is available at tactile-hide-and-seek.github.io.
|
| |
| 15:00-16:30, Paper ThI2I.193 | Add to My Program |
| A Model Predictive Control Approach to Blending in Shared Control |
|
| Jabbour, Elio | Inria, Auctus |
| Vulliez, Margot | Inria, Auctus |
| Préault, Célestin | CESI Lineact |
| Padois, Vincent | Inria, Auctus |
Keywords: Telerobotics and Teleoperation, Human-Robot Teaming
Abstract: Shared control aims to assist human operators using robots in physically and cognitively demanding tasks that cannot be fully automated, as they require human expertise and deliberative abilities. Sharing control for a given task typically involves blending algorithms that combine human control inputs and (pre)planned assistance trajectories. Conventional blending techniques, such as Linear Blending, compute a combined output but guarantee neither the feasibility of the blended motion nor the optimality of the combined decision. In the context of teleoperation, this paper presents a formulation where blending is defined as a constrained optimal control problem. Model Predictive Control is used to determine a feasible blended trajectory through a receding-horizon constrained optimization. The proposed method is evaluated in a 13-participant pick-and-place teleoperation study and compared to Linear Blending and unassisted teleoperation. The experimental results demonstrate the superiority of the proposed shared control framework in terms of safety and performance as well as physical and cognitive comfort.
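In schematic form, such a blending optimal control problem can be written as below; the quadratic weights, horizon, and constraint sets are illustrative assumptions, not the paper's exact formulation.

```latex
% Schematic blending OCP (weights, horizon, and sets are illustrative).
\begin{aligned}
  \min_{u_{0:N-1}} \quad & \sum_{k=0}^{N-1}
      \alpha\,\lVert u_k - u^{h}_k \rVert^2
    + (1-\alpha)\,\lVert x_k - x^{a}_k \rVert^2 \\
  \text{s.t.} \quad & x_{k+1} = f(x_k, u_k), \qquad
      x_k \in \mathcal{X}, \quad u_k \in \mathcal{U},
\end{aligned}
% where u^h is the human input, x^a the planned assistance trajectory,
% and alpha in [0,1] the blending weight. Unlike linear blending
% u = alpha*u_h + (1-alpha)*u_a, feasibility holds by construction.
```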
|
| |
| 15:00-16:30, Paper ThI2I.194 | Add to My Program |
| Towards Safe Autonomous Surgical Tasks with Control Barrier Functions |
|
| Iacono, Cristina | CREATE Consortium |
| De Risi, Paolino | Università Degli Studi Di Napoli Federico II |
| Moccia, Rocco | Istituto Italiano Di Tecnologia |
| Siciliano, Bruno | Univ. Napoli Federico II |
| Ficuciello, Fanny | Università Di Napoli Federico II |
Keywords: Surgical Robotics: Laparoscopy, Motion Control, Optimization and Optimal Control
Abstract: Safety is of utmost importance in surgical robots, as they operate in a complex and dynamic environment where the patient's health directly depends on the success of the surgical procedure. One of the main difficulties in the control of surgical manipulators is efficiently encoding dynamic nonlinear safety constraints into trajectory planning and robot control strategies. Control Barrier Functions (CBFs) represent a valuable control method for safety-critical environments such as the surgical one, since their rigorous formulation aims at ensuring safety in controlled dynamic systems. This work represents a step forward in autonomous surgical task execution, since it defines Lipschitz-continuous, critical, and autonomously prioritized dynamic constraints enforced through a CBF framework for the safe execution of surgical robotic tasks. The proposed framework, moreover, leverages Dual Quaternion (DQ) algebra for a unified and computationally efficient representation of geometric tasks and constraints, allowing for the straightforward definition of complex, time-varying surgical constraints. The safety framework is tested in simulation on the da Vinci Research Kit (dVRK) CoppeliaSim simulator and with the real dVRK robot in several surgical sub-tasks.
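The safety filtering mechanism underlying CBF approaches reduces to a small quadratic program; the single-constraint, control-affine sketch below (which admits a closed-form solution) illustrates the idea, not the paper's dual-quaternion formulation.

```python
# Minimal control-barrier-function QP filter (single constraint,
# control-affine system xdot = f(x) + g(x) u); an illustrative sketch.
import numpy as np

def cbf_filter(u_des, h, grad_h, f, g, alpha=5.0):
    """Minimally modify u_des so that hdot + alpha*h >= 0.

    h: barrier value h(x) (h >= 0 means safe); grad_h: dh/dx at x;
    f, g: drift vector and input matrix evaluated at x.
    """
    a = grad_h @ g                 # constraint row: a @ u >= b
    b = -alpha * h - grad_h @ f
    slack = a @ u_des - b
    if slack >= 0.0:               # nominal input already safe
        return u_des
    # Closed-form QP solution: project u_des onto the half-space a @ u >= b.
    return u_des - (slack / (a @ a)) * a

# Example: single integrator xdot = u with barrier h(x) = x (keep x >= 0).
u = cbf_filter(u_des=np.array([-2.0]), h=0.1,
               grad_h=np.array([1.0]), f=np.array([0.0]), g=np.eye(1))
print(u)   # braked to u = -alpha*h = -0.5
```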
|
| |
| 15:00-16:30, Paper ThI2I.195 | Add to My Program |
| SPOT: Spatio-Temporal Trajectory Planning for UAVs in Unknown Dynamic Environments |
|
| Srivastava, Astik | IIIT Hyderabad |
| Bitla, Bhanu Teja | IIIT Hyderabad |
| J Chackenkulam, Thomas | IIT BHU |
| Thomas, Antony | International Institute of Information Technology, Hyderabad |
| Krishna, Madhava | IIIT Hyderabad |
Keywords: Aerial Systems: Perception and Autonomy, Collision Avoidance, Motion and Path Planning
Abstract: We address the problem of reactive motion planning for quadrotors operating in unknown environments with dynamic obstacles. Our approach leverages a 4-dimensional spatio-temporal planner, integrated with vision-based Safe Flight Corridor (SFC) generation and trajectory optimization. Unlike prior methods that rely on map fusion, our framework is mapless, enabling collision avoidance directly from perception while reducing computational overhead. Dynamic obstacles are detected and tracked using a vision-based object segmentation and tracking pipeline, allowing robust classification of static versus dynamic elements in the scene. To further enhance robustness, we introduce a backup planning module that reactively avoids dynamic obstacles when no direct path to the goal is available, mitigating the risk of collisions during deadlock situations. We validate our method extensively in both simulation and real-world hardware experiments, and benchmark it against state-of-the-art approaches, showing significant advantages for reactive UAV navigation in dynamic, unknown environments.
|
| |
| 15:00-16:30, Paper ThI2I.196 | Add to My Program |
| How to Train Your Tactile Model: Tactile Perception with Multi-Fingered Robot Hands |
|
| Ford, Christopher | University of Bristol |
| Shi, Kaichen | University of Bristol |
| Butcher, Laura Elizabeth | University of Bristol |
| Lepora, Nathan | University of Bristol |
| Psomopoulou, Efi | University of Bristol |
Keywords: Force and Tactile Sensing
Abstract: Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they require retraining for each new sensor due to differences in lens properties, illumination, and sensor wear. Here we introduce TacViT, a novel tactile perception model based on Vision Transformers, designed to generalize to new sensor data. TacViT leverages global self-attention mechanisms to extract robust features from tactile images, enabling accurate contact property inference even on previously unseen sensors. This capability significantly reduces the need for data collection and retraining, accelerating the deployment of new sensors. We evaluate TacViT on sensors for a five-fingered robot hand and demonstrate its superior generalization performance compared to CNNs. Our results highlight TacViT’s potential to make tactile sensing more scalable and practical for real-world robotic applications.
|
| |
| 15:00-16:30, Paper ThI2I.197 | Add to My Program |
| OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-To-Robot Action Transfer |
|
| Wang, Kuanning | Fudan University |
| Fan, Ke | Fudan University |
| Fu, Yuqian | Fudan University |
| Lin, Siyu | Fudan University |
| Luo, Hu | Fudan University |
| Seita, Daniel | University of Southern California |
| Fu, Yanwei | Fudan University |
| Jiang, Yu-Gang | Fudan University |
| Xue, Xiangyang | Fudan University |
Keywords: Learning from Demonstration, Perception for Grasping and Manipulation, Sensor Fusion
Abstract: We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
|
| |
| 15:00-16:30, Paper ThI2I.198 | Add to My Program |
| NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving |
|
| Gao, Yuan | Technical University of Munich |
| Piccinini, Mattia | Technical University of Munich |
| Brusnicki, Roberto | Technical University of Munich |
| Zhang, Yuchen | Technical University of Munich |
| Betz, Johannes | Technical University of Munich |
Keywords: Semantic Scene Understanding, Data Sets for Robotic Vision, Performance Evaluation and Benchmarking
Abstract: Understanding risk in autonomous driving requires not only perception and prediction, but also high-level reasoning about agent behavior and context. Current Vision Language Model (VLM)-based methods primarily ground agents in static images and provide qualitative judgments, lacking the spatio-temporal reasoning needed to capture how risks evolve over time. To address this gap, we propose NuRisk, a comprehensive Visual Question Answering (VQA) dataset comprising 2.9K scenarios and 1.1M agent-level samples, built on real-world data from nuScenes and Waymo and complemented with safety-critical scenarios from the CommonRoad simulator. The dataset provides bird's-eye view (BEV) based sequential images with quantitative, agent-level risk annotations, enabling spatio-temporal reasoning. We benchmark well-known VLMs across different prompting techniques and find that they fail to perform explicit spatio-temporal reasoning, resulting in a peak accuracy of 33% at high latency. To address these shortcomings, our fine-tuned 7B VLM agent improves accuracy to 41% and reduces latency by 75%, demonstrating explicit spatio-temporal reasoning capabilities that proprietary models lacked. While this represents a significant step forward, the modest accuracy underscores the profound challenge of the task, establishing NuRisk as a critical benchmark for advancing spatio-temporal reasoning in autonomous driving. More information can be found at https://github.com/TUM-AVS/NuRisk.
|
| |
| 15:00-16:30, Paper ThI2I.199 | Add to My Program |
| Fabric Pneumatic Artificial Muscle-Based Head-Neck Exosuit: Design, Evaluation, and Modeling |
|
| Schäffer, Katalin | University of Notre Dame |
| Bales, Ian | University of Utah |
| Zhang, Haohan | University of Utah |
| McGuinness, Margaret | University of Notre Dame |
Keywords: Wearable Robotics, Prosthetics and Exoskeletons, Soft Robot Applications
Abstract: Wearable exosuits assist human movement in tasks ranging from rehabilitation to daily activities; specifically, head-neck support is necessary for patients with certain neurological disorders. Rigid-link exoskeletons have been shown to enable head-neck mobility compared to static braces, but their bulkiness and restrictive structure inspire designs using "soft" actuation methods. In this paper, we propose a fabric pneumatic artificial muscle-based exosuit design for head-neck support. We describe the design of our prototype and a physics-based model, enabling us to derive the actuator pressures required to compensate for gravitational load. Our modeled range-of-motion and workspace analysis indicates that the limited actuator lengths impose slight limitations (83% workspace coverage), and gravity compensation imposes a more significant limitation (43% workspace coverage). We introduce compression force along the neck as a novel, potentially comfort-related metric. We further apply our model to compare the torque output of various actuator placement configurations, allowing us to select a design with stability in lateral deviation and high axial rotation torques. The model correctly predicts trends in measured data, where wrapping the actuators around the neck is not a significant factor. Our test dummy and human user demonstrations confirm that the exosuit can provide functional head support and trajectory tracking, underscoring the potential of artificial muscle-based soft actuation for head-neck mobility assistance.
|
| |
| 15:00-16:30, Paper ThI2I.200 | Add to My Program |
| Sight Over Site: Perception-Aware Reinforcement Learning for Efficient Robotic Inspection |
|
| Kuhlmann, Richard | ETH Zürich |
| Wolfram, Jakob | ETH Zurich |
| Sun, Boyang | ETH Zurich |
| Xing, Jiaxu | University of Zurich |
| Scaramuzza, Davide | University of Zurich |
| Pollefeys, Marc | ETH Zurich |
| Cadena, Cesar | ETH Zurich |
Keywords: Deep Learning for Visual Perception, Vision-Based Navigation
Abstract: Autonomous inspection is a central problem in robotics, with applications ranging from industrial monitoring to search-and-rescue. Traditionally, inspection has often been reduced to navigation tasks, where the objective is to reach a predefined location while avoiding obstacles. However, this formulation captures only part of the real inspection problem. In real-world environments, the inspection targets may become visible well before their exact coordinates are reached, making further movement both redundant and inefficient. What matters more for inspection is not simply arriving at the target’s position, but positioning the robot at a viewpoint from which the target becomes observable. In this work, we revisit inspection from a perception-aware perspective. We propose an end-to-end reinforcement learning framework that explicitly incorporates target visibility as the primary objective, enabling the robot to find the shortest trajectory that guarantees visual contact with the target without relying on a map. The learned policy leverages both perceptual and proprioceptive sensing and is trained entirely in simulation, before being deployed to a real-world robot. We further develop an algorithm to compute ground-truth shortest inspection paths, which provides a reference for evaluation. Through extensive experiments, we show that our method outperforms existing classical and learning-based navigation approaches, yielding more efficient inspection trajectories in both simulated and real-world settings.
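The perception-aware objective can be conveyed by a toy reward: terminate with a bonus once an unoccluded line of sight to the target exists, and otherwise penalize time. The geometry check and coefficients below are illustrative assumptions, not the paper's reward shaping.

```python
# Toy visibility-based inspection reward (illustrative sketch).
import numpy as np

def visible(robot_pos, target_pos, obstacles, step=0.05):
    """Coarse ray march from robot to target through spherical obstacles."""
    direction = target_pos - robot_pos
    dist = np.linalg.norm(direction)
    for s in np.arange(0.0, dist, step):
        p = robot_pos + direction * (s / dist)
        for center, radius in obstacles:
            if np.linalg.norm(p - center) < radius:
                return False                  # line of sight blocked
    return True

def inspection_reward(robot_pos, target_pos, obstacles):
    if visible(robot_pos, target_pos, obstacles):
        return 10.0, True                     # visual contact achieved: done
    return -0.01, False                       # time penalty while searching
```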
|
| |
| 15:00-16:30, Paper ThI2I.201 | Add to My Program |
| CoPaRo: A Compact, Backdrivable 6-DOF Hybrid Parallel Robot with Serial-Like Form Factor and Large Workspace |
|
| Verret, Samuel | Université Laval |
| Gosselin, Clement | Université Laval |
Keywords: Parallel Robots, Mechanism Design, Physical Human-Robot Interaction
Abstract: A novel, compact, backdrivable 6-degree-of-freedom (DOF) hybrid parallel robot with a large axisymmetric workspace is proposed, referred to as CoPaRo, short for Compact Parallel Robot. The architecture achieves a high workspace-to-footprint ratio comparable to that of serial robots. The proposed robot is well suited for physical human-robot interaction (pHRI) due to its low inertia, backdrivability, and large workspace. A complete kinematic analysis is provided, including forward and inverse kinematics and velocity equations. All singularity conditions of the proposed architecture are identified, and the complete usable workspace is presented, accounting for singularities, mechanical interferences, and numerical stability. A CAD model and computer animations of the robot are provided to illustrate its motion, highlighting both the compact footprint and the large workspace. The actuators are positioned close to the base and transmit motion to distal joints via pulleys to reduce the robot's inertia. Direct-drive or quasi-direct-drive actuators can be used to enable backdrivability.
|
| |
| 15:00-16:30, Paper ThI2I.202 | Add to My Program |
| Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments |
|
| Payandeh, Amirreza | George Mason University |
| Pokhrel, Anuj | George Mason University |
| Song, Daeun | Ewha Womans University |
| Zampieri, Marcos | George Mason University |
| Xiao, Xuesu | George Mason University |
Keywords: Imitation Learning, Visual Learning, Representation Learning
Abstract: Large Vision-Language Models (VLMs) have demonstrated potential in enhancing mobile robot navigation in human-centric environments by understanding contextual cues, human intentions, and social dynamics while exhibiting reasoning capabilities. However, their computational complexity and limited sensitivity to continuous numerical data impede real-time performance and precise motion control. To this end, we propose Narrate2Nav, a real-time vision-action model that leverages a self-supervised learning framework based on the Barlow Twins redundancy reduction loss to embed implicit natural language reasoning, social cues, and human intentions within a visual encoder. The model combines RGB inputs, motion commands, and textual signals of scene context during training to bridge from robot observations to low-level motion commands for short-horizon point-goal navigation during deployment. Extensive evaluation of Narrate2Nav across diverse and challenging scenarios in an unseen offline dataset, complemented by a small-scale real-world experiment, demonstrates a 52.94% improvement over the next best baseline in offline testing, with consistent gains observed in real-world evaluations.
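The Barlow Twins objective mentioned above has a compact generic form; the implementation below follows the published loss, with an illustrative lambda weight.

```python
# Barlow Twins redundancy-reduction loss (generic form).
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """z_a, z_b: (batch, dim) embeddings of two views of the same sample."""
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)   # normalize per dimension
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    n, d = z_a.shape
    c = (z_a.T @ z_b) / n                             # (dim, dim) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()    # invariance term
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()  # redundancy term
    return on_diag + lam * off_diag

# Example: two noisy "views" of the same batch of embeddings.
z = torch.randn(256, 128)
print(barlow_twins_loss(z, z + 0.1 * torch.randn_like(z)).item())
```

In Narrate2Nav's setting, the two branches would carry the visual embedding and the text-derived reasoning signal, so that language supervision shapes the encoder without being needed at deployment.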
|
| |
| 15:00-16:30, Paper ThI2I.203 | Add to My Program |
| Aion: Towards Hierarchical 4D Scene Graphs with Temporal Flow Dynamics |
|
| Catalano, Iacopo | University of Turku |
| Montijano, Eduardo | Universidad De Zaragoza |
| Placed, Julio A. | Instituto Tecnológico De Aragón |
| Civera, Javier | Universidad De Zaragoza |
| Peña Queralta, Jorge | Zurich University of Applied Sciences |
Keywords: Mapping, Human-Centered Robotics, RGB-D Perception
Abstract: Autonomous navigation in dynamic environments requires spatial representations that capture both semantic structure and temporal evolution. 3D Scene Graphs (3DSGs) provide hierarchical multi-resolution abstractions that encode geometry and semantics, but existing extensions toward dynamics largely focus on individual objects or agents. In parallel, Maps of Dynamics (MoDs) model typical motion patterns and temporal regularities, yet are usually tied to grid-based discretizations that lack semantic awareness and do not scale well to large environments. In this paper we introduce Aion, a framework that embeds temporal flow dynamics directly within a hierarchical 3DSG, effectively incorporating the temporal dimension. Aion employs a graph-based sparse MoD representation to capture motion flows over arbitrary time intervals and attaches them to navigational nodes in the scene graph, yielding more interpretable and scalable predictions that improve planning and interaction in complex dynamic environments.
|
| |
| 15:00-16:30, Paper ThI2I.204 | Add to My Program |
| CooperDrive: Enhancing Driving Decisions through Cooperative Perception |
|
| Qu, Deyuan | Toyota Motor North America, InfoTech Labs |
| Chen, Qi | Toyota Motor North America, InfoTech Labs |
| Altintas, Onur | Toyota Motor North America, R&D |
| Shimizu, Takayuki | Toyota Motor North America, R&D |
Keywords: Computer Vision for Transportation, Motion and Path Planning, Sensor Fusion
Abstract: Autonomous vehicles equipped with robust onboard perception, localization, and planning still face limitations in occlusion and non-line-of-sight (NLOS) scenarios, where delayed reactions can increase collision risk. We propose CooperDrive, a cooperative perception framework that augments situational awareness and enables earlier, safer driving decisions. CooperDrive offers two key advantages: (i) each vehicle retains its native perception, localization, and planning stack, and (ii) a lightweight object-level sharing and fusion strategy bridges perception and planning. Specifically, CooperDrive reuses detector Bird’s-Eye View (BEV) features to estimate accurate vehicle poses without additional heavy encoders, thereby reconstructing BEV representations and feeding the planner with low latency. On the planning side, CooperDrive leverages the expanded object set to anticipate potential conflicts earlier and adjust speed and trajectory proactively, thereby transforming reactive behaviors into predictive and safer driving decisions. Real-world closed-loop tests at occlusion-heavy NLOS intersections demonstrate that CooperDrive increases reaction lead time, minimum time-to-collision (TTC), and stopping margin, while requiring only ~90 kbps bandwidth and maintaining an average end-to-end latency of 89 ms.
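Minimum TTC, one of the reported metrics, is straightforward to compute from tracked states; the constant-velocity assumption below is a common simplification, not necessarily the paper's estimator.

```python
# Constant-velocity time-to-collision between ego and a fused object track.
import numpy as np

def time_to_collision(p_ego, v_ego, p_obj, v_obj, collision_radius=2.0):
    """Earliest t >= 0 with ||p_rel + t*v_rel|| <= collision_radius, else inf."""
    p, v = p_obj - p_ego, v_obj - v_ego
    a, b, c = v @ v, 2 * p @ v, p @ p - collision_radius**2
    if c <= 0:
        return 0.0                        # already within the collision radius
    disc = b * b - 4 * a * c
    if a < 1e-9 or disc < 0:
        return np.inf                     # no approach on this course
    t = (-b - np.sqrt(disc)) / (2 * a)    # entry time into the radius
    return t if t >= 0 else np.inf
```

Sharing occluded object tracks enlarges the set over which this minimum is taken, which is exactly how earlier reactions become possible.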
|
| |
| 15:00-16:30, Paper ThI2I.205 | Add to My Program |
| Decentralized Admittance Control for a Multi–manipulator System: Theory and Experiments |
|
| Carriero, Graziano | University of Basilicata |
| Sileo, Monica | University of Basilicata |
| Fregnan, Sebastiano | Lund University |
| Pierri, Francesco | Università Della Basilicata |
| Caccavale, Fabrizio | Università Degli Studi Della Basilicata |
| Karayiannidis, Yiannis | Lund University |
Keywords: Multi-Robot Systems, Cooperating Robots, Compliance and Impedance Control
Abstract: This paper presents a decentralized control framework for cooperative object transportation with multiple robotic manipulators. In particular, two admittance schemes are designed to regulate external contact wrenches and internal interaction wrenches without a central unit or all-to-all communication. Each manipulator estimates the wrenches exerted by its teammates through a bank of consensus-based observers that exploits a strongly connected communication graph. These estimates feed two local admittance filters: an external filter, computing the reference object trajectory while limiting environmental wrenches, and an internal filter, generating the end-effector trajectory to minimize each robot's contribution to the internal wrenches. Experiments carried out with three 7-DOF Franka Emika Panda arms show a marked reduction of both external and internal wrenches, demonstrating the effectiveness and robustness of the proposed approach.
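Each local admittance filter integrates a virtual mass-damper-stiffness model driven by the estimated wrench; the discrete-time sketch below uses illustrative gains and timestep, not the paper's tuning.

```python
# Discrete admittance filter of the kind each manipulator runs locally.
import numpy as np

def admittance_step(x, xdot, wrench, M, D, K, dt=1e-3):
    """Integrate M*xddot + D*xdot + K*x = wrench one step (semi-implicit Euler).

    x, xdot: 6-DOF pose/twist deviation from the nominal trajectory;
    wrench: estimated 6-DOF interaction wrench to be accommodated.
    """
    xddot = np.linalg.solve(M, wrench - D @ xdot - K @ x)
    xdot = xdot + dt * xddot
    x = x + dt * xdot
    return x, xdot

# Example: critically damped 6-DOF filter responding to a constant 5 N force.
M, K = np.eye(6), 100 * np.eye(6)
D = 2 * np.sqrt(100) * np.eye(6)      # critical damping for M = I, K = 100*I
x, xd = np.zeros(6), np.zeros(6)
for _ in range(1000):
    x, xd = admittance_step(x, xd, np.array([5, 0, 0, 0, 0, 0.0]), M, D, K)
print(x[0])                           # settles near 5 N / 100 N/m = 0.05 m
```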
|
| |
| 15:00-16:30, Paper ThI2I.206 | Add to My Program |
| Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators |
|
| Badithela, Apurva | Princeton University |
| Snyder, David | Princeton University |
| Zha, Lihan | Princeton University |
| Mikhail, Joseph | University of Texas, Austin |
| O'Kelly, Matthew | Waymo |
| Dixit, Anushri | University of California, Los Angeles |
| Majumdar, Anirudha | Princeton University |
Keywords: Probability and Statistical Methods, Performance Evaluation and Benchmarking
Abstract: Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are typically evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned π0 policies on a joint distribution of objects and initial conditions, and find that our approach saves 20–25% of hardware evaluation effort to achieve similar bounds on policy performance.
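The core prediction-powered estimate is simple to state; the sketch below uses a CLT-based interval for brevity, whereas the paper employs non-asymptotic mean-estimation algorithms, and all numbers are synthetic.

```python
# Prediction-powered mean estimate: simulation supplies scale,
# a small paired real/sim set rectifies its bias.
import numpy as np

def ppi_mean(sim_large, sim_paired, real_paired, z=1.96):
    """Estimate = mean(sim_large) + mean(real_paired - sim_paired)."""
    rect = real_paired - sim_paired                  # per-trial bias correction
    est = sim_large.mean() + rect.mean()
    se = np.sqrt(sim_large.var(ddof=1) / len(sim_large)
                 + rect.var(ddof=1) / len(rect))
    return est, (est - z * se, est + z * se)         # point estimate + CI

# Example: 10k sim rollouts, 50 paired hardware trials (synthetic data).
rng = np.random.default_rng(0)
sim_all = rng.binomial(1, 0.70, 10_000).astype(float)   # sim success indicator
sim_pair = rng.binomial(1, 0.70, 50).astype(float)
real_pair = np.clip(sim_pair - rng.binomial(1, 0.08, 50), 0, 1).astype(float)
print(ppi_mean(sim_all, sim_pair, real_pair))
```

The interval width is driven by the variance of the rectifier, so the better the simulator, the fewer hardware trials are needed, which is the source of the reported savings.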
|
| |
| 15:00-16:30, Paper ThI2I.207 | Add to My Program |
| Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities |
|
| Carrión-Ojeda, Dustin | Technical University of Darmstadt |
| Santos-Villafranca, Maria | University of Zaragoza |
| Perez-Yus, Alejandro | University of Zaragoza |
| Bermudez-Cameo, Jesus | University of Zaragoza |
| Guerrero, Jose J. | University of Zaragoza |
| Schaub-Meyer, Simone | Technical University of Darmstadt |
Keywords: Recognition, Deep Learning for Visual Perception, Sensor Fusion
Abstract: Egocentric action recognition enables robots to facilitate human-robot interactions and monitor task progress. Existing methods often rely solely on RGB videos, although additional modalities, such as audio, can improve accuracy under challenging conditions. However, most multimodal approaches assume that all modalities are available at inference time, leading to significant accuracy drops, or even failure, when inputs are missing. To address this limitation, we introduce KARMMA, a multimodal Knowledge distillation framework for egocentric Action Recognition robust to Missing ModAlities that does not require modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that leverages all available modalities while remaining robust to missing ones, enabling deployment across diverse sensor configurations without retraining. Our student uses approximately 50% fewer computational resources than the teacher, resulting in a lightweight and fast model that is well suited for on-robot deployment. Experiments on Epic-Kitchens and Something-Something demonstrate that our student achieves competitive accuracy while significantly reducing performance degradation under missing-modality conditions.
|
| |
| 15:00-16:30, Paper ThI2I.208 | Add to My Program |
| CERNet: Class-Embedding Predictive-Coding RNN for Unified Robot Motion, Recognition, and Confidence Estimation |
|
| Sawada, Hiroki | CY Cergy Paris Universite |
| Pitti, Alexandre | CY Cergy Paris Universite |
| Quoy, Mathias | CY Cergy Paris Universite |
Keywords: Bioinspired Robot Learning, Neurorobotics, Intention Recognition
Abstract: Robots interacting with humans must not only generate learned movements in real-time, but also infer the intent behind observed behaviors and estimate the confidence of their own inferences. This paper proposes a unified model, CERNet, that achieves all three capabilities within a single hierarchical predictive-coding recurrent neural network by leveraging a dynamically updated class embedding vector to unify motor generation and recognition. The model operates in two modes: generation and inference. In the generation mode, the class embedding constrains the hidden state dynamics to a class-specific subspace; in the inference mode, it is optimized online to minimize prediction error, enabling real-time recognition. Validated on a humanoid robot across 26 kinesthetically taught alphabet characters, our hierarchical model achieves 76% lower trajectory reproduction error than a parameter-matched single-layer baseline, maintains motion fidelity under external perturbations, and infers the demonstrated trajectory class online with 68% Top-1 and 81% Top-2 accuracy. Furthermore, internal prediction errors naturally reflect the model's confidence in its recognition. This integration of robust generation, real-time recognition, and intrinsic uncertainty estimation within a single neural network framework offers a compact and extensible approach to motor memory in physical robots, with potential applications in intent-sensitive human-robot collaboration.
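The inference mode can be sketched as online optimization of the embedding alone, with the residual prediction error doubling as a confidence signal; the network interface and sizes below are illustrative assumptions.

```python
# Sketch of CERNet-style inference: freeze the network, optimize only
# the class embedding online to minimize prediction error.
import torch

def infer_class_embedding(rnn, observed, emb_dim=16, steps=30, lr=0.1):
    """rnn(embedding, T) -> predicted trajectory (T, dof); observed: (T, dof)."""
    e = torch.zeros(emb_dim, requires_grad=True)
    opt = torch.optim.Adam([e], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        err = ((rnn(e, observed.shape[0]) - observed) ** 2).mean()
        err.backward()                 # prediction error drives the update
        opt.step()
    return e.detach(), err.item()      # residual error ~ recognition confidence
```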
|
| |
| 15:00-16:30, Paper ThI2I.209 | Add to My Program |
| Precise and Efficient Collision Prediction under Uncertainty in Autonomous Driving |
|
| Kaufeld, Marc | Technical University of Munich |
| Betz, Johannes | Technical University of Munich |
Keywords: Collision Avoidance, Planning under Uncertainty, Probability and Statistical Methods
Abstract: This research introduces two efficient methods to estimate the collision risk of planned trajectories in autonomous driving under uncertain driving conditions. Deterministic collision checks of planned trajectories are often inaccurate or overly conservative, as noisy perception, localization errors, and uncertain predictions of other traffic participants introduce significant uncertainty into the planning process. This paper presents two semi-analytic methods to compute the collision probability of planned trajectories with arbitrary convex obstacles. The first approach evaluates the probability of spatial overlap between an autonomous vehicle and surrounding obstacles, while the second estimates the collision probability based on stochastic boundary crossings. Both formulations incorporate full state uncertainties, including position, orientation and velocity, and achieve high accuracy at computational costs suitable for real-time planning. Simulation studies verify that the proposed methods closely match Monte Carlo results while providing significant runtime advantages, enabling their use in risk-aware trajectory planning. The collision estimation methods are available as open-source software.
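A Monte Carlo reference for such collision probabilities is easy to write down and is the natural baseline against which semi-analytic methods are verified; the disc approximation and Gaussian position uncertainty below are simplifying assumptions for illustration.

```python
# Monte Carlo reference for the collision probability of two uncertain
# agents (disc approximation; Gaussian position uncertainty only).
import numpy as np

def mc_collision_prob(mu_ego, cov_ego, mu_obs, cov_obs, r_sum,
                      n=100_000, seed=0):
    """P(ego and obstacle discs of combined radius r_sum overlap).

    mu_*: mean 2-D positions; cov_*: 2x2 position covariances.
    """
    rng = np.random.default_rng(seed)
    ego = rng.multivariate_normal(mu_ego, cov_ego, n)
    obs = rng.multivariate_normal(mu_obs, cov_obs, n)
    return np.mean(np.linalg.norm(ego - obs, axis=1) < r_sum)

# Example: 3 m gap, 0.3 m position std per axis, combined radius 2 m.
p = mc_collision_prob([0, 0], 0.09 * np.eye(2), [3, 0], 0.09 * np.eye(2), 2.0)
print(f"collision probability ~ {p:.4f}")
```

Sampling at this fidelity is too slow inside a planner loop, which is why closed-form or semi-analytic approximations matter for real-time use.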
|
| |
| 15:00-16:30, Paper ThI2I.210 | Add to My Program |
| MUSE: Multimodal Uncertainty Quantification of State Estimation |
|
| Kim, Minkyung | University of Illinois Urbana-Champaign |
| Che, Henry | University of Illinois, Urbana-Champaign |
| Chandaka, Bhargav | University of Illinois at Urbana-Champaign |
| Pramuanpornsatid, Bhumsitt | University of Illinois Urbana-Champaign |
| Yang, Chengyu | University of Illinois Urbana-Champaign |
| Cheng, Sheng | University of Illinois Urbana-Champaign |
| Wang, Xiaofeng | University of South Carolina |
| Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
| Wang, Shenlong | University of Illinois at Urbana-Champaign |
Keywords: Deep Learning for Visual Perception, SLAM, Visual-Inertial SLAM
Abstract: Accurate visual state estimation has been a central topic in robotics with a wide range of applications in robot navigation, autonomous driving, and autonomous flight. Recent advances in robot perception have led to significant improvements in the accuracy and robustness of state estimation, yet a fundamental challenge remains in how to quantify and calibrate its precision, i.e., how confident we are in an estimate and whether failures can be detected. This issue is particularly pronounced in visual–inertial odometry (VIO), where the heteroscedastic and multimodal nature of the problem makes uncertainty quantification especially difficult. This paper introduces MUSE (Multimodal Uncertainty Quantification of State Estimation), a novel real-time learning-based framework that leverages the strong and efficient sequential modeling capacity of Mamba to estimate localization uncertainty from multiple asynchronous sensor streams. Experiments on both public and in-house datasets demonstrate that MUSE achieves superior reliability and robustness compared to existing uncertainty quantification methods, and ablation studies justify the benefits of its key design choices. We release our source code and dataset at https://github.com/hungdche/MUSE.
|
| |
| 15:00-16:30, Paper ThI2I.211 | Add to My Program |
| Design and Validation of a Soft Self-Centering Gripper for Delicate Object Handling |
|
| Zhang, Xiaoqian | University of Genova and Technical University of Munich |
| Baggetta, Mario | University of Genoa |
| Piazza, Cristina | Technical University Munich (TUM) |
| Berselli, Giovanni | Università Di Genova |
Keywords: Grippers and Other End-Effectors, Grasping, Soft Robot Applications
Abstract: Harvesting, gripping, and handling of fruit and vegetables require end-effectors that ensure grip stability while minimizing surface damage and bruising. Conventional rigid or partially compliant solutions often generate localized load concentrations and require high positioning accuracy, limiting their effectiveness in unstructured environments. This work presents a novel three-finger gripper with a self-centering closing mechanism and soft fingers. The system operates in two stages: first, the fingers slide (FS) along linear guides driven by a dedicated motor to adapt to the object size; second, a separate motor actuates the finger closure to establish the grasp. The finger design is inspired by the handed-shearing auxetic (HSA) actuator, enabling controlled pre-shaping (PS) by adapting to the object geometry. The proposed design was first validated through finite element simulations, comparing PS against passive compliance (PC) under matched load conditions. Results demonstrate that PS significantly improves pressure uniformity and grasp stability. A fully functional prototype was then fabricated via additive manufacturing.
|
| |
| 15:00-16:30, Paper ThI2I.212 | Add to My Program |
| Privacy-Aware LLMs-Assisted Task Planning for Home Robots |
|
| Chen, Zhanjie | Oklahoma State University |
| Sheng, Weihua | Oklahoma State University |
Keywords: Task Planning, Service Robotics
Abstract: Multi-modal large language models (LLMs) are expected to significantly enhance the intelligence of home service robots. However, reliance on cloud processing of raw visual data poses critical privacy risks. To address this problem, we propose a novel two-stage cloud-edge hybrid architecture for robots in domestic environments. This architecture employs a lightweight local LLM to perform sensitive content screening and semantic abstraction before transmitting the data to a more powerful cloud-based LLM for high-level planning and reasoning. Experiments with our end-to-end system demonstrate that it effectively protects a wide range of private data with minimal impact on task success rates. Without modifying cloud models, our approach offers a deployable performance–privacy trade-off for home robots, advancing safe and socially acceptable autonomy.
|
| |
| 15:00-16:30, Paper ThI2I.213 | Add to My Program |
| DRILL: Deployment & Reading of In-Ground Low-Cost Soil Moisture Logging Sensors Using an Autonomous Ground Robot |
|
| Deb, Aarya | Purdue University |
| Norwood, Joseph | Princeton University |
| Vassilev, Martin | University of Georgia |
| MacDonald, Sean | Purdue University |
| Kim, Kitae | Purdue University |
| Cappelleri, David | Purdue University |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation
Abstract: Accurate soil moisture data is crucial for precision irrigation, and manual deployment of existing sensors is labor-intensive and expensive, especially in cornfield environments. We present DRILL, an unmanned ground vehicle (UGV) for autonomously deploying and reading low-cost biodegradable soil moisture sensors. The platform consists of a mechanical drilling head (linear actuator, auger drill, 16-slot encoded sensor dispenser, and a guiding chute) and a reading head with a vector network analyzer (VNA), combined with a vision-guided navigation system for logging soil moisture data without human intervention. The robot platform has been experimentally validated in real-world farm environments over an extensive period and achieved a success rate of 93.75% for the deployment cycle and 100% for the reading cycle, with a mean cycle time of under a minute per sensor. Of 330 sensor readings with the VNA, 73.3% overall produced valid peaks in the 100–160 MHz range, indicating a valid soil moisture reading, with over 95.3% valid during the first half of the study, suggesting sensor aging over time. With a mean in-ground plane alignment error of 1.3 cm in X and 0.6 cm in Y, well within the 4 cm tolerance in each axis, DRILL demonstrates a scalable platform for autonomous soil monitoring and timely data collection in precision agriculture.
|
| |
| 15:00-16:30, Paper ThI2I.214 | Add to My Program |
| FR-CDNet: Unified Scene Change Detection Model across Viewpoint Variations and Different Temporal Conditions |
|
| Peng, Yilin | South China Normal University |
| Fu, Yingchun | South China Normal University |
| Li, Xiangru | South China Normal University |
| Li, Zhenhao | South China Normal University |
| Chen, Shuqi | South China Normal University |
| Ji, Shunping | Wuhan University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Computer Vision for Transportation
Abstract: Scene Change Detection (SCD) is a critical task for building smart cities, yet its practical application faces dual challenges: existing methods typically rely on temporal conditions present in the training data and the ideal assumption of small viewpoint differences. Consequently, they struggle to handle the common and significant viewpoint variations in real-world scenarios and exhibit strong sensitivity to temporal conditions, leading to drastic performance degradation under unseen temporal settings. To address these challenges, we propose the Fusion-Refinement Change Detection Network (FR-CDNet). By modeling correspondences between objects and preserving spatial prior information from ideally aligned scenes during the disentangled processing of different temporal directions, our network achieves a unified handling of varying degrees of viewpoint variations and different temporal conditions---a capability existing methods lack. Furthermore, FR-CDNet can automatically distinguish the temporal attribution of change entities to better support downstream tasks. To better evaluate performance in real-world settings, we further construct the URSCD dataset, which includes larger viewpoint differences and more diverse change scenarios. Extensive experiments demonstrate the universal scene detection capability of our method: it achieves significant improvement in F1-score on unaligned scenes while maintaining performance comparable to SOTA on aligned scenes. Ablation studies further demonstrate that the proposed framework can be migrated to enhance various mainstream models, effectively eliminating temporal condition dependency while improving overall performance.
|
| |
| 15:00-16:30, Paper ThI2I.215 | Add to My Program |
| Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation |
|
| Unmesh, Asim | Purdue University |
| Kaki, Ramesh | BITS Pilani, Hyderabad Campus |
| Jain, Rahul | Purdue University |
| Patel, Mayank | Purdue University |
| Ramani, Karthik | Purdue University |
Keywords: Deep Learning for Visual Perception, Recognition
Abstract: Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision–Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: (i) Frame–Action Embedding Similarity (FAES) matches video frames to candidate action labels, and (ii) Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding. We release code and embeddings at our project page.
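A minimal sketch of the segmentation-by-classification design: cosine similarities between frame and action-label embeddings (the FAES step), followed by a crude moving-average smoothing as a stand-in for SMTS; the window size and the smoothing scheme are assumptions, not the paper's procedure:

```python
import numpy as np

def faes_matrix(frame_emb: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity of every frame embedding against every
    candidate action-label embedding; shape (num_frames, num_labels)."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    a = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return f @ a.T

def smooth_segment(sim: np.ndarray, window: int = 5) -> np.ndarray:
    """Average similarities over a temporal window, then take the
    per-frame argmax to obtain temporally consistent action labels."""
    kernel = np.ones(window) / window
    smoothed = np.stack([np.convolve(sim[:, j], kernel, mode="same")
                         for j in range(sim.shape[1])], axis=1)
    return smoothed.argmax(axis=1)
```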
|
| |
| 15:00-16:30, Paper ThI2I.216 | Add to My Program |
| TreeIRL: Safe Urban Driving with Tree Search and Inverse Reinforcement Learning |
|
| Tomov, Momchil | Motional AD |
| Lee, Sang Uk | Motional AD |
| Hendargo, Hansford | Motional AD |
| Huh, Jinwook | Motional AD |
| Han, Teawon | Motional AD |
| Howington, Forbes | Motional AD |
| Rodrigues da Silva, Rafael | Motional AD |
| Bernasconi, Gianmarco | Motional AD |
| Heim, Marc | Motional AD |
| Findler, Samuel | Motional AD |
| Ji, Xiaonan | Motional AD |
| Boule, Alexander | Motional AD |
| Napoli, Michael | Motional AD |
| Chen, Kuo | Motional AD |
| Miller, Jesse | Motional AD |
| Floor, Boaz Cornelis | Motional AD |
| Hu, Yunqing | Motional AD |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Reinforcement Learning
Abstract: We present TreeIRL, a novel planner for autonomous driving that combines Monte Carlo tree search (MCTS) and inverse reinforcement learning (IRL) to achieve state-of-the-art performance in simulation and in real-world driving. The key idea is to use MCTS to find a promising set of safe candidate trajectories and a deep scoring function trained with IRL to select the most human-like among them. We evaluate TreeIRL against classical and state-of-the-art planners on large-scale simulations and on 500+ miles of real-world autonomous driving in the Las Vegas metropolitan area. Scenarios include navigating heavy urban traffic, adaptive cruise control, cut-ins, and traffic lights. TreeIRL achieves the best overall performance, striking a balance between safety, progress, comfort, and human-likeness. To the best of our knowledge, this is the first public-road demonstration of MCTS-based planning, and our results underscore the importance of evaluating planners across a diverse set of metrics and in real-world environments. TreeIRL is highly extensible and could be further improved with reinforcement learning and imitation learning, providing a framework for exploring different combinations of classical and learning-based approaches to solve the planning bottleneck in autonomous driving.
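At selection time, the division of labor in this abstract (MCTS for safe candidates, an IRL-trained scorer for human-likeness) reduces to a filter-then-argmax; `is_safe` and `irl_score` below are assumed callables, not the authors' API:

```python
def select_trajectory(candidates, is_safe, irl_score):
    """Keep only candidate trajectories that pass the safety filter, then
    return the one the learned scoring function rates most human-like."""
    safe = [traj for traj in candidates if is_safe(traj)]
    if not safe:
        raise RuntimeError("no safe candidate; a fallback behavior would trigger")
    return max(safe, key=irl_score)
```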
|
| |
| 15:00-16:30, Paper ThI2I.217 | Add to My Program |
| Tactile-Based Human Intent Recognition for Robot Assistive Navigation |
|
| Peng, Shaoting | University of Illinois Urbana-Champaign |
| Crowder, Dakarai | University of Illinois Urbana Champaign |
| Yuan, Wenzhen | University of Illinois |
| Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Keywords: Intention Recognition, Touch in HRI, Physically Assistive Devices
Abstract: Robot assistive navigation (RAN) is critical for enhancing the mobility and independence of the growing population of mobility-impaired individuals. However, existing systems often rely on interfaces that fail to replicate the intuitive and efficient physical communication observed between a person and a human caregiver, limiting their effectiveness. In this paper, we introduce Tac-Nav, a RAN system that leverages a cylindrical tactile skin mounted on a Stretch 3 mobile manipulator to provide a more natural and efficient interface for human navigational intent recognition. To robustly classify the tactile data, we developed the Cylindrical Kernel Support Vector Machine (CK-SVM), an algorithm that explicitly models the sensor's cylindrical geometry and is consequently robust to the natural rotational shifts present in a user's grasp. Comprehensive experiments were conducted to demonstrate the effectiveness of our classification algorithm and the overall system. Results show that CK-SVM achieved superior classification accuracy on both simulated (97.1%) and real-world (90.8%) datasets compared to four baseline models. Furthermore, a pilot study confirmed that users preferred the Tac-Nav tactile interface over conventional joystick and voice-based controls. Code and video are available at: https://sites.google.com/view/tac-nav/home.
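One generic way to make a kernel tolerant to rotational shifts of a cylindrical taxel array is to average a base RBF kernel over all circular shifts, which keeps the Gram matrix positive semidefinite. The sketch below is a simplified stand-in for the paper's CK-SVM kernel; the 8-taxel layout, gamma, and toy data are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def cyclic_rbf_kernel(X, Y, gamma=0.5):
    """Average an RBF kernel over every circular shift of the taxel
    columns, so similarity is invariant to grasp rotation."""
    n = X.shape[1]
    K = np.zeros((X.shape[0], Y.shape[0]))
    for s in range(n):
        Ys = np.roll(Y, s, axis=1)                       # rotated grasp
        d2 = ((X[:, None, :] - Ys[None, :, :]) ** 2).sum(-1)
        K += np.exp(-gamma * d2)
    return K / n

# Toy data: 8-taxel pressure rings; class 1 presses harder overall.
rng = np.random.default_rng(0)
base = np.zeros(8); base[0] = 1.0
X = np.stack([np.roll(base, rng.integers(8)) + 0.05 * rng.standard_normal(8)
              for _ in range(40)])
y = np.array([0] * 20 + [1] * 20)
X[y == 1] += 0.5
clf = SVC(kernel="precomputed").fit(cyclic_rbf_kernel(X, X), y)
print(clf.predict(cyclic_rbf_kernel(X[:3], X)))
```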
|
| |
| 15:00-16:30, Paper ThI2I.218 | Add to My Program |
| GATO: GPU-Accelerated and Batched Trajectory Optimization for Scalable Edge Model Predictive Control |
|
| Du, Alexander | Columbia University |
| Adabag, Emre | University of Michigan |
| Bravo-Palacios, Gabriel | Dartmouth College |
| Plancher, Brian | Dartmouth College |
Keywords: Optimization and Optimal Control, Software Architecture for Robotics and Automation, Control Architectures and Programming
Abstract: While Model Predictive Control (MPC) delivers strong performance across robotics applications, solving the underlying (batches of) nonlinear trajectory optimization (TO) problems online remains computationally demanding. Existing GPU-accelerated approaches either parallelize single solves, handle large batches at sub-real-time rates, or sacrifice model generality for speed. This leaves a large gap in solver performance for many state-of-the-art MPC applications that require real-time batches of tens to low-hundreds of solves. As such, we present GATO, an open source, GPU-accelerated, batched TO solver co-designed across algorithm, software, and computational hardware to deliver real-time throughput for these moderate batch size regimes. Our approach leverages a combination of block-, warp-, and thread-level parallelism within and across solves for ultra-high performance. We demonstrate the effectiveness of our approach through a combination of: simulated benchmarks showing speedups of 18-21x over CPU baselines and 1.4-16x over GPU baselines as batch size increases; case studies highlighting improved disturbance rejection and convergence behavior; and finally a validation on hardware using an industrial manipulator. We open source GATO to support reproducibility and adoption.
|
| |
| 15:00-16:30, Paper ThI2I.219 | Add to My Program |
| PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding |
|
| Che, Lirong | Tsinghua University |
| Gan, Zhenfeng | Tsinghua University |
| Chen, Yanbo | Tsinghua University |
| Tan, Junbo | Tsinghua University |
| Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School at Shenzhen, Tsinghua University, 518055 Shenzhen, China |
|
|
| |
| 15:00-16:30, Paper ThI2I.220 | Add to My Program |
| Bridging Policy and Real-World Dynamics: LLM-Augmented Rebalancing for Shared Micromobility Systems |
|
| Tan, Heng | Lehigh University |
| Yan, Hua | Lehigh University |
| Yang, Yu | Lehigh University |
Keywords: Intelligent Transportation Systems
Abstract: Shared micromobility services such as e-scooters and bikes have become an integral part of urban transportation, yet their efficiency critically depends on effective vehicle rebalancing. Existing methods either optimize for average demand patterns or employ robust optimization and reinforcement learning to handle predefined uncertainties. However, these approaches overlook emergent events (e.g., demand surges, vehicle outages, regulatory interventions) or sacrifice performance in normal conditions. We introduce AMPLIFY, an LLM-augmented policy adaptation framework for shared micromobility rebalancing. The framework combines a baseline rebalancing module with an LLM-based adaptation module that adjusts strategies in real time under emergent scenarios. The adaptation module ingests system context, demand predictions, and baseline strategies, and refines adjustments through self-reflection. Evaluations on real-world e-scooter data from Chicago show that our approach improves demand satisfaction and system revenue compared to baseline policies, highlighting the potential of LLM-driven adaptation as a flexible solution for managing uncertainty in micromobility systems.
|
| |
| 15:00-16:30, Paper ThI2I.221 | Add to My Program |
| HoMeR: Learning In-The-Wild Mobile Manipulation Via Hybrid Imitation and Whole-Body Control |
|
| Sundaresan, Priya | Stanford University |
| Malhotra, Rhea | Stanford University |
| Miao, Phillip | Stanford University |
| Yang, Jingyun | Stanford University |
| Wu, Jimmy | Princeton University |
| Hu, Hengyuan | Stanford University |
| Antonova, Rika | Stanford University |
| Engelmann, Francis | Stanford University |
| Sadigh, Dorsa | Stanford University |
| Bohg, Jeannette | Stanford University |
Keywords: Imitation Learning, Mobile Manipulation, Learning from Demonstration
Abstract: We introduce HoMeR, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HoMeR learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HoMeR on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HoMeR to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HoMeR achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next best baseline by 29.17% on average. HoMeR is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HoMeR moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable manipulation in everyday indoor spaces. Code, videos, and supplementary material are available at: https://homer-manip.github.io/.
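The hybrid action space amounts to two interpretations of the same policy output, sketched below with poses as 4x4 homogeneous transforms; the mode names follow the abstract, while the composition convention is an assumption:

```python
import numpy as np

def apply_hybrid_action(mode: str, action: np.ndarray,
                        current: np.ndarray) -> np.ndarray:
    """Interpret a hybrid policy output as a target end-effector pose:
    'absolute' actions are full SE(3) targets (long-range motion handled
    by the whole-body controller), 'relative' actions are small deltas
    composed onto the current pose (fine-grained manipulation)."""
    if mode == "absolute":
        return action
    if mode == "relative":
        return current @ action
    raise ValueError(f"unknown action mode: {mode}")
```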
|
| |
| 15:00-16:30, Paper ThI2I.222 | Add to My Program |
| On the Conic Complementarity of Planar Contacts |
|
| de Mont-Marin, Yann | Inria, DI ENS |
| Montaut, Louis | INRIA (Paris) - CIIRC (Prague) |
| Hebert, Martial | CMU |
| Ponce, Jean | Ecole Normale Supérieure |
| Carpentier, Justin | INRIA |
Keywords: Dynamics, Optimization and Optimal Control, Simulation and Animation
Abstract: We present a unifying theoretical result that connects two foundational principles in robotics: the Signorini law for point contacts, which underpins many simulation methods for preventing object interpenetration, and the center of pressure (also known as the zero-moment point), a key concept in optimization-based locomotion control. Our contribution is the planar Signorini condition, a conic complementarity formulation that models general planar contacts between rigid bodies. We prove that this formulation is equivalent to enforcing the punctual Signorini law across an entire contact surface, thereby bridging the gap between discrete and continuous contact models. A geometric interpretation reveals that the framework naturally captures three physical regimes —sticking, separating, and tilting— within a unified complementarity structure. This leads to a principled extension of the classical center of pressure, which we refer to as the extended center of pressure. By establishing this connection, our work provides a mathematically consistent and computationally tractable foundation for handling planar contacts, with implications for both the accurate simulation of contact dynamics and the design of next-generation control and optimization algorithms in locomotion and manipulation.
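For reference, the punctual Signorini law that the paper lifts to planar contacts is the standard complementarity between the normal gap g and the normal contact force lambda_n; the conic, surface-level generalization is the paper's contribution and is not reproduced here:

```latex
% Punctual Signorini law at a contact point: gap and normal force are
% nonnegative and complementary (at most one is strictly positive).
0 \le g \;\perp\; \lambda_n \ge 0
\quad\Longleftrightarrow\quad
g \ge 0, \quad \lambda_n \ge 0, \quad g\,\lambda_n = 0
```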
|
| |
| 15:00-16:30, Paper ThI2I.223 | Add to My Program |
| Relative Position Matters: Trajectory Prediction and Planning with Polar Representation |
|
| Zhang, Bozhou | Fudan University |
| Song, Nan | Fudan University |
| Gao, Bingzhao | Tongji University |
| Zhang, Li | Fudan University |
Keywords: Autonomous Vehicle Navigation, Computer Vision for Transportation, Visual Learning
Abstract: Trajectory prediction and planning in autonomous driving are highly challenging due to the complexity of predicting surrounding agents' movements and planning the ego agent's actions in dynamic environments. Existing methods encode map and agent positions and decode future trajectories in Cartesian coordinates. However, modeling the relationships between the ego vehicle and surrounding traffic elements in Cartesian space can be suboptimal, as it does not naturally capture the varying influence of different elements based on their relative distances and directions. To address this limitation, we adopt the Polar coordinate system, where positions are represented by radius and angle. This representation provides a more intuitive and effective way to model spatial changes and relative relationships, especially in terms of distance and directional influence. Based on this insight, we propose Polaris, a novel method that operates entirely in Polar coordinates, distinguishing itself from conventional Cartesian-based approaches. By leveraging the Polar representation, this method explicitly models distance and direction variations and captures relative relationships through dedicated encoding and refinement modules, enabling more structured and spatially aware trajectory prediction and planning. Extensive experiments on the challenging prediction (Argoverse 2) and planning benchmarks (nuPlan) demonstrate that Polaris achieves state-of-the-art performance.
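The coordinate change at the heart of this method is straightforward; a minimal sketch is below, where the angle-wrapping convention is an assumption:

```python
import numpy as np

def to_ego_polar(positions_xy: np.ndarray, ego_xy: np.ndarray,
                 ego_yaw: float) -> np.ndarray:
    """Convert Cartesian positions into an ego-centric Polar
    representation: (radius, angle) relative to the ego vehicle's
    position and heading."""
    rel = positions_xy - ego_xy
    radius = np.hypot(rel[:, 0], rel[:, 1])
    angle = np.arctan2(rel[:, 1], rel[:, 0]) - ego_yaw
    angle = (angle + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
    return np.stack([radius, angle], axis=1)
```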
|
| |
| 15:00-16:30, Paper ThI2I.224 | Add to My Program |
| GSRender: Deduplicated Occupancy Estimation Via Weakly Supervised 3D Gaussian Splatting |
|
| Sun, Qianpu | Tsinghua University |
| Zhou, Sifan | Southeast University |
| Changyong, Shu | Beihang University |
| Han, Sirui | Hong Kong University of Science and Technology |
| Yuan, Chun | Tsinghua University |
Keywords: Data Sets for Robotic Vision, Recognition, Visual Learning
Abstract: Weakly-supervised 3D occupancy perception is crucial for vision-based autonomous driving in outdoor environments. Previous methods based on NeRF often face a challenge in balancing the number of samples used. Too many samples can decrease efficiency, while too few can compromise accuracy, leading to variations in the mean Intersection over Union (mIoU) by 5-10 points. Furthermore, even with surrounding-view image inputs, only a single image is rendered from each viewpoint at any given moment. This limitation leads to duplicated predictions, which significantly impacts the practicality of the approach. However, this issue has largely been overlooked in existing research. To address this, we propose GSRender, which uses 3D Gaussian Splatting for weakly-supervised occupancy estimation, simplifying the sampling process. Additionally, we introduce the Ray Compensation module, which reduces duplicated predictions by compensating for features from adjacent frames. Finally, we redesign the dynamic loss to remove the influence of dynamic objects from adjacent frames. Extensive experiments show that our approach achieves SOTA results in RayIoU (+6.0), while also narrowing the gap with 3D-supervised methods. This work lays a solid foundation for weakly-supervised occupancy perception. The code will be released soon.
|
| |
| 15:00-16:30, Paper ThI2I.225 | Add to My Program |
| Active Next-Best-View Optimization for Risk-Averse Path Planning |
|
| Mollaei Khass, Amirhossein | Lehigh University |
| Liu, Guangyi | Amazon |
| Pandey, Vivek | Lehigh University |
| Jiang, Wen | University of Pennsylvania |
| Lei, Boshu | University of Pennsylvania |
| Daniilidis, Kostas | University of Pennsylvania |
| Motee, Nader | Lehigh University |
Keywords: View Planning for SLAM, RGB-D Perception, Planning under Uncertainty
Abstract: Safe navigation in uncertain environments requires planning methods that integrate risk aversion with active perception. In this work, we present a unified framework that refines a coarse reference path by constructing tail-sensitive risk maps from Average Value-at-Risk statistics on an online-updated 3D Gaussian-splat Radiance Field. These maps enable the generation of locally safe and feasible trajectories. In parallel, we formulate Next-Best-View (NBV) selection as an optimization problem on the SE(3) pose manifold, where Riemannian gradient descent maximizes an expected information gain objective to reduce uncertainty most critical for imminent motion. Our approach advances the state-of-the-art by coupling risk-averse path refinement with NBV planning, while introducing scalable gradient decompositions that support efficient online updates in complex environments. We demonstrate the effectiveness of the proposed framework through extensive computational studies.
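The Average Value-at-Risk statistic behind the risk maps has a simple sample estimate: the mean of the worst (1 - alpha) fraction of costs, also known as CVaR or expected shortfall. A minimal sketch, with alpha as an assumed parameter:

```python
import numpy as np

def average_value_at_risk(cost_samples: np.ndarray, alpha: float = 0.95) -> float:
    """Tail-sensitive risk: mean of the costs at or above the
    alpha-quantile (Value-at-Risk) threshold."""
    var = np.quantile(cost_samples, alpha)
    return float(cost_samples[cost_samples >= var].mean())
```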
|
| |
| 15:00-16:30, Paper ThI2I.226 | Add to My Program |
| VeriGraph: Scene Graphs for Execution Verifiable Robot Planning |
|
| Ekpo, Daniel | University of Maryland, College Park |
| Levy, Mara | University of Maryland, College Park |
| Suri, Saksham | University of Maryland |
| Huynh, Chuong | University of Maryland, College Park |
| Swaminathan, Archana | University of Maryland, College Park |
| Shrivastava, Abhinav | University of Maryland, College Park |
Keywords: Task and Motion Planning, Visual Learning
Abstract: Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs’ tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
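Stripped of the VLM components, the verification loop reduces to a STRIPS-style precondition check against the scene graph. The sketch below is a generic stand-in with hypothetical predicates and action models, not VeriGraph's actual representation:

```python
def verify_plan(scene_graph: set, plan, action_models):
    """Check each action's preconditions against the evolving scene graph
    and apply its effects; return the first infeasible step and the
    missing facts, or None if the whole plan is executable."""
    state = set(scene_graph)
    for i, action in enumerate(plan):
        pre, add, delete = action_models[action]
        if not pre <= state:
            return i, sorted(pre - state)
        state = (state - delete) | add
    return None

action_models = {
    "pick(cup)": ({("clear", "cup")}, {("held", "cup")}, {("on", "cup", "table")}),
    "place(cup, shelf)": ({("held", "cup")}, {("on", "cup", "shelf")}, {("held", "cup")}),
}
graph = {("clear", "cup"), ("on", "cup", "table")}
print(verify_plan(graph, ["pick(cup)", "place(cup, shelf)"], action_models))  # None
```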
|
| |
| 15:00-16:30, Paper ThI2I.227 | Add to My Program |
| Learn to Quantify Social Interaction with Constraints for Pedestrian Walking |
|
| Shi, Xiaodan | Stockholm University |
Keywords: Motion and Path Planning, Deep Learning Methods, Intention Recognition
Abstract: Long-term human path forecasting in crowds is critical for autonomous moving platforms (such as autonomous driving cars and social robots) to avoid collisions and make high-quality plans. Although current research takes social interactions into account for prediction, it does not reveal the exact kinds of social interactions that occur among people or how these interactions affect the decision-making process of pedestrians, which further limits its robustness. Social interactions in pedestrian walking are massive, intuitive, and hard to label and quantify. In this paper, we explore how to quantify and interpret how pedestrians interact with others by proposing Learn to Cluster. Our clustering of social interactions is a probabilistic latent-variable generative approach that learns directly from sequential trajectory observations and scales to an arbitrary number of pedestrians. Learn to Cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables then serve as 'labels' to categorize social interactions. Extensive experiments over several trajectory prediction benchmarks demonstrate that our method is able to learn the patterns of social interactions and effectively integrate these patterns into pedestrian trajectory prediction.
|
| |
| 15:00-16:30, Paper ThI2I.228 | Add to My Program |
| Towards Learning Boulder Excavation with Hydraulic Excavators |
|
| Gruetter, Jonas | ETH Zurich |
| Terenzi, Lorenzo | ETHZ |
| Egli, Pascal Arturo | RSL, ETHZ |
| Hutter, Marco | ETH Zurich |
Keywords: Robotics and Automation in Construction, Field Robots, Reinforcement Learning
Abstract: Construction sites frequently require removing large rocks before excavation or grading can proceed. Human operators typically extract these boulders using only standard digging buckets, avoiding time-consuming tool changes to specialized grippers. This task demands manipulating irregular objects with unknown geometries in harsh outdoor environments where dust, variable lighting, and occlusions hinder perception. The excavator must adapt to varying soil resistance—dragging along hard-packed surfaces or penetrating soft ground—while coordinating multiple hydraulic joints to secure rocks using a shovel. Current autonomous excavation focuses on continuous media (soil, gravel) or uses specialized grippers with detailed geometric planning for discrete objects. These approaches either cannot handle large irregular rocks or require impractical tool changes that interrupt workflow. We train a reinforcement learning policy in simulation using rigid-body dynamics and analytical soil models. The policy processes sparse LiDAR points (just 20 per rock) from vision-based segmentation and proprioceptive feedback to control standard excavator buckets. The learned agent discovers different strategies based on soil resistance: dragging along the surface in hard soil and penetrating directly in soft conditions. Field tests on a 12-ton excavator achieved 70% success across varied rocks (0.4-0.7 m) and soil types, compared to 83% for human operators. This demonstrates that standard construction equipment can learn complex manipulation despite sparse perception and challenging outdoor conditions.
|
| |
| 15:00-16:30, Paper ThI2I.229 | Add to My Program |
| MOTCues: 3D Multi-Object Tracking with Birth Prior and Shape Description Informed by Point Cloud Cues |
|
| Lee, Hanyeol | Seoul National University |
| Choe, Yeongkwon | Kangwon National University |
| Kim, Taeyoon | Seoul National University |
| Park, Chan Gook | Seoul National University |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation
Abstract: Reliable multi-object tracking (MOT) is essential for autonomous systems but remains challenging due to ambiguous object characteristics such as birth, death, and motion models, as well as detector errors including false detections and missed objects. Random finite set (RFS) theory provides a rigorous mathematical foundation that enables the formulation of fundamental uncertainties in object estimation under the Bayesian framework. We propose MOTCues, a MOT algorithm built on the RFS-based Poisson multi-Bernoulli filter, which integrates informative components derived from point cloud cues into the estimator as a tailored formulation. The object birth intensity function is modeled with a Gaussian mixture distribution for effective initialization of new-born objects, while object shape information is captured by constructing bounding box-centric descriptors to enhance hypothesis management. Evaluations on the KITTI dataset and the nuScenes benchmark demonstrate that integrating point cloud cues improves tracking performance by reducing ID switches, achieving superior results compared to baseline model-based trackers in real-world object tracking scenarios.
|
| |
| 15:00-16:30, Paper ThI2I.230 | Add to My Program |
| Optimizing Vehicle Trajectories at a Signalized Intersection in Mixed Traffic |
|
| Peng, Cheng | Chang'an University |
| Wang, Jiaping | Chang'an University |
| Liu, Pengchao | Chang'an University |
| Wang, Zhen | Chang'an University |
| Zhao, Xiangmo | School of Information Engineering, Chang'an University |
Keywords: Intelligent Transportation Systems, Field Robots, Automation Technologies for Smart Cities
Abstract: With the advancement of connected and automated vehicles (CAVs), achieving accurate vehicle trajectory prediction and optimal control has become a critical challenge for improving the efficiency and safety of mixed traffic flow. However, due to the complex dynamic interactions between CAVs and human-driven vehicles (HVs) and the nonlinear nature of signal coordination, existing studies lack comprehensive consideration of CAV position adjustments within the platoon and their guidance effects on trailing HVs. This paper proposes a data-driven method for CAV state prediction and trajectory optimization. Employing an application-specific improved Informer model, our method accurately predicts CAV arrival states at a signalized intersection in mixed traffic. Additionally, Bayesian optimization (BO) is utilized to achieve automated and rapid tuning of CAV model predictive control (MPC) parameters through learning human driving characteristics. Experimental results demonstrate that our proposed method significantly enhances overall traffic efficiency and optimization when CAVs operate within mixed traffic, showing strong feasibility and adaptability.
|
| |
| 15:00-16:30, Paper ThI2I.231 | Add to My Program |
| Robust Hand Tracking from Visual-Inertial Fusion |
|
| Choi, Hyelim | Seoul National University |
| Park, Hyunreal | Seoul National University |
| Ji, Harim | Seoul National University |
| Lee, Somang | Seoul National University |
| Lee, Youngseon | Seoul National University |
| Lee, Yongseok | Daegu Gyeongbuk Institute of Science and Technology |
| Lee, Dongjun | Seoul National University |
Keywords: Sensor Fusion, Human Detection and Tracking, Multifingered Hands
Abstract: Hand tracking plays a key role in capturing and transferring dexterous human manipulation skills to robots. However, achieving reliable tracking across diverse conditions and during complex interactions (e.g., object manipulation) remains challenging. A promising solution is to combine wearable sensors such as IMUs with vision, where previous studies have handled the vision input by attaching markers to wearables or by relying on depth data to avoid the domain gap in color images. In this work, we present a hand tracking framework that fuses inertial measurements with state-of-the-art vision methods, eliminating the need for markers while fully exploiting visual cues. For this, we introduce a dataset generation scheme that produces synthetic and real data for the target glove using a compact setup, without manual annotation. Using the dataset, we train a keypoint detection network, built on a lightweight vision transformer (ViT) for real-time use, that predicts keypoint likelihoods in an image. Based on the network prediction, the IMU-propagated pose is used as a prior in probabilistic inference to estimate the keypoint positions and uncertainties. Tracking primarily relies on high-rate IMU updates for fast motion estimation, while the pose is corrected through factor graph optimization. The framework is validated in challenging scenarios, demonstrating its robustness and accuracy, and can be used for high-quality demonstration data acquisition and teleoperation for dexterous manipulation.
|
| |
| 15:00-16:30, Paper ThI2I.232 | Add to My Program |
| Learning Flexible Job Shop Scheduling under Limited Buffers and Material Kitting Constraints |
|
| Zhang, Shishun | National University of Defense Technology |
| Xu, Juzhan | Shenzhen University |
| Fan, Yidan | National University of Defense Technology |
| Zhu, Chenyang | National University of Defense Technology |
| Hu, Ruizhen | Shenzhen University |
| Wang, Yongjun | National University of Defense Technology |
| Xu, Kai | Institute of AI for Industries, Chinese Academy of Sciences |
Keywords: Intelligent and Flexible Manufacturing
Abstract: The Flexible Job Shop Scheduling Problem (FJSP) originates from real production lines, while some practical constraints are often ignored or idealized in current FJSP studies, among which the limited buffer problem has a particular impact on production efficiency. To this end, we study an extended problem that is closer to practical scenarios—the Flexible Job Shop Scheduling Problem with Limited Buffers and Material Kitting. In recent years, deep reinforcement learning (DRL) has demonstrated considerable potential in scheduling tasks. However, its capacity for state modeling remains limited when handling complex dependencies and long-term constraints. To address this, we leverage a heterogeneous graph network within the DRL framework to model the global state. By constructing efficient message passing among machines, operations, and buffers, the network focuses on avoiding decisions that may cause frequent pallet changes during long-sequence scheduling, thereby helping improve buffer utilization and overall decision quality. Experimental results on both synthetic and real production line datasets show that the proposed method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes, and also achieves a good balance between solution quality and computational cost. Furthermore, a supplementary video is provided to showcase a simulation system that effectively visualizes the progression of the production line.
|
| |
| 15:00-16:30, Paper ThI2I.233 | Add to My Program |
| Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language |
|
| Hwang, Minyoung | Massachusetts Institute of Technology |
| Forsey-Smerek, Alexandra | Massachusetts Institute of Technology |
| Dennler, Nathaniel | Massachusetts Institute of Technology |
| Bobu, Andreea | MIT |
Keywords: Learning from Demonstration, Imitation Learning, Human-Centered Robotics
Abstract: Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language.
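One generic way to enforce invariance to irrelevant state components, in the spirit of this abstract, is to resample the masked-out dimensions during reward training so the reward model cannot rely on them; the sketch is illustrative, with the relevance mask assumed to come from the LLM stage:

```python
import numpy as np

def invariance_augment(states: np.ndarray, relevance_mask: np.ndarray,
                       rng: np.random.Generator,
                       noise_scale: float = 1.0) -> np.ndarray:
    """Keep relevant dimensions (mask = 1) and replace irrelevant ones
    (mask = 0) with noise, so a reward trained on augmented copies is
    invariant to the masked-out components."""
    noise = rng.standard_normal(states.shape) * noise_scale
    return states * relevance_mask + noise * (1 - relevance_mask)
```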
|
| |
| 15:00-16:30, Paper ThI2I.234 | Add to My Program |
| Using Non-Expert Data to Robustify Imitation Learning Via Offline Reinforcement Learning |
|
| Huang, Kevin | University of Washington |
| Scalise, Rosario | University of Washington |
| Winston, Cleah | University of Washington |
| Agrawal, Ayush | University of Washington |
| Zhang, Yunchu | University of Washington |
| Baijal, Rohan | University of Washington |
| Grotz, Markus | University of Washington (UW) |
| Boots, Byron | University of Washington |
| Burchfiel, Benjamin | Toyota Research Institute |
| Dai, Hongkai | Toyota Research Institute |
| Itkina, Masha | Stanford University |
| Shah, Paarth | University of Oxford |
| Gupta, Abhishek | University of Washington |
Keywords: Imitation Learning, Reinforcement Learning, Deep Learning in Grasping and Manipulation
Abstract: Imitation learning has proven effective for training robots to perform complex tasks from expert human demonstrations. However, it remains limited by its reliance on high-quality, task-specific data, restricting adaptability to the diverse range of real-world object configurations and scenarios. In contrast, non-expert data---such as play data, suboptimal demonstrations, partial task completions, or rollouts from suboptimal policies---can offer broader coverage and lower collection costs. However, conventional imitation learning approaches fail to utilize this data effectively. To address these challenges, we posit that with the right design decisions, offline reinforcement learning can be used as a tool to harness non-expert data to enhance the performance of imitation learning policies. We show that while standard offline RL approaches can be ineffective at actually leveraging non-expert data under the sparse data coverage settings typically encountered in the real world, simple algorithmic modifications can allow for the utilization of this data, without significant additional assumptions. Our approach shows that broadening the support of the policy distribution can allow imitation algorithms augmented by offline RL to solve tasks robustly, with considerably enhanced recovery and generalization behavior. In manipulation tasks, these innovations significantly increase the range of initial conditions where learned policies are successful when non-expert data is incorporated. Moreover, we show that these methods are able to leverage all collected data, including partial or suboptimal demonstrations, to bolster task-directed policy performance. This underscores the importance of algorithmic techniques for using non-expert data for robust policy learning in robotics.
|
| |
| 15:00-16:30, Paper ThI2I.235 | Add to My Program |
| Adversarial Game-Theoretic Algorithm for Dexterous Grasp Synthesis |
|
| Chen, Yu | Carnegie Mellon University |
| He, Botao | University of Maryland |
| Mao, Yuemin | Carnegie Mellon University |
| Jakobsson, Arthur | Carnegie Mellon University |
| Ke, Jeffrey | Carnegie Mellon University |
| Aloimonos, Yiannis | University of Maryland |
| Shi, Guanya | Carnegie Mellon University |
| Choset, Howie | Carnegie Mellon University |
| Mao, Jiayuan | University of Pennsylvania |
| Ichnowski, Jeffrey | Carnegie Mellon University |
Keywords: Grasping, Multifingered Hands, Dexterous Manipulation
Abstract: For many complex tasks, multi-finger robot hands are poised to revolutionize how we interact with the world, but reliably grasping objects remains a significant challenge. We focus on the problem of synthesizing grasps for multi-finger robot hands that, given a target object's geometry and pose, computes a hand configuration. Existing approaches often struggle to produce reliable grasps that sufficiently constrain object motion, leading to instability under disturbances and failed grasps. A key reason is that during grasp generation, they typically focus on resisting a single wrench, while ignoring the object's potential for adversarial movements, such as escaping. We propose a new grasp-synthesis approach that explicitly captures and leverages the adversarial object motion in grasp generation by formulating the problem as a two-player game. One player controls the robot to generate feasible grasp configurations, while the other adversarially controls the object to seek motions that attempt to escape from the grasp. Simulation experiments on various robot platforms and target objects show that our approach achieves a success rate of 75.78%, up to 19.61% higher than the state-of-the-art baseline. The two-player game mechanism improves the grasping success rate by 27.40% over the method without the game formulation. Our approach requires only 0.28-1.04 seconds on average to generate a grasp configuration, depending on the robot platform, making it suitable for real-world deployment. In real-world experiments, our approach achieves an average success rate of 85.0% on ShadowHand and 87.5% on LeapHand, which confirms its feasibility and effectiveness in real robot setups. Code is publicly available at https://github.com/Neuling-jpg/Game4Grasp.
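The two-player structure can be sketched as alternating ascent on each player's own objective; below is a numerical-gradient toy version with assumed objective callables, far simpler than the paper's formulation:

```python
import numpy as np

def two_player_grasp(q, x, grasp_obj, escape_obj, steps=100, lr=1e-2, eps=1e-3):
    """Alternating best-response sketch: hand parameters q climb the grasp
    objective while object state x climbs its escape objective. Both q
    and x are 1-D numpy arrays; gradients are central differences."""
    def num_grad(f, z):
        g = np.zeros_like(z)
        for i in range(len(z)):
            dz = np.zeros_like(z); dz[i] = eps
            g[i] = (f(z + dz) - f(z - dz)) / (2 * eps)
        return g
    for _ in range(steps):
        q = q + lr * num_grad(lambda qq: grasp_obj(qq, x), q)   # robot move
        x = x + lr * num_grad(lambda xx: escape_obj(q, xx), x)  # adversary move
    return q, x
```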
|
| |
| 15:00-16:30, Paper ThI2I.236 | Add to My Program |
| Temporal Action Representation Learning for Tactical Resource Control and Subsequent Maneuver Generation |
|
| Jung, Hoseong | Seoul National University |
| Son, Sungil | Seoul National University |
| Cho, Daesol | Georgia Institute of Technology |
| Park, Jonghae | Seoul National University |
| Choi, Changhyun | Seoul National University |
| Kim, H. Jin | Seoul National University |
Keywords: Reinforcement Learning, Robotics in Under-Resourced Settings, Autonomous Agents
Abstract: Autonomous robotic systems should reason about resource control and its impact on subsequent maneuvers, especially when operating with limited energy budgets or restricted sensing. Learning-based control is effective in handling complex dynamics and represents the problem as a hybrid action space unifying discrete resource usage and continuous maneuvers. However, prior works on hybrid action space have not sufficiently captured the causal dependencies between resource usage and maneuvers. They have also overlooked the multi-modal nature of tactical decisions, both of which are critical in fast-evolving scenarios. In this paper, we propose TART, a Temporal Action Representation learning framework for Tactical resource control and subsequent maneuver generation. TART leverages contrastive learning based on a mutual information objective, designed to capture inherent temporal dependencies in resource-maneuver interactions. These learned representations are quantized into discrete codebook entries that condition the policy, capturing recurring tactical patterns and enabling multi-modal and temporally coherent behaviors. We evaluate TART in two domains where resource deployment is critical: (i) a maze navigation task where a limited budget of discrete actions provides enhanced mobility, and (ii) a high-fidelity air combat simulator in which an F-16 agent operates weapons and defensive systems in coordination with flight maneuvers. Across both domains, TART consistently outperforms hybrid-action baselines, demonstrating its effectiveness in leveraging limited resources and producing context-aware subsequent maneuvers.
|
| |
| 15:00-16:30, Paper ThI2I.237 | Add to My Program |
| Efficient Frontier-Sampling-Mixed Autonomous Exploration Using Environmental Complexity |
|
| Lu, Liang | Huazhong University of Science and Technology |
| Xiang, Ming | Huazhong University of Science and Technology |
| Tang, Dongyang | Huazhong University of Science and Technology |
| Yan, Zefeng | Shanxi University |
| Wang, Hao | Huazhong University of Science and Technology |
| Han, Bin | Huazhong University of Science and Technology |
Keywords: Motion and Path Planning, Aerial Systems: Perception and Autonomy, Aerial Systems: Applications
Abstract: When exploring complex unknown environments, unmanned aerial vehicles (UAVs) often experience reduced efficiency and robustness due to unevenly distributed occlusions. This paper proposes an efficient hybrid autonomous exploration algorithm that adapts to environmental complexity, enabling effective frontier detection and viewpoint sampling to minimize overall exploration time. We introduce a frontier detection method based on a limited field of view (FOV), along with a unique ID-based frontier management mechanism, which ensures detection completeness while significantly reducing computational and memory overhead. Furthermore, an adaptive sampling strategy incorporating environmental complexity is introduced. By adaptively switching sampling modes and relaxing obstacle-free sphere generation constraints, the method improves both sampling efficiency and visibility evaluation performance. For path planning, a hierarchical planner based on a topological graph is constructed. It jointly optimizes global coverage paths and local frontier information to generate smooth and time-optimal trajectories. Both simulation and real-world experiments validate the advantages of the proposed approach in terms of exploration efficiency, computational overhead, and coverage rate.
|
| |
| 15:00-16:30, Paper ThI2I.238 | Add to My Program |
| Puzzle Piece Robots: Inverse-Designed Shape-Morphing Docking for Spherically Reconfigurable Soft Robots |
|
| Conzola, Justin | The University of Alabama |
| Vikas, Vishesh | University of Alabama |
Keywords: Soft Robot Materials and Design, Soft Sensors and Actuators, Multi-Robot Systems
Abstract: Modularity in robots enhances versatility, enabling shape morphing and reconfiguration. In modular soft robots, the use of soft materials allows dimensional transformations across different architectures - from chains (1D) to lattices (2D) and spheres (3D). All this enables a swarm of robots to exhibit multi-modal locomotion - such as millipede-like, starfish-like, and soccer-ball-like movement patterns. However, achieving such reconfiguration remains challenging, especially in soft robots, where docking is difficult to realize without compromising compliance. Conventional approaches - such as rigid inserts, magnetic actuators, and adhesives - face challenges due to rigid–soft fabrication mismatch, interference with body compliance and limited holding strength. To address these challenges, this work proposes a geometric, active shape-morphing docking mechanism for spherically reconfigurable soft robots that combines concepts of topology design and mechanical metamaterials. The robot module edges are designed to create geometric interlocks between adjacent edges (similar to jigsaw puzzle pieces) with an internal structure that deforms under actuation by inlaid shape memory alloy (SMA) wires. The metamaterial internal structure is obtained through inverse design optimization of a computational deformation model created in Abaqus CAE. The constraint-aware optimization strategy blends random search and genetic algorithm features to handle a large number of bounded variables and a nonlinear objective function, driving convergence toward a global minimum via geometric decay of the search space. The resulting optimal geometry is designed to buckle under high localized forces, enabling docking and undocking, while remaining minimally deformed under distributed forces, thereby passively maintaining coupling during operation. The docking mechanism is experimentally validated by confirming that the deformation achieved under actuation can facilitate the docking operation and th
|
| |
| 15:00-16:30, Paper ThI2I.239 | Add to My Program |
| Task Scheduling Optimization for Multi-Human Multi-Robot Collaborative Remanufacturing |
|
| Herrera, Emilio | Montclair State University |
| Wang, Weitian | Montclair State University |
Keywords: Human-Centered Automation, Planning, Scheduling and Coordination, Human-Centered Robotics
Abstract: The increasing proliferation and evolution of robotics and its capabilities are having a significant impact on smart manufacturing and remanufacturing. Within the popular frameworks of Industry 4.0 and Industry 5.0, human-robot collaboration (HRC) has emerged to integrate the best capabilities of humans, such as their problem solving, with those of robots, such as their precision. These systems continue to scale rapidly and are beginning to introduce multi-human multi-robot collaboration (MHMRC) environments, offering a greater degree of productivity and flexibility. Both HRC and MHMRC still face underexplored challenges, such as task allocation and scheduling. In this study, we propose a nature-inspired, objective function-constrained task scheduling optimization solution for multi-human multi-robot collaborative remanufacturing. Different objective functions for the Dingo Optimization Algorithm are developed to investigate how human participants perceive task assignments and interpret the disassembly process under varying objectives in MHMRC. We conduct a real-world multi-human multi-robot collaborative remanufacturing user study in which participants disassemble an end-of-life desktop computer in a shared workspace with two robots to test and validate the proposed approach. Participants are surveyed using the NASA-TLX, along with additional questions. Experimental results demonstrate the effectiveness of the developed approach, and directions for future work are also discussed.
|
| |
| 15:00-16:30, Paper ThI2I.240 | Add to My Program |
| Flip Stunts on Bicycle Robots Using Iterative Motion Imitation |
|
| Kim, Jeonghwan | Georgia Institute of Technology |
| Fahmi, Shamel | Robotics and AI Institute |
| Rho, Seungeun | Georgia Institute of Technology |
| Ha, Sehoon | Georgia Institute of Technology |
| Nelson, Gabriel | Robotics and AI Institute |
Keywords: Wheeled Robots, Reinforcement Learning, Imitation Learning
Abstract: This work demonstrates a front-flip on bicycle robots via reinforcement learning, particularly by imitating reference motions that are infeasible and imperfect. To address this, we propose Iterative Motion Imitation (IMI), a method that iteratively imitates trajectories generated by prior policy rollouts. Starting from an initial reference that is kinematically or dynamically infeasible, IMI helps train policies that lead to feasible and agile behaviors. We demonstrate our method on Ultra-Mobility Vehicle (UMV), a bicycle robot that is designed to enable agile behaviors. From a self-colliding table-to-ground flip reference generated by a model-based controller, we are able to train policies that enable ground-to-ground and ground-to-table front-flips. We show that compared to a single-shot motion imitation, IMI results in policies with higher success rates and can transfer robustly to the real world. To our knowledge, this is the first unassisted acrobatic flip behavior on such a platform.
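At its core, IMI is a short loop: imitate the current reference, roll out the learned policy, and adopt that rollout as the next reference, so an infeasible initial reference is progressively replaced by feasible ones. A minimal sketch with assumed `train_policy` and `rollout` callables:

```python
def iterative_motion_imitation(reference, train_policy, rollout, iters=5):
    """Each iteration imitates the current reference, then replaces it
    with the trajectory the trained policy actually executes."""
    policy = None
    for _ in range(iters):
        policy = train_policy(reference)
        reference = rollout(policy)
    return policy
```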
|
| |
| 15:00-16:30, Paper ThI2I.241 | Add to My Program |
| Beyond Reference Trajectories: A Waypoint-Based Model Predictive Path Integral Control for Agile Drone Racing |
|
| Zhao, Fangguo | Zhejiang University |
| Guan, Xin | Zhejiang University |
| Li, Shuo | Zhejiang University |
Keywords: Motion and Path Planning, Integrated Planning and Control
Abstract: While model-based controllers have demonstrated remarkable performance in autonomous drone racing, their performance is often constrained by the reliance on pre-computed reference trajectories. Conventional approaches, such as trajectory tracking, demand a dynamically feasible, full-state reference, whereas contouring control relaxes this requirement to a geometric path but still necessitates a reference. Recent advancements in reinforcement learning (RL) have revealed that many model-based controllers optimize surrogate objectives, such as trajectory tracking, rather than the primary racing goal of directly maximizing progress through gates. Inspired by these findings, this work introduces a reference-free method for time-optimal racing by incorporating this gate progress objective, derived from RL reward shaping, directly into the Model Predictive Path Integral (MPPI) formulation, which only depends on waypoint positions. The sampling-based nature of MPPI makes it uniquely capable of optimizing the discontinuous and non-differentiable objective in real-time. We also establish an empirical testbed that leverages MPPI to systematically and fairly compare three distinct objective functions with a consistent dynamics model and parameter set: classical trajectory tracking, contouring control, and the proposed gate progress objective. We compare the performance of these three objectives when solved via both MPPI and a traditional gradient-based solver. Our results demonstrate that the proposed reference-free approach achieves competitive racing performance, rivaling or exceeding reference-based methods.
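MPPI's tolerance for discontinuous, non-differentiable objectives comes from its update rule, which only evaluates rollout costs and never takes gradients. A standard-form sketch, with the temperature `lam` as an assumed hyperparameter:

```python
import numpy as np

def mppi_update(costs: np.ndarray, noise: np.ndarray,
                u_nominal: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Exponentially weight K sampled rollouts by cost and average their
    control perturbations into the nominal sequence. costs: (K,),
    noise: (K, T, m), u_nominal: (T, m). The gate-progress objective
    enters only through `costs`."""
    w = np.exp(-(costs - costs.min()) / lam)   # softmin weights, stable
    w /= w.sum()
    return u_nominal + np.tensordot(w, noise, axes=1)
```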
|
| |
| 15:00-16:30, Paper ThI2I.242 | Add to My Program |
| Mapping Pamir: Multi-Session Visual/Inertial SLAM and 3D Reconstruction of an Underwater Shipwreck |
|
| Chatzispyrou, Michalis | University of Delaware |
| Horgan, Luke | Stevens Institute of Technology |
| Hwang, Hyunkil | Autonomous Field Robotics Laboratory, University of Delaware |
| Sathishchandra, Harish | Stevens Institute of Technology |
| Burgul, Chinmay | University of Delaware |
| Roznere, Monika | Binghamton University |
| Quattrini Li, Alberto | Dartmouth College |
| Mordohai, Philippos | Stevens Institute of Technology |
| Rekleitis, Ioannis | University of Delaware |
Keywords: Marine Robotics, Mapping, Visual-Inertial SLAM
Abstract: This paper presents a framework for multi-session mapping of underwater environments utilizing an affordable action camera. The Visual-Inertial data are augmented by water depth recordings from a dive computer. SVIn2, an open-source VI-SLAM framework, is utilized to generate a trajectory and a sparse reconstruction for each session. Utilizing the keyframes extracted from SVIn2 and the estimated camera poses, a Structure-from-Motion (SfM) framework, COLMAP, is employed for global optimization and to produce a dense reconstruction of the target environment. The presence of calibration targets at fixed locations, when available, is used to estimate the coordinate transformation between different data collection sessions, thus transforming the different sessions into the same coordinate frame. The proposed pipeline is employed for the mapping of a shipwreck off the coast of Barbados. For the first time, both the exterior and the accessible interior parts of the wreck were mapped in two sessions, while a third session employed two cameras with different fields of view.
|
| |
| 15:00-16:30, Paper ThI2I.243 | Add to My Program |
| CMoE: Contrastive Mixture of Experts for Motion Control and Terrain Adaptation of Humanoid Robots |
|
| Ma, Shihao | Fudan University |
| Chen, Hongjin | Fudan University |
| Xu, Zijun | Fudan University |
| Zhao, Yi | Fudan University |
| Wu, Ke | Fudan University |
| Yang, Ruichen | Fudan University |
| Zou, Leyao | Fudan University |
| Gan, Zhongxue | Fudan University |
| Ding, Wenchao | Fudan University |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Legged Robots
Abstract: For effective deployment in real-world environments, humanoid robots must autonomously navigate a diverse range of complex terrains with abrupt transitions. While the vanilla mixture-of-experts (MoE) framework is theoretically capable of modeling diverse terrain features, in practice, the gating network exhibits nearly uniform expert activations across different terrains, weakening the expert specialization and limiting the model's expressive power. To address this limitation, we introduce CMoE, a novel single-stage reinforcement learning framework that integrates contrastive learning to refine expert activation distributions. By imposing contrastive constraints, CMoE maximizes the consistency of expert activations within the same terrain while minimizing their similarity across different terrains, thereby encouraging experts to specialize in distinct terrain types. We validated our approach on the Unitree G1 humanoid robot through a series of challenging experiments. Results demonstrate that CMoE enables the robot to traverse continuous steps up to 20 cm high and gaps up to 80 cm wide, while achieving robust and natural gait across diverse mixed terrains, surpassing the limits of existing methods. To support further research and foster community development, we will release our code publicly.
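A generic supervised-contrastive loss over gating distributions captures the stated constraint, pulling same-terrain expert activations together and pushing different terrains apart; the sketch below is a stand-in under assumed batch conventions, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def gate_contrastive_loss(gate_probs: torch.Tensor, terrain_ids: torch.Tensor,
                          tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss on per-sample gating distributions:
    samples sharing a terrain id are positives, all others negatives."""
    z = F.normalize(gate_probs, dim=1)
    sim = z @ z.T / tau
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (terrain_ids[:, None] == terrain_ids[None, :]) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)   # avoid -inf * 0 on diagonal
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```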
|
| |
| 15:00-16:30, Paper ThI2I.244 | Add to My Program |
| Robust SG-NeRF: Robust Scene Graph Aided Neural Surface Reconstruction |
|
| Gu, Yi | Hong Kong University of Science and Technology (Guangzhou) |
| Ye, Dongjun | Hong Kong University of Science and Technology (Guangzhou) |
| Wang, Zhaorui | Hong Kong University of Science and Technology (Guangzhou) |
| Wang, Jiaxu | Hong Kong University of Science and Technology; Hong Kong University of Science and Technology (GZ); |
| Cao, Jiahang | The University of Hong Kong |
| Zhao, Mingle | University of Macau |
| Xu, Renjing | The Hong Kong University of Science and Technology (Guangzhou) |
|
|
| |
| 15:00-16:30, Paper ThI2I.245 | Add to My Program |
| Learning Load-Balanced Distributed Coverage for Robot Swarms Via Graph Attention Networks |
|
| Gao, Yun | Hong Kong University of Science and Technology (Guangzhou) |
| Gao, Hao | Hong Kong University of Science and Technology (Guangzhou) |
| Ma, Wenzong | Beihang University |
| Xiong, Hui | The Hong Kong University of Science and Technology (Guangzhou) |
| Ji, Yiding | Hong Kong University of Science and Technology (Guangzhou) |
|
|
| |
| 15:00-16:30, Paper ThI2I.246 | Add to My Program |
| Robust Localization for Autonomous Vehicles in Highway Scenes |
|
| Cheng, Daqian | Bot Auto |
| Ding, Xuchu | Archer Aviation |
| Wu, Yujia | Bot Auto |
| Zhang, Xiang | Bot Auto |
| Wang, Lei | Prexa AI |
Keywords: Localization, Autonomous Vehicle Navigation, Data Sets for SLAM
Abstract: Localization for autonomous vehicles on highways remains under-explored compared to urban roads, and state-of-the-art methods for urban scenes degrade when directly applied to highways. We identify key challenges including environment change under information homogeneity, heavy occlusion, degraded GNSS signals, and stringent downstream requirements on accuracy and latency. We propose a robust localization system to address highway challenges, which uses a dual-likelihood LiDAR front end that decouples 3D geometry structure and 2D road-texture cues to handle environment changes; a Control-EKF further leverages steering and acceleration commands to reduce lag and improve closed-loop behavior. An automated offline mapping and ground-truth pipeline keeps maps fresh at high cadence for optimal localization performance. To catalyze progress, we release a public dataset covering both urban roads and highways while focusing on representative challenging highway clips, totaling 163 km; benchmarking is standardized using product-oriented accuracy metrics and certified ground truth. Compared to Apollo and Autoware, our system performs similarly on urban roads but shows superior robustness on challenging highway scenarios. The system has been validated by over one million kilometers of road testing.
|
| |
| 15:00-16:30, Paper ThI2I.248 | Add to My Program |
| DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning |
|
| Gao, Zeyu | Institute of Automation, Chinese Academy of Sciences |
| Mu, Yao | Shanghai Jiao Tong University |
| Qu, Jinye | Institute of Automation, Chinese Academy of Sciences |
| Hu, Mengkang | University of Hong Kong |
| Peng, Shijia | Shanghai AI Laboratory |
| Hou, Chengkai | Peking University |
| Guo, Lingyue | Institute of Automation, Chinese Academy of Science |
| Luo, Ping | The University of Hong Kong |
| Zhang, Shanghang | Peking University |
| Lu, Yanfeng | Institute of Automation, Chinese Academy of Sciences |
Keywords: Task Planning, Bimanual Manipulation, Autonomous Agents
Abstract: Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods using Large Language Models (LLMs) suffer from a fundamental trade-off: generating linear sequences is efficient but fails to model parallelism and adapt to changes, while iterative querying is adaptive but too slow and costly. To bridge this gap, we introduce DAG-Plan, a novel task planning framework that, for the first time, employs a Directed Acyclic Graph (DAG) as the central representation for dual-arm coordination. The key insight is that a DAG natively captures complex sub-task dependencies and explicitly reveals opportunities for parallel execution. Within this framework, an LLM is used only once, as a powerful semantic parser that translates a natural language instruction into a structured DAG. During execution, our system dynamically assigns candidate nodes to the suitable arm based on real-time environmental observations, enabling truly adaptive and parallel operation. Extensive evaluation on a dual-arm kitchen benchmark shows that DAG-Plan's structured approach fundamentally outperforms existing paradigms. It achieves a 48% higher success rate than dual-arm single-query linear-sequence methods by robustly managing dependencies, and an 84.1% higher execution efficiency than iterative querying methods by eliminating the latency of repeated LLM calls. Our work demonstrates that a principled, graph-based representation is the key to unlocking efficient and reliable LLM-based planning for complex robotic systems. More demos and information are available at https://sites.google.com/view/dag-plan.
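The scheduling benefit of the DAG representation can be illustrated with a small dispatcher that assigns every dependency-free node to the earliest-free arm. The task graph, durations, and greedy arm-selection rule below are hypothetical, not taken from DAG-Plan.

```python
def dispatch(dag, durations):
    """dag: {task: set(prerequisites)}; returns per-arm schedules of (start, task)."""
    remaining = {t: set(deps) for t, deps in dag.items()}
    done, t_now = set(), 0.0
    arms = {"left": 0.0, "right": 0.0}              # time each arm becomes free
    schedule = {"left": [], "right": []}
    running = []                                    # (finish_time, task, arm)
    while len(done) < len(dag):
        busy = {r[1] for r in running}
        ready = [t for t, d in remaining.items() if not d and t not in done and t not in busy]
        for task in ready:
            arm = min(arms, key=arms.get)           # earliest-free arm
            start = max(arms[arm], t_now)
            arms[arm] = start + durations[task]
            running.append((arms[arm], task, arm))
            schedule[arm].append((start, task))
        # advance time to the next completion and release its dependents
        running.sort()
        finish, task, _ = running.pop(0)
        t_now, done = finish, done | {task}
        for d in remaining.values():
            d.discard(task)
    return schedule

dag = {"grab_pot": set(), "fill_water": {"grab_pot"}, "grab_veg": set(),
       "wash_veg": {"grab_veg"}, "boil": {"fill_water", "wash_veg"}}
print(dispatch(dag, {t: 1.0 for t in dag}))  # the two branches run in parallel
```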
|
| |
| 15:00-16:30, Paper ThI2I.249 | Add to My Program |
| CRAFT: Adapting VLA Models to Contact-Rich Manipulation Via Force-Aware Curriculum Fine-Tuning |
|
| Zhang, Yike | Hunan University |
| Wang, Yaonan | Hunan University |
| Sun, Xinxin | Hunan University |
| Huang, Kaizhen | Hunan University |
| Xu, Zhiyuan | Beijing Innovation Center of Humanoid Robotics |
| Junjie, Ji | Beijing Innovation Center of Humanoid Robotics |
| Che, Zhengping | Beijing Innovation Center of Humanoid Robotics |
| Tang, Jian | Beijing Innovation Center of Humanoid Robotics |
| Liu, Kangcheng | Hunan University |
| Sun, Jingtao | National University of Singapore |
Keywords: AI-Enabled Robotics, Imitation Learning, Force Control
Abstract: Vision-Language-Action (VLA) models have shown a strong capability in enabling robots to execute general instructions, yet they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. A fundamental challenge arises from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and unstable control. To address this, we introduce CRAFT, a force-aware curriculum fine-tuning framework that integrates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals initially, before progressively restoring access to the full multimodal information. To enable force-aware learning, we further design a homologous leader–follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments demonstrate that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.
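A minimal sketch of how a variational information bottleneck with an annealed curriculum coefficient might regulate the vision/language embeddings described above; the module structure and linear warmup schedule are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class VIBGate(nn.Module):
    """Stochastic bottleneck over an embedding; the KL term limits its information."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.logvar = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL(q(z|h) || N(0, I)): penalizing this caps the bits the policy can
        # extract from vision/language, forcing early reliance on force signals
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        return z, kl

def curriculum_beta(step: int, warmup: int = 10_000, beta_max: float = 1.0) -> float:
    """High bottleneck pressure early, annealed toward zero as training proceeds."""
    return beta_max * max(0.0, 1.0 - step / warmup)

# total loss (sketch): action_loss + curriculum_beta(step) * (kl_vision + kl_language)
```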
|
| |
| 15:00-16:30, Paper ThI2I.250 | Add to My Program |
| Resource Mapping with a Mobile Exploration Robot Using Spectral Mixture Ergodic Search |
|
| Hansen, Margaret | Carnegie Mellon University |
| Rao, Ananya | Carnegie Mellon University |
| Breitfeld, Abigail | Carnegie Mellon University |
| Wettergreen, David | Carnegie Mellon University |
Keywords: Mapping, Motion and Path Planning, Space Robotics and Automation
Abstract: Resource mapping and prospecting have become the focus of a number of proposed planetary exploration missions, particularly to locate water ice at the lunar south pole. Mobile robots, which are employed for exploration tasks in environments that are inaccessible to humans, collect the information in such missions. In these scenarios, intelligent and adaptive trajectory planning algorithms increase the accuracy of the resulting resource map, along with the efficiency with which information is gathered. In this work, we use ergodic search to generate a mobile robot trajectory that balances exploration and exploitation, while simultaneously mapping the spatial distribution of a resource by using Gaussian process regression with a spectral mixture kernel. The spatial correlation structure learned via Gaussian process regression informs the ergodic search about regions of high information, as well as the frequency components that appear in the map distribution. We call this method spectral mixture ergodic search (SM-ES) and demonstrate how it learns a map and updates the trajectory accordingly on three datasets: synthetic maps, an ice favorability index map for the lunar south polar region, and real mineral data from Cuprite, Nevada.
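For reference, the spectral mixture kernel used in such GP regression (Wilson and Adams, 2013) has a simple closed form in 1-D; the sketch below fixes the weights, frequencies, and variances that would normally be learned from data.

```python
import numpy as np

def spectral_mixture_kernel(x1, x2, w, mu, var):
    """k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q)."""
    tau = x1[:, None] - x2[None, :]               # pairwise input differences
    k = np.zeros_like(tau, dtype=float)
    for wq, mq, vq in zip(w, mu, var):            # one Gaussian per spectral peak
        k += wq * np.exp(-2 * np.pi**2 * tau**2 * vq) * np.cos(2 * np.pi * tau * mq)
    return k

x = np.linspace(0, 5, 50)
K = spectral_mixture_kernel(x, x, w=[1.0, 0.5], mu=[0.2, 1.0], var=[0.05, 0.1])
```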
|
| |
| 15:00-16:30, Paper ThI2I.251 | Add to My Program |
| RGS: Reflection-Aware Gaussian Splatting Via Learning Geometry Continuity for Reflective Objects |
|
| Du, Xiaobiao | University of Technology Sydney |
| Wang, Yida | Li Auto Inc |
| Bi, Cheng | Li Auto Inc |
| Zhan, Kun | LiAuto |
| Xin, Yu | Adelaide University |
Keywords: Visual Learning, RGB-D Perception
Abstract: Gaussian Splatting has significantly improved the quality of novel view synthesis with explicit Gaussian representation. However, we observed that existing 3D Gaussian Splatting (3DGS) methods often suffer from surface collapse on reflective regions, and thus produce inferior geometry and low-quality specular reflections. In this work, we propose a physically based deferred rendering framework, named Reflection-aware Gaussian Splatting (RGS), that can accurately model specular regions and improve novel view synthesis performance. Specifically, we found that a powerful 3D foundation model can provide a strong 3D geometric prior to foster correct geometric modeling. Based on this, we propose a cross-view shape consistency regularization to regularize the geometry surface with the large-model prior and cross-view constraints. In this manner, our RGS can produce smoother geometric surfaces on reflective regions while reducing geometric hollows. To further improve rendering results on reflective regions, we present a reflection-aware densification strategy that is designed to capture specular variations across various views. With this strategy, our RGS is able to render novel views of objects in higher quality. Extensive experiments demonstrate our method consistently renders high-quality reflective objects, achieving state-of-the-art performance.
|
| |
| 15:00-16:30, Paper ThI2I.252 | Add to My Program |
| RelMap: Enhancing Online Map Construction with Class-Aware Spatial Relation and Semantic Priors |
|
| Cai, Tianhui | University of California, Los Angeles |
| Zhang, Yun | University of California, Los Angeles |
| Zhou, Zewei | University of California, Los Angeles |
| Huang, Zhiyu | University of California, Los Angeles |
| Ma, Jiaqi | University of California, Los Angeles |
Keywords: Intelligent Transportation Systems, Computer Vision for Automation, Computer Vision for Transportation
Abstract: Online high-definition (HD) map construction is crucial for scaling autonomous driving systems. While Transformer-based methods have become prevalent in online HD map construction, most existing approaches overlook the inherent spatial dependencies and semantic relationships among map elements, which constrains their accuracy and generalization capabilities. To address this, we propose RelMap, an end-to-end framework that explicitly models both spatial relations and semantic priors to enhance online HD map construction. Specifically, we introduce a Class-aware Spatial Relation Prior, which explicitly encodes relative positional dependencies between map elements using a learnable class-aware relation encoder. Additionally, we design a Mixture-of-Experts-based Semantic Prior, which routes features to class-specific experts based on predicted class probabilities, refining instance feature decoding. RelMap is compatible with both single-frame and temporal perception backbones, achieving state-of-the-art performance on both the nuScenes and Argoverse 2 datasets.
|
| |
| 15:00-16:30, Paper ThI2I.253 | Add to My Program |
| Leveraging Geometric Prior Uncertainty and Complementary Constraints for High-Fidelity Neural Indoor Surface Reconstruction |
|
| Feng, Qiyu | Shanghai Jiao Tong University |
| Shan, Jiwei | The Chinese University of Hong Kong |
| Cheng, Shing Shin | The Chinese University of Hong Kong |
| Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: RGB-D Perception, Deep Learning for Visual Perception, Visual Learning
Abstract: Neural implicit surface reconstruction with signed distance functions has made significant progress, but recovering fine details such as thin structures and complex geometries remains challenging due to unreliable or noisy geometric priors. Existing approaches rely on implicit uncertainty that arises during optimization to filter these priors, which is indirect and inefficient, and masking supervision in high-uncertainty regions further leads to under-constrained optimization. To address these issues, we propose GPU-SDF, a neural implicit framework for indoor surface reconstruction that leverages geometric prior uncertainty and complementary constraints. We introduce a self-supervised module that explicitly estimates prior uncertainty without auxiliary networks. Based on this estimation, we design an uncertainty-guided loss that modulates prior influence rather than discarding it, thereby retaining weak but informative cues. To address regions with high prior uncertainty, GPU-SDF further incorporates two complementary constraints: an edge distance field that strengthens boundary supervision and a multi-view consistency regularization that enforces geometric coherence. Extensive experiments confirm that GPU-SDF improves the reconstruction of fine details and serves as a plug-and-play enhancement for existing frameworks. The source code will be released upon acceptance.
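One plausible reading of an uncertainty-guided loss that modulates rather than masks the prior is an inverse-variance weighting of prior residuals. The weighting form below is an assumption for illustration, not the paper's exact formula.

```python
import torch

def uncertainty_guided_prior_loss(pred_depth: torch.Tensor,
                                  prior_depth: torch.Tensor,
                                  prior_sigma: torch.Tensor) -> torch.Tensor:
    """Heteroscedastic-style weighting: confident priors dominate, noisy ones
    are down-weighted continuously instead of being discarded outright."""
    w = 1.0 / (prior_sigma**2 + 1e-6)              # inverse-variance weights
    residual = (pred_depth - prior_depth).abs()
    # log-sigma term discourages the trivial "everything uncertain" solution
    return (w * residual).mean() + torch.log(prior_sigma + 1e-6).mean()
```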
|
| |
| 15:00-16:30, Paper ThI2I.254 | Add to My Program |
| PROFusion: Robust and Accurate Dense Reconstruction Via Camera Pose Regression and Optimization |
|
| Dong, Siyan | The University of Hong Kong |
| Wang, Zijun | The University of Hong Kong |
| Cai, Lulu | The University of Hong Kong |
| Ma, Yi | The University of Hong Kong |
| Yang, Yanchao | The University of Hong Kong |
Keywords: SLAM, Localization, Mapping
Abstract: Real-time dense scene reconstruction during unstable camera motions is crucial for robotics, yet current RGB-D SLAM systems fail when cameras experience large viewpoint changes, fast motions, or sudden shaking. Classical optimization-based methods deliver high accuracy but fail with poor initialization during large motions, while learning-based approaches provide robustness but lack sufficient accuracy for dense reconstruction. We address this challenge through a combination of learning-based initialization with optimization-based refinement. Our method employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry. Extensive experiments demonstrate promising results: our approach outperforms the best competitor on challenging benchmarks, while maintaining comparable accuracy on stable motion sequences. The system operates in real-time, showcasing that combining simple and principled techniques can achieve both robustness for unstable motions and accuracy for dense reconstruction. Code released: https://github.com/siyandong/PROFusion.
|
| |
| 15:00-16:30, Paper ThI2I.255 | Add to My Program |
| EIMC: Efficient Instance-Aware Multi-Modal Collaborative Perception |
|
| Yang, Kang | Renmin University of China |
| Wang, Peng | Renmin University of China |
| Li, Lantao | Sony (China) Limited |
| Bu, Tianci | National University of Defense and Technology |
| Sun, Chen | Sony |
| Li, Deying | Renmin University of China |
| Wang, Yongcai | Renmin University of China |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Multi-modal collaborative perception has attracted great attention as a way to enhance the safety of autonomous driving. However, current multi-modal approaches follow a "local fusion, then communication" sequence, which fuses multi-modal data locally and needs high bandwidth to transmit an individual agent's feature data before collaborative fusion. EIMC instead introduces an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the Top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that collects the top-K most confident instances from each agent and enhances their features using self-attention. This instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code will be publicly released.
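The heatmap-driven selection step can be sketched as keeping the K instances whose centers fall in the ego's lowest-confidence pixels; tensor shapes and the scoring rule are assumptions for illustration, not EIMC's implementation.

```python
import torch

def select_query_instances(conf_heatmap: torch.Tensor,     # (H, W) ego confidence
                           instance_centers: torch.Tensor, # (N, 2) pixel coords (x, y)
                           k: int = 16) -> torch.Tensor:
    """Return indices of the K instances most in need of peer information."""
    h, w = conf_heatmap.shape
    xs = instance_centers[:, 0].long().clamp(0, w - 1)
    ys = instance_centers[:, 1].long().clamp(0, h - 1)
    need = 1.0 - conf_heatmap[ys, xs]           # low confidence -> high need
    topk = torch.topk(need, k=min(k, need.numel()))
    return topk.indices                         # only these are queried from peers
```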
|
| |
| 15:00-16:30, Paper ThI2I.256 | Add to My Program |
| Robustly Constrained Dynamic Games Via System Level Synthesis |
|
| Zhan, Shuyu | Georgia Institute of Technology |
| Chiu, Chih-Yuan | Georgia Institute of Technology |
| Leeman, Antoine | ETH Zurich |
| Chou, Glen | Georgia Institute of Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Optimization and Optimal Control, Robot Safety
Abstract: We propose a system-level synthesis (SLS) framework for robust dynamic games with nonlinear dynamics corrupted by state-dependent additive noise, and nonlinear agent-specific and shared constraints. Each agent designs a nominal trajectory and a causal affine error feedback law to minimize its own cost while ensuring that its own constraints and the shared constraints are satisfied, even under worst-case noise realizations. Building on these nonlinear safety certificates, we define the novel notion of a robustly constrained Nash equilibrium (RCNE). We then present an Iterative Best Response (IBR)-based algorithm that iteratively refines the optimal trajectory and controller for each agent until approximate convergence to the RCNE. We evaluated our method in simulations and hardware experiments involving large numbers of robots with high-dimensional nonlinear dynamics, as well as state-dependent dynamics noise. Across all experiment settings, our method generated trajectory rollouts that robustly avoid collisions, while a baseline game-theoretic algorithm for producing open-loop motion plans failed to generate trajectories that satisfy constraints.
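The outer loop of such an IBR scheme is simple to sketch: each agent re-solves its own problem with the others' latest plans held fixed, until plans stop changing. Here `best_response` is a stand-in for the SLS-based trajectory and controller optimization, which the sketch does not attempt to reproduce.

```python
import numpy as np

def ibr(agents, best_response, init_plans, tol=1e-4, max_iters=50):
    """Iterate best responses until an approximate equilibrium is reached."""
    plans = dict(init_plans)                     # agent id -> plan (numpy array)
    for _ in range(max_iters):
        max_change = 0.0
        for i in agents:
            others = {j: plans[j] for j in agents if j != i}
            new_plan = best_response(i, others)  # solve agent i's robust problem
            max_change = max(max_change, float(np.linalg.norm(new_plan - plans[i])))
            plans[i] = new_plan
        if max_change < tol:                     # no agent wants to deviate much
            break
    return plans
```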
|
| |
| 15:00-16:30, Paper ThI2I.257 | Add to My Program |
| ManipForce: Force-Guided Policy Learning with Frequency-Aware Representation for Contact-Rich Manipulation |
|
| Lee, Geonhyup | Gwangju Institute of Science and Technology |
| Lee, Youngjin | Gwangju Institute of Science and Technology |
| Kim, Kangmin | Gwangju Institute of Science and Technology |
| Lee, Seongju | Gwangju Institute of Science and Technology (GIST) |
| Noh, Sangjun | Gwangju Institute of Science and Technology (GIST) |
| Back, Seunghyeok | Korea Institute of Machinery & Materials |
| Lee, Kyoobin | Gwangju Institute of Science and Technology |
Keywords: Imitation Learning, Learning from Demonstration, Force and Tactile Sensing
Abstract: Contact-rich manipulation tasks such as precision assembly require precise control of interaction forces, yet existing imitation learning methods rely mainly on vision-only demonstrations. We propose ManipForce, a handheld system designed to capture high-frequency force–torque (F/T) and RGB data during natural human demonstrations for contact-rich manipulation. Building on these demonstrations, we introduce the Frequency-Aware Multimodal Transformer (FMT). FMT encodes asynchronous RGB and F/T signals using frequency- and modality-aware embeddings and fuses them via bi-directional cross-attention within a transformer diffusion policy. Through extensive experiments on six real-world contact-rich manipulation tasks—such as gear assembly, box flipping, and battery insertion—FMT trained on ManipForce demonstrations achieves robust performance with an average success rate of 83% across all tasks, substantially outperforming RGB-only baselines. Ablation and sampling-frequency analyses further confirm that incorporating high-frequency F/T data and cross-modal integration improves policy performance, especially in tasks demanding high precision and stable contact. Hardware, software, and video demos are available at: https://sites.google.com/view/manipforce/.
|
| |
| 15:00-16:30, Paper ThI2I.258 | Add to My Program |
| VistaBot: View-Robust Robot Manipulation Via Spatiotemporal-Aware View Synthesis |
|
| Gu, Songen | Fudan University |
| Zheng, Yuhang | National University of Singapore |
| Li, Weize | Tsinghua University |
| Zheng, Yupeng | University of Chinese Academy of Sciences |
| Feng, Yating | National University of Singapore |
| Li, Xiang | Tsinghua University |
| Chen, Yilun | Tsinghua University |
| Li, Pengfei | Tsinghua University |
| Ding, Wenchao | Fudan University |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when training with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based (pi_0) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79× and 2.63× over ACT and pi_0, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models of VistaBot will be open-sourced to facilitate future research on view-robust robotic manipulation.
|
| |
| 15:00-16:30, Paper ThI2I.259 | Add to My Program |
| Data Scaling Laws for Imitation Learning-Based End-To-End Autonomous Driving |
|
| Zheng, Yupeng | Institute of Automation, Chinese Academy of Sciences |
| Yang, Pengxuan | University of Chinese Academy of Sciences (UCAS) |
| Xia, Zhongpu | Institute of Automation, Chinese Academy of Sciences |
| Zhang, Qichao | Institute of Automation, Chinese Academy of Sciences |
| Zheng, Yuhang | National University of Singapore |
| Lu, Ben | Li Auto |
| Zhang, Teng | Li Auto |
| Han, Chao | Li Auto |
| Li, Weize | Clemson University |
| Gu, Songen | Fudan University |
| Lang, Xianpeng | Li Auto |
| Lan, Xiangyuan | Pengcheng Laboratory |
| Zhao, Dongbin | Chinese Academy of Sciences |
Keywords: Vision-Based Navigation, Autonomous Vehicle Navigation, Imitation Learning
Abstract: The end-to-end autonomous driving paradigm has recently attracted significant attention due to its scalability. However, existing methods are constrained by the limited scale of real-world data, which hinders a comprehensive exploration of the scaling laws associated with end-to-end autonomous driving. To address this issue, we collected substantial data from various driving scenarios and behaviors and conducted an extensive study on the scaling laws of existing imitation learning-based end-to-end autonomous driving paradigms. Specifically, approximately 4 million demonstrations from 23 different scenario types were gathered, amounting to over 30,000 hours of driving demonstrations. We performed open-loop evaluations and closed-loop simulation evaluations in 1,400 diverse driving demonstrations (1,300 for open-loop and 100 for closed-loop) under stringent assessment conditions. Through experimental analysis, we discovered that (1) in open-loop evaluation, the performance of the driving model exhibits a power-law relationship with the amount of data, but this does not hold in closed-loop evaluation; the inconsistency between the two assessments shifts our focus toward the distribution of data rather than merely expanding its volume; (2) a small increase in the quantity of long-tailed data can significantly improve performance in the corresponding scenarios; (3) appropriate scaling of data enables the model to achieve combinatorial generalization in novel scenes and actions. Our results highlight the critical role of data scaling in improving the generalizability of models across diverse autonomous driving scenarios, assuring safe deployment in the real world.
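A power-law relation of the reported kind, error ≈ a · N^b, is conventionally checked with a linear fit in log-log space; the data points below are synthetic placeholders, not the paper's measurements.

```python
import numpy as np

n_demos = np.array([1e4, 5e4, 2e5, 1e6, 4e6])       # training-set sizes
error = np.array([0.90, 0.55, 0.33, 0.18, 0.11])    # open-loop error (synthetic)

# fit log(error) = b * log(N) + log(a); a straight line here implies a power law
b, log_a = np.polyfit(np.log(n_demos), np.log(error), deg=1)
print(f"error ~ {np.exp(log_a):.3f} * N^{b:.3f}")   # negative exponent expected
```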
|
| |
| 15:00-16:30, Paper ThI2I.260 | Add to My Program |
| A Dual-Field Framework for Urban Low-Altitude UAV Traffic Planning and Management |
|
| Dong, Zhaoqi | Beijing Institute of Technology |
| Chen, Lei | Beijing Institute of Technology |
Keywords: Intelligent Transportation Systems, Aerial Systems: Applications, Planning, Scheduling and Coordination
Abstract: Urban Air Mobility is emerging as a transformative mode of transportation, but its integration into complex low-altitude urban environments requires systematic consideration of safety and efficiency. This study aims to develop a computational framework that enables structured traffic organization while accounting for spatially variant risks. The framework introduces a dual-field environmental model that couples a traversability field, which quantifies continuous anisotropic risk, with a scalar potential field, which encodes macroscopic traffic flow. The path planning formulation computes geodesics under an anisotropic metric derived from the dual field, and the centralized coordination mechanism updates the fields to maintain real-time deconfliction. Simulation results demonstrate that the proposed framework generates paths that reduce exposure to high-risk regions to a negligible level and achieve a substantial reduction in average curvature compared to a baseline planner. Furthermore, the local update mechanism provides significant computational speedup for dynamic real-time scenarios. These results validate the capability of the dual-field framework to unify safety and efficiency in urban airspace management, providing a scalable foundation for future unmanned traffic management systems.
|
| |
| 15:00-16:30, Paper ThI2I.261 | Add to My Program |
| A Novel Dual-Spherical Intelligent Pipeline Robot for Leak Detection |
|
| Yan, Zefeng | Shanxi University |
| Wei, Lei | Huazhong University of Science and Technology |
| Lu, Liang | Huazhong University of Science and Technology |
| Yang, Zhou | Huazhong University of Science and Technology |
| Wang, Jiacheng | Huazhong University of Science and Technology |
| Han, Bin | Huazhong University of Science and Technology |
Keywords: Industrial Robots, Product Design, Development and Prototyping, Robotics in Hazardous Fields
Abstract: To address the challenge of achieving both low drag and high maneuverability in complex water-filled pipeline environments, this study proposes a novel dual-spherical pipeline robot with integrated leak detection and mapping capabilities. A multi-objective optimization framework was established to simultaneously improve hydrodynamic performance, motion stability, and internal spatial layout, while a streamlined shell design was adopted to meet both low-drag and sensor-integration requirements. Based on a task-driven configuration optimization method, an energy-efficient propeller arrangement was derived under the constraint of maintaining maneuvering performance. The robot employs a helical differential propulsion system and integrates multiple sensors, including a vision module, an inertial navigation unit, and a pressure sensor, to enable leak detection and mapping. Its fully sealed spherical housing ensures stable operation in water-filled pipelines. Based on the proposed configuration, an experimental platform incorporating four representative pipeline environments was constructed, and a series of inspection, mapping, and environmental adaptability tests were conducted. The results demonstrate that the robot can achieve agile turning and stable locomotion in water-filled pipelines, showing strong potential for practical engineering applications.
|
| |
| 15:00-16:30, Paper ThI2I.262 | Add to My Program |
| WorldPlanner: Monte Carlo Tree Search and MPC with Action-Conditioned Visual World Models |
|
| Khorrambakht, Rooholla | New York University |
| Ortiz-Haro, Joaquim | New York University |
| Amigo, Joseph | New York University |
| Mostafa, Omar | New York University Abu Dhabi |
| Dugas, Daniel | Meta FAIR |
| Meier, Franziska | Meta FAIR |
| Righetti, Ludovic | New York University |
Keywords: Integrated Planning and Learning, Learning from Demonstration, Learning from Experience
Abstract: Robots must understand their environment from raw sensory inputs and reason about the consequences of their actions in it to solve complex tasks. Behavior Cloning (BC) leverages task-specific human demonstrations to learn this knowledge as end-to-end policies. However, these policies are difficult to transfer to new tasks, and generating training data is challenging because it requires careful demonstrations and frequent environment resets. In contrast to this policy-based view, in this paper we take a model-based approach: we collect a few hours of unstructured, easy-to-collect play data to learn an action-conditioned visual world model, a diffusion-based action sampler, and optionally a reward model. The world model, in combination with the action sampler and a reward model, is then used to optimize long sequences of actions with a Monte Carlo Tree Search (MCTS) planner. The resulting plans are executed on the robot via a zeroth-order Model Predictive Controller (MPC). We show that the action sampler mitigates hallucinations of the world model during planning, and we validate our approach on 3 real-world robotic tasks with varying levels of planning and modeling complexity. Our experiments support the hypothesis that planning leads to a significant improvement over BC baselines on a standard manipulation test environment.
|
| |
| 15:00-16:30, Paper ThI2I.263 | Add to My Program |
| StreamVLN: Streaming Vision-And-Language Navigation Via SlowFast Context Modeling |
|
| Wei, Meng | The University of Hong Kong |
| Wan, Chenyang | Zhejiang University |
| Yu, Xiqian | University of Science and Technology of China |
| Wang, Tai | Shanghai AI Laboratory |
| Yang, Yuqiang | South China University of Technology |
| Mao, Xiaohan | Shanghai Jiao Tong University |
| Zhu, Chenming | University of Hong Kong |
| Cai, Wenzhe | Southeast University |
| Wang, Hanqing | Beijing Institute of Technology |
| Chen, Yilun | The Chinese University of Hong Kong |
| Liu, Xihui | The University of Hong Kong |
| Pang, Jiangmiao | Shanghai AI Laboratory |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Visual Learning
Abstract: Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of multi-turn dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves real-time dialogues through KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks show state-of-the-art performance with low latency, ensuring robustness and efficiency in real-world deployment.
|
| |
| 15:00-16:30, Paper ThI2I.265 | Add to My Program |
| UTTG: A Universal Teleoperation Framework Via Online Trajectory Generation |
|
| Fang, ShengJian | Shanghai Jiao Tong University |
| Zhou, Yixuan | Shanghai Jiao Tong University |
| Zheng, Yu | Shanghai Jiao Tong University |
| Jiang, Pengyu | Shanghai Jiao Tong University |
| Liu, Siyuan | Shanghai Jiao Tong University |
| Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Telerobotics and Teleoperation, Task and Motion Planning, Physical Human-Robot Interaction
Abstract: Teleoperation is crucial for hazardous environment operations and serves as a key tool for collecting expert demonstrations in robot learning. However, existing methods face dependence on specific robotic hardware and control-frequency mismatches between teleoperation devices and robotic platforms. Our approach introduces a unified interface that automatically extracts kinematic parameters from Unified Robot Description Format (URDF) files, enabling plug-and-play deployment across diverse robotic systems. The proposed interpolation algorithm bridges the frequency gap between low-rate human inputs and high-frequency robotic control commands through online continuous trajectory generation, without requiring access to the closed, bottom-level control loop. To further reduce latency, a joint prediction module is incorporated to anticipate operator intent and compensate for delays. Moreover, we introduce a minimum-stretch spline to optimize motion smoothness and quality. The system supports both precision and rapid operation modes for different task requirements. Experiments on three robotic platforms, including dual-arm setups, demonstrate our framework's generality, smoothness, and responsiveness. Teleoperation latency remains below 50 ms at 30 Hz input and approaches 15 ms at 200 Hz input. The code is developed in C++ with a Python interface and is available at https://github.com/IRMV-Manipulation-Group/UTTG.
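The frequency-bridging step can be pictured as upsampling each low-rate waypoint interval into many high-rate setpoints with a C1-continuous cubic Hermite segment. The rates and finite-difference velocity estimates below are assumptions for illustration; the paper's minimum-stretch spline is more sophisticated.

```python
import numpy as np

def hermite_upsample(q_prev, q_next, v_prev, v_next, dt_in, n_out):
    """Interpolate one low-rate interval into n_out high-rate joint setpoints."""
    s = np.linspace(0.0, 1.0, n_out, endpoint=False)[:, None]  # normalized time
    h00 = 2*s**3 - 3*s**2 + 1       # cubic Hermite basis functions
    h10 = s**3 - 2*s**2 + s
    h01 = -2*s**3 + 3*s**2
    h11 = s**3 - s**2
    return (h00 * q_prev + h10 * dt_in * v_prev
            + h01 * q_next + h11 * dt_in * v_next)

q0, q1 = np.zeros(7), np.full(7, 0.1)            # two 7-DoF operator waypoints
v0, v1 = np.zeros(7), (q1 - q0) / (1 / 30)       # crude finite-difference velocities
cmds = hermite_upsample(q0, q1, v0, v1, dt_in=1/30, n_out=int(500 / 30))  # 30 Hz -> 500 Hz
```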
|
| |
| 15:00-16:30, Paper ThI2I.266 | Add to My Program |
| Transformation-Domain Gaussian Smoothing for Translational Direct Visual Servoing |
|
| Nasir, Amneh | Université De Picardie Jules Verne (UPJV) |
| Kachi, Djemaa | University of Picardy |
| André, Antoine N. | AIST |
| Caron, Guillaume | CNRS |
Keywords: Visual Servoing, Optimization and Optimal Control
Abstract: Direct visual servoing (DVS) uses raw pixel intensities to control robot motion, yielding high accuracy at convergence. However, the associated photometric cost function is highly nonconvex, which leads to a narrow domain of convergence due to local minima. This work addresses that issue by adapting a Gaussian homotopy framework for cost function smoothing from cross-correlation to the sum of squared differences (SSD) objective used in DVS. The result is a spatially varying, transformation-domain kernel that depends on the motion model, producing smoother cost landscapes and enlarging the convergence basin. We first apply the smoothing to an SSD cost, derive its corresponding transformation kernel for the motion model in the camera domain, and then incorporate it into a DVS control law. The method is compared against uniform image-domain blurring via Photometric Gaussian Mixtures. Experiments with an eye-in-hand robotic arm setup over three translational degrees of freedom and with different initial poses show that cost smoothing significantly increases the convergence domain while preserving the accuracy of DVS.
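The homotopy principle is easiest to see in the simpler image-domain variant that the paper compares against: optimize the SSD cost under decreasing Gaussian blur, warm-starting each sharper level from the last. The `warp` and `optimize` callables are placeholders; the paper's contribution is a transformation-domain kernel rather than this uniform blurring.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ssd(a, b):
    return float(np.sum((a - b) ** 2))

def homotopy_optimize(image, target, warp, optimize, sigmas=(8, 4, 2, 1, 0)):
    """Coarse-to-fine: each sigma yields a smoother, wider-basin cost landscape."""
    pose = np.zeros(3)                             # e.g. 3-DoF translation
    for s in sigmas:
        blur = (lambda im: gaussian_filter(im, s)) if s > 0 else (lambda im: im)
        cost = lambda p: ssd(blur(warp(image, p)), blur(target))
        pose = optimize(cost, pose)                # local solver, warm-started
    return pose
```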
|
| |
| 15:00-16:30, Paper ThI2I.267 | Add to My Program |
| Electrospun TPU/LCE Composite Fibers for High-Performance Biomimetic Tendon Actuation |
|
| Luo, Yongzheng | Shanghai University |
| Gao, Shen | Shanghai University |
| Tang, Mingjun | Shanghai University |
| Wang, Yue | Shanghai University |
| Yue, Tao | Shanghai University |
Keywords: Soft Robot Materials and Design, Additive Manufacturing, Tendon/Wire Mechanism
Abstract: Traditional rigid actuators in soft robotics, particularly for bionic hands, suffer from structural complexity, bulkiness, and limited biomimetic motion. To address these limitations, we developed an electrospun composite fiber membrane composed of thermoplastic polyurethane (TPU) and liquid crystal elastomer (LCE), and demonstrated its feasibility as a tendon-like soft actuator in an artificial finger. TPU provides elasticity and mechanical robustness, while LCE contributes reversible thermal contraction as the actuation unit. The resulting TPU-LCE fibers exhibit high flexibility comparable to biological muscle and outstanding actuation performance. Under thermal stimulation, the actuator achieved a contraction strain of up to 44.4% and a load-bearing capacity exceeding 3,500 times its own weight, while maintaining durability over 120 actuation cycles without significant degradation. Integrated into a tendon-driven biomimetic finger, the actuator enabled smooth and natural joint motions, closely resembling human finger flexion–extension gestures. This work presents a reliable and scalable bio-inspired actuation strategy, offering promising potential for soft robotics applications.
|
| |
| 15:00-16:30, Paper ThI2I.268 | Add to My Program |
| Foundational World Models Accurately Detect Bimanual Manipulator Failures |
|
| Ward, Isaac Ronald | Stanford University |
| Ho, Michelle | Stanford University |
| Liu, Houjun | Stanford |
| Feldman, Aaron | Stanford University |
| Vincent, Joseph | Stanford University |
| Kruse, Liam | Stanford University |
| Cheong, Sean | Watney Robotics |
| Eddy, Duncan | Stanford University |
| Kochenderfer, Mykel | Stanford University |
| Schwager, Mac | Stanford University |
Keywords: Big Data in Robotics and Automation, Failure Detection and Recovery, AI-Based Methods
Abstract: It is currently challenging to deploy visuomotor robots at scale due to the potential of anomalous failures degrading performance, causing damage, or endangering human life. Bimanual manipulators are no exception; these robots have vast state spaces comprising high-dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history-informed world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions that serve as non-conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push-T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures on a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out-of-distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach outperforms the next best learning-based approach by 3.8% in terms of failure detection rate, despite requiring approximately one twentieth of the trainable parameters (due to our use of foundation models for image compression). This level of robustness is a crucial step toward safely deploying manipulator robots at scale in real-world environments where reliability is non-negotiable.
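The conformal-prediction step reduces to calibrating a score threshold on nominal runs and flagging test-time scores above it. The sketch below uses synthetic calibration scores and the standard finite-sample-corrected quantile; it is not the paper's monitor.

```python
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.05) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q, 1.0), method="higher"))

cal = np.random.default_rng(0).gamma(2.0, 1.0, size=500)  # nominal-run scores (synthetic)
tau = conformal_threshold(cal, alpha=0.05)
is_anomalous = lambda score: score > tau                   # runtime failure monitor
```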
|
| |
| 15:00-16:30, Paper ThI2I.269 | Add to My Program |
| Learning Collision-Free Object Goal Pushing for Quadruped Robots with Safe Corridors |
|
| Lai, Gabriel | City University of Hong Kong |
| Wong, Yi | City University of Hong Kong |
| Yeung, Chung Yui | City University of Hong Kong |
| Xu, Shaohang | City University of Hong Kong |
| Chen, Zhi | The Chinese University of Hong Kong |
| Ho, Chin Pang | City University of Hong Kong |
Keywords: Reinforcement Learning, Deep Learning in Grasping and Manipulation, Collision Avoidance
Abstract: While recent advancements in reinforcement learning have enabled quadrupedal robots to perform non-prehensile manipulation tasks like pushing, existing methods have largely overlooked the critical challenge of obstacle avoidance. In this paper, we address this significant limitation by introducing a novel reinforcement learning (RL) framework that controls a quadrupedal robot to push large objects in cluttered, real-world environments. In particular, obstacle avoidance is integrated as a primary objective directly into the policy training process. To achieve this, we propose to represent the traversable space with a low-dimensional safe corridor, a method that is both computationally efficient and highly effective. This approach avoids the need for complex and resource-intensive training pipelines typically required for processing high-dimensional sensor data. We validate our policy through extensive experiments in both simulation and the real world. The implementation code will be released to benefit the research community.
|
| |
| 15:00-16:30, Paper ThI2I.270 | Add to My Program |
| Gotta Scoop 'Em All: Sim-And-Real Co-Training of Graph-Based Neural Dynamics for Long-Horizon Scooping |
|
| Hong, Kaiwen | University of Illinois at Urbana Champaign |
| Chen, Haonan | University of Illinois at Urbana-Champaign |
| Xu, Jiaming | University of Illinois Urbana-Champaign |
| Wang, Runxuan | University of Illinois at Urbana Champaign |
| Wang, Kaylan | University of Illinois at Urbana Champaign |
| Zhang, Mingtong | UIUC |
| Liu, Shuijing | The University of Texas at Austin |
| Zhu, Yifan | University of Illinois Chicago |
| Li, Yunzhu | Columbia University |
| Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Keywords: Service Robotics, Behavior-Based Systems
Abstract: Robotic manipulation of granular objects is crucial in various fields, yet modeling their complex dynamics and diverse physical properties remains challenging. Simulation plays an important role in learning robotic manipulation policies, but accurately capturing these dynamics and physical properties from visual observations alone is difficult. The difficulty is further compounded in tasks involving intricate contact mechanisms, particularly when using tools with complex shapes like spoons to interact with granular objects, resulting in a significant sim-to-real gap. To address this, we introduce a novel task of scooping all objects out of a storage container using a spoon, which requires sophisticated modeling of multi-object interactions. We propose a unified framework that combines rich simulation data with a small amount of real-world data. Rather than optimizing physical parameters in simulation, we learn a graph-based neural dynamics model in simulation and fine-tune it on real-world data. We then employ a Monte-Carlo Tree Search (MCTS)-based planner to accomplish long-horizon decision-making. Our system successfully scoops out three types of objects, demonstrating its potential for real-world applications. This work highlights the benefits of leveraging both simulation and real-world data to tackle the sim-to-real gap in contact-rich manipulation tasks.
|
| |
| 15:00-16:30, Paper ThI2I.271 | Add to My Program |
| See, Plan, Cut: MPC-Based Autonomous Volumetric Robotic Laser Surgery with OCT Guidance |
|
| Prakash, Ravi | Duke University |
| Wang, Vincent | Duke University |
| Mishra, Arpit | Duke University |
| Yuliarti, Devi | Duke University |
| Zhong, Pei | Duke University |
| McNabb, Ryan | Duke University |
| Codd, Patrick | Duke University |
| Bridgeman, Leila | Duke University |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Hardware-Software Integration in Robotics
Abstract: Robotic laser systems enable sub-millimeter, non-contact tissue resection, yet existing platforms lack volumetric planning and intraoperative feedback. We present RATS (Robot-Assisted Tissue Surgery), an intelligent optical coherence tomography (OCT)-guided robotic platform for autonomous volumetric soft tissue resection. RATS integrates macro-scale RGB-D imaging, micro-scale OCT, and a fiber-coupled surgical laser, calibrated through a novel multistage alignment pipeline that achieves OCT-to-laser calibration accuracy of 0.161 ± 0.031 mm. A super-Gaussian laser–tissue interaction (LTI) model characterizes ablation morphology with an average root-mean-square error (RMSE) of 0.231 ± 0.121 mm, outperforming Gaussian baselines. A sampling-based model predictive control (MPC) framework operates directly on OCT voxel data to generate closed-loop, constraint-aware resection trajectories, achieving 0.842 mm RMSE and improving intersection-over-union agreement by 64.8% compared to feedforward execution. RATS also detects and preserves subsurface structures. To our knowledge, this is the first demonstration of closed-loop autonomous volumetric robotic laser resection with OCT guidance, enabling precise, obstacle-aware tissue removal with potential applications in neurosurgery.
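A super-Gaussian ablation profile of the kind described, flatter-topped and steeper-walled than a Gaussian, can be written as d(r) = d0 · exp(-2 (r/w)^(2p)), with p = 1 recovering the ordinary Gaussian; the parameter values below are illustrative, not fitted.

```python
import numpy as np

def super_gaussian_depth(r, d0=0.5, w=0.4, p=3.0):
    """Crater depth vs. radial distance.
    d0: peak depth (mm); w: beam radius (mm); p: super-Gaussian order."""
    return d0 * np.exp(-2.0 * (np.abs(r) / w) ** (2.0 * p))

r = np.linspace(-1.0, 1.0, 201)
profile = super_gaussian_depth(r)   # flatter top, steeper walls than a Gaussian
```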
|
| |
| 15:00-16:30, Paper ThI2I.272 | Add to My Program |
| Learning Task-Invariant Properties Via Dreamer: Enabling Efficient Policy Transfer for Quadruped Robots |
|
| Liang, Junyang | Shenzhen University |
| Liu, Yuxuan | Shenzhen University |
| Chang, Yanbin | Shenzhen University |
| Lin, Junfan | Peng Cheng Laboratory |
| Ji, JunKai | Shenzhen University |
| Li, Hui | Shenzhen University |
| Huang, Changxin | Shenzhen University |
| Li, Jianqiang | Shenzhen University |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Legged Robots
Abstract: Achieving quadruped robot locomotion across diverse and dynamic terrains presents significant challenges, primarily due to the discrepancies between simulation environments and real-world conditions. Traditional sim-to-real transfer methods often rely on manual feature design or costly real-world fine-tuning. To address these limitations, this paper proposes the DreamTIP framework, which incorporates Task-Invariant Properties learning within the Dreamer world model architecture to enhance sim-to-real transfer capabilities. Guided by large language models, DreamTIP identifies and leverages Task-Invariant Properties, such as contact stability and terrain clearance, which exhibit robustness to dynamic variations and strong transferability across tasks. These properties are integrated into the world model as auxiliary prediction targets, enabling the policy to learn representations that are insensitive to underlying dynamic changes. Furthermore, an efficient adaptation strategy is designed, employing a mixed replay buffer and regularization constraints to rapidly calibrate to real-world dynamics while effectively mitigating representation collapse and catastrophic forgetting. Extensive experiments on complex terrains, including Stair, Climb, Tilt, and Crawl, demonstrate that DreamTIP significantly outperforms state-of-the-art baselines in both simulated and real-world environments. Our method achieves an average performance improvement of 28.1% across eight distinct simulated transfer tasks. In the real-world Climb task, the baseline method achieved only a 10% success rate, whereas our method attained a 100% success rate. These results indicate that incorporating Task-Invariant Properties into Dreamer learning offers a novel solution for achieving robust and transferable robot locomotion.
|
| |
| 15:00-16:30, Paper ThI2I.273 | Add to My Program |
| Deep Reinforcement Learning for Reach-Avoid-Stay Problems |
|
| Chenevert, Gabriel | North Carolina State University |
| Li, Jingqi | Berkeley |
| Kannan, Achyuta | North Carolina State University |
| Bae, Sangjae | Honda Research Institute, USA |
| Lee, Donggun | North Carolina State University |
Keywords: Optimization and Optimal Control, Machine Learning for Robot Control, Deep Learning Methods
Abstract: Reach-Avoid-Stay (RAS) tasks are essential in applications where systems must safely reach a target set and remain within it under all bounded disturbances. Existing approaches either struggle to compute the maximal robust RAS set—the set of all states from which the RAS task is achievable—or are limited in handling general dynamic systems. To address these challenges, this paper proposes a two-step deep reinforcement learning framework that jointly learns the maximal robust RAS set and the corresponding control policy. The first step identifies the maximal robust control-invariant set within the target set and derives a policy that ensures the system remains within it. The second step computes the maximal robust reach-avoid (RA) set using this invariant set as the target, and it is proven that this RA set is equivalent to the maximal robust RAS set. Leveraging this result, a switching policy is constructed from the two step-wise policies, which constitutes a valid policy guaranteeing completion of the RAS task. Simulation results demonstrate that the proposed framework (1) computes the exact maximal robust RAS set in the absence of training errors, yielding the least restrictive RAS policy, and (2) identifies the RAS set with high accuracy while outperforming baseline methods on RAS tasks.
|
| |
| 15:00-16:30, Paper ThI2I.274 | Add to My Program |
| Higher Order Reasoning for Collaborative Communicationless Mobile Robot Operations |
|
| Reasoner, Jonathan | University of Virginia |
| Bezzo, Nicola | University of Virginia |
Keywords: Cooperating Robots, Path Planning for Multiple Mobile Robots or Agents, Distributed Robot Systems
Abstract: In communicationless environments, multi-robot systems must operate without the constant information exchange that many coordination strategies typically assume. This paper presents a novel dynamic epistemic planning framework that enables implicit coordination and long horizon planning through higher-order reasoning among robots. With our approach, robots form and propagate higher-order belief particles, update world beliefs using Bayesian inference, and select actions via a behavior tree that anticipates teammates’ likely decisions. A temporally aware Model Predictive Path Integral (MPPI) controller integrates this reasoning into low-level execution, allowing robots to plan intercepts and adapt trajectories under partial observability. The proposed framework is evaluated in both simulations and physical experiments, where it consistently reduces task completion time compared to a first-order baseline, demonstrating that epistemic logic can serve as a robust foundation for resilient coordination in communication-restricted domains.
|
| |
| 15:00-16:30, Paper ThI2I.275 | Add to My Program |
| Direct Contact-Tolerant Motion Planning with Vision Language Models |
|
| Li, He | University of Macau |
| Sun, Jian | University of Macau |
| Li, Chengyang | The University of Hong Kong |
| Li, Guoliang | University of Macau |
| Ruan, Qiyu | University of Macau |
| Wang, Shuai | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
| Xu, Chengzhong | University of Macau |
|
|
| |
| 15:00-16:30, Paper ThI2I.276 | Add to My Program |
| Learning Dexterous Manipulation Skills from Imperfect Simulations |
|
| Hsieh, Elvis | UC Berkeley |
| Hsieh, Wen-Han | UC Berkeley |
| Wang, Yen-Jen | UC Berkeley |
| Lin, Toru | UC Berkeley |
| Malik, Jitendra | UC Berkeley |
| Sreenath, Koushil | UC Berkeley |
| Qi, Haozhi | UC Berkeley |
Keywords: Dexterous Manipulation, Force and Tactile Sensing, Imitation Learning
Abstract: Reinforcement learning and sim-to-real transfer have made significant progress in dexterous manipulation. However, progress remains limited by the difficulty of simulating complex contact dynamics and multisensory signals, especially tactile feedback. In this work, we propose DexScrew, a sim-to-real framework that addresses these limitations and demonstrates its effectiveness on nut-bolt fastening and screwdriving with multi-fingered hands. The framework has three stages. First, we train reinforcement learning policies in simulation using simplified object models that lead to the emergence of correct finger gaits. We then use the learned policy as a skill primitive within a teleoperation system to collect real-world demonstrations that contain tactile and proprioceptive information. Finally, we train a behavior cloning policy that incorporates tactile sensing and show that it generalizes to nuts and screwdrivers with diverse geometries. Experiments across both tasks show high task progress ratios compared to direct sim-to-real transfer and robust performance even on unseen object shapes and under external perturbations. Videos and code are available on dexscrew.github.io
|
| |
| 15:00-16:30, Paper ThI2I.278 | Add to My Program |
| Robust Robot Navigation through Failure-Aversion Learning |
|
| Yu, Zhifeng | Xi'an Jiaotong University |
| Li, Xuyang | Xi'an Jiaotong University |
| Fang, Jianwu | Xi'an Jiaotong University |
| Li, Guangliang | Ocean University of China |
| Xue, Jianru | Xi'an Jiaotong University |
Keywords: Reinforcement Learning, Learning from Experience, Integrated Planning and Control
Abstract: Autonomous navigation in complex dynamic environments remains a fundamental challenge in robotics, and many reinforcement learning (RL) algorithms have demonstrated promising results, especially on-policy ones. However, their inherent sample inefficiency remains a fundamental problem to be solved. Methods integrating off-policy approaches into on-policy frameworks have been proposed to improve sample efficiency by focusing on imitating the agent's past exemplary experiences while discarding less optimal ones. However, these methods overlook the valuable insights embedded within failures. Although some research has begun to explore learning from failures, it is usually done at a point-by-point level, ignoring the rich sequential context inherent in the trajectory. In this paper, we introduce DFPS-Nav, a training framework that utilizes Failure-Aversion Learning (FAL) to perform segmented, trend-based credit assignment, identifying both failure-inducing actions and valuable recovery behaviors within failed trajectories. We further improve successful imitation by adopting Prioritized Self-Imitation Learning (PSIL), which scores trajectories and prioritizes high-quality behaviors so that successful behaviors are reliably reproduced. Extensive simulation and real-world experiments demonstrate that, by using both FAL and PSIL to extract and refine information from the sequential context within trajectories, DFPS-Nav achieves up to 29.5% and 27% higher success rates in static and dynamic environments compared to a strong baseline method and is successfully applied in the real world. This work underscores how systematically deconstructing failures while prioritizing successes leads to more efficient and robust autonomous navigation.
|
| |
| 15:00-16:30, Paper ThI2I.279 | Add to My Program |
| Think Fast: Real-Time Kinodynamic Belief Space Planning for Projectile Interception |
|
| Olin, Gabriel | Carnegie Mellon University |
| Chen, Lu | Northeastern University |
| Gandotra, Nayesha | Carnegie Mellon University |
| Likhachev, Maxim | Carnegie Mellon University |
| Choset, Howie | Carnegie Mellon University |
Keywords: Reactive and Sensor-Based Planning, Planning under Uncertainty, Motion and Path Planning
Abstract: Intercepting fast-moving objects, by its very nature, is challenging because of its tight time constraints. This problem becomes further complicated in the presence of sensor noise because noisy sensors provide, at best, incomplete information, which results in a distribution over target states to be intercepted. Since time is of the essence, to hit the target, the planner must begin directing the interceptor, in this case a robot arm, while still receiving information. We introduce a tree-like structure, grown using kinodynamic motion primitives in state-time space. This structure encodes reachability to multiple goals from a single origin, while enabling real-time value updates as the target belief evolves and seamless transitions between goals. We evaluate our framework on an interception task with a 6-DOF industrial arm (ABB IRB-1600) with an onboard stereo camera (ZED 2i). A Robust Innovation-based Adaptive Estimation Adaptive Kalman Filter (RIAE-AKF) is used to track the target and perform belief updates.
|
| |
| 15:00-16:30, Paper ThI2I.280 | Add to My Program |
| ECAHD: Efficient Collision-Aware Hierarchical Diffusion Navigation |
|
| Pahk, Jinu | Seoul National University |
| Kim, Theo Taeyeong | Seoul National University |
| Lee, Jun Ki | Seoul National University |
| Zhang, Byoung-Tak | Seoul National University |
Keywords: Deep Learning Methods, Imitation Learning, Integrated Planning and Learning
Abstract: In this work, we propose Efficient Collision-Aware Hierarchical Diffusion Navigation (ECAHD), a hierarchical diffusion-based framework designed for both safety and computational efficiency. ECAHD generates a sparse trajectory for global path planning and a dense trajectory for local path refinement. The robot follows a rapidly sampled sparse global trajectory, and when a potential collision is detected, a collision-aware guidance diffusion mechanism, which accounts for the robot's shape, adjusts the local trajectory accordingly. Conventional full-sequence diffusion planners suffer from slow sampling speeds and performance degradation when collision-aware guidance is applied across the entire trajectory. ECAHD addresses these issues by significantly reducing the number of waypoints predicted by the global diffusion planner, while delegating robot-shape-aware collision guidance to the local diffusion planner. This separation not only accelerates planning but also preserves global trajectory quality, as goal-conditioned sampling is no longer disrupted by collision-related constraints. Furthermore, ECAHD allows for increasing the number of global trajectory samples to enhance performance, without incurring substantial computational overhead. In maze2d-large planning tests, ECAHD improved success rates by approximately 1.3% while reducing collision rates by more than 50%, all while cutting inference time by nearly half.
|
| |
| 15:00-16:30, Paper ThI2I.281 | Add to My Program |
| X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations |
|
| Pace, Maximus | Cornell University |
| Dan, Prithwish | Cornell University |
| Ning, Chuanruo | Cornell University |
| Bhardwaj, Atiksh | Cornell University |
| Du, Moyun | Cornell University |
| Duan, Edward | Cornell University |
| Ma, Wei-Chiu | Cornell University |
| Kedia, Kushal | Cornell University |
Keywords: Imitation Learning, Learning from Demonstration, Transfer Learning
Abstract: Human videos are a scalable source of training data for robot learning. However, humans and robots significantly differ in embodiment, making many human actions infeasible for direct execution on a robot. Still, these demonstrations convey rich object-interaction cues and task intent. Our goal is to learn from this coarse guidance without transferring embodiment-specific, infeasible execution strategies. Recent advances in generative modeling tackle a related problem of learning from low-quality data. In particular, Ambient Diffusion is a recent method for diffusion modeling that incorporates low-quality data only at high-noise timesteps of the forward diffusion process. Our key insight is to view human actions as noisy counterparts of robot actions. As noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved. Based on these observations, we present X-Diffusion, a cross-embodiment learning framework based on Ambient Diffusion that selectively trains diffusion policies on noised human actions. This enables effective use of easy-to-collect human videos without sacrificing robot feasibility. Across five real-world manipulation tasks, we show that X-Diffusion improves average success rates by 16% over naive co-training and manual data filtering.
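The timestep-gating idea the abstract borrows from Ambient Diffusion can be illustrated in a few lines: human actions, viewed as noisy robot actions, only supervise the high-noise part of the forward process, while robot data supervises all timesteps. The threshold and DDPM schedule below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def noised(a0, t):
    """Forward diffusion: sample a_t ~ q(a_t | a_0)."""
    eps = rng.standard_normal(a0.shape)
    return np.sqrt(alpha_bar[t]) * a0 + np.sqrt(1 - alpha_bar[t]) * eps, eps

def training_timestep(is_human, t_min_human=600):
    """Human (cross-embodiment) actions are treated as noisy robot actions,
    so they are only used at high-noise timesteps where embodiment-specific
    differences have faded. t_min_human is an assumed threshold."""
    lo = t_min_human if is_human else 0
    return rng.integers(lo, T)

robot_action = np.array([0.2, -0.1, 0.05])
human_action = np.array([0.25, -0.3, 0.20])   # embodiment-mismatched

for is_human, a0 in [(False, robot_action), (True, human_action)]:
    t = training_timestep(is_human)
    a_t, eps = noised(a0, t)
    # a denoiser would be trained to predict eps from (a_t, t, observation)
    print("human" if is_human else "robot", "sampled t =", t)
```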
|
| |
| 15:00-16:30, Paper ThI2I.282 | Add to My Program |
| CG-THWM: Curriculum-Guided Temporal Haptic World Modeling for Peg-In-Hole Tasks |
|
| Zhong, Xinli | University of Chinese Academy of Sciences, Beihang University |
| Han, Feng | Beihang University |
| Xu, Manya | University of Chinese Academy of Sciences |
| Li, Mu | Beihang University |
| Zhang, Daqiang | Tongji University |
| Niu, Jianwei | Beihang University |
Keywords: Assembly, Reinforcement Learning, Manipulation Planning
Abstract: Fine-tolerance peg-in-hole manipulation demands high precision under contact-rich, nonsmooth dynamics, where irregular geometries, inclinations, and tight-clearance interference often cause model-free reinforcement learning (RL) to fail. We propose the Curriculum-Guided Temporal Haptic World Model (CG-THWM), which couples a world model with temporal haptic information and trains it via a staged curriculum. The world model supports efficient long-horizon planning with value estimation, while temporal haptic signals expose critical contact events; the curriculum stabilizes training and improves generalization. To enable rigorous evaluation, we construct a dataset for complex insertions that covers irregular, inclined, and interference-rich settings. In simulation, CG-THWM attains a 100% success rate on standard baselines and a 70% mean success rate in scenarios where conventional RL fails. These results highlight CG-THWM's potential for industrial and service applications.
|
| |
| 15:00-16:30, Paper ThI2I.283 | Add to My Program |
| Cross-Embodiment Transfer Via Behavior-Aligned Representations |
|
| Sridhar, Ajay | University of California, Berkeley |
| Gao, Jensen | Stanford University |
| Yang, Jonathan | Stanford University |
| Mercat, Jean | 1991 |
| Belkhale, Suneel | Stanford University |
| Sadigh, Dorsa | Stanford University |
Keywords: Transfer Learning, Big Data in Robotics and Automation, Imitation Learning
Abstract: Recent progress in large-scale imitation learning for robot manipulation has been driven by leveraging datasets across a wide range of robot embodiments. However, achieving significant cross-embodiment transfer is often still challenging. In this work, we study the role of using behavior-aligned representations (e.g., object bounding boxes, language motions, end-effector traces of robot motion) in vision-language-action (VLA) models to promote cross-embodiment transfer. We hypothesize that by possessing invariances across embodiments while being predictive of robot actions, these representations can help unify large-scale cross-embodiment data to enhance transfer. To assess our hypothesis, we develop a simulation-based benchmark designed to evaluate transfer from diverse cross-embodiment data to new embodiments. Using this benchmark, we compare different representations and ways of incorporating them. We find that end-effector traces can be particularly beneficial for transfer, that representations are generally more useful with larger prior datasets, and that they can be used to benefit from action-free data. We also demonstrate that they can enhance sim-to-real cross-embodiment transfer, improving task completion progress of real robot policies pre-trained on simulation data by 28%. We provide videos of our evaluations at our website https://ajaysridhar.com/barx.github.io/.
|
| |
| 15:00-16:30, Paper ThI2I.284 | Add to My Program |
| UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene Via Rendering Fusion |
|
| Wu, Ye | University of Chinese Academy of Sciences |
| Song, Ruiqi | Tongji University |
| Ding, Baiyong | Waytous |
| Zeng, Nanxing | University of Chinese Academy of Sciences |
| Cheng, JunJie | University of Chinese Academy of Sciences |
| Ai, Yunfeng | University of Chinese Academy of Sciences |
Keywords: Mining Robotics, Field Robots
Abstract: Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.
|
| |
| 15:00-16:30, Paper ThI2I.285 | Add to My Program |
| Deep Reinforcement Learning for Hip Exoskeleton Control Via Predictive Simulation of Reflex-Based Human Gait |
|
| Barati, Hossein | Gyeongsang National University |
| Kim, Sangdo | Korea Institute of Science and Technology |
| Nguyen, Thanh Xuan | Gyeongsang National University |
| Lee, Jongwon | Korea Institute of Science and Technology |
| Park, Young Jin | Gyeongsang National University (GNU) |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Reinforcement Learning
Abstract: Lower-limb exoskeletons have the potential to enhance mobility and reduce the metabolic cost of walking, but conventional control strategies often lack adaptability and require labor-intensive tuning. Recent advances in reinforcement learning (RL) provide new opportunities for generating efficient and personalized assistance. In this study, we propose a predictive simulation framework that integrates a reflex-based musculoskeletal walking model with a hip exoskeleton controller trained using Proximal Policy Optimization (PPO) with a Long Short-Term Memory (LSTM) actor network. The reflex-based model reproduces realistic gait kinematics without relying on experimental motion data, while the LSTM-PPO controller learns to map kinematic states directly to assistive torques. Domain randomization was applied during training to enhance robustness and facilitate sim-to-real transfer. The learned controller was deployed onto a physical hip exoskeleton and evaluated in human subject experiments. Results showed that the LSTM-PPO controller reduced the metabolic cost of walking by an average of 9.1%. These findings highlight the potential of predictive simulation and deep RL for developing intelligent, experiment-free exoskeleton controllers that improve walking efficiency and robustness in real-world conditions.
|
| |
| 15:00-16:30, Paper ThI2I.286 | Add to My Program |
| Pocket-SLAM: Rendering-Area-Aware Pruning for Memory-Efficient 3DGS-SLAM |
|
| Li, Leshu | University of Minnesota, Twin Cities |
| Peng, Jie | University of Science and Technology of China |
| Zhao, Yang | University of Minnesota, Twin Cities |
Keywords: SLAM, Visual-Inertial SLAM, Intelligent Transportation Systems
Abstract: 3D Gaussian Splatting (3DGS) has garnered significant attention in Simultaneous Localization and Mapping (SLAM) due to its advances in capturing fine-grained geometry features and synthesizing novel views. For SLAM in large-scale scenes, such as autonomous driving, 3DGS-SLAM faces a critical limitation. The memory consumption increases continuously over time as Gaussian points accumulate, leading to poor memory efficiency and limiting its applicability. In this work, we propose a rendering-area–aware pruning strategy that selectively removes Gaussians based on their contribution to the effective rendering area, rather than solely relying on Gaussian-level heuristics (e.g., opacity or gradient magnitude). This perspective directly targets the sources of memory redundancy, effectively reducing the peak memory footprint of 3DGS-SLAM during runtime. Evaluations on the EuRoC and KITTI datasets demonstrate that our method consistently outperforms existing pruning approaches in large-scale outdoor scenes, achieving over 60% memory reduction and more than 2× FPS improvement while preserving localization and mapping accuracy. These results highlight rendering-area–aware pruning as a promising direction for scaling 3DGS-SLAM to real-world autonomous driving scenarios. Our code is publicly available at https://github.com/UMN-ZhaoLab/Pocket-SLAM.git.
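A hedged sketch of what a rendering-area-aware pruning criterion could look like, in contrast to opacity-only heuristics; the score formula and keep ratio are assumptions, not the paper's exact metric:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000  # Gaussians in the map

# Illustrative per-Gaussian statistics; in a real 3DGS-SLAM system these
# would be accumulated during rasterization over recent keyframes.
opacity = rng.uniform(0.0, 1.0, N)
pixel_footprint = rng.exponential(20.0, N)   # projected area in pixels
times_visible = rng.integers(0, 50, N)

# Assumed rendering-area-aware score:
# contribution ~ opacity * projected area * visibility frequency.
score = opacity * pixel_footprint * times_visible

def prune(scores, keep_ratio=0.4):
    """Keep the Gaussians contributing most to the effective rendering area."""
    k = int(len(scores) * keep_ratio)
    return np.argsort(scores)[-k:]

keep = prune(score)
print(f"kept {len(keep)} / {N} Gaussians "
      f"({100 * (1 - len(keep) / N):.0f}% memory reduction)")
```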
|
| |
| 15:00-16:30, Paper ThI2I.287 | Add to My Program |
| Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning |
|
| Dal Fabbro, Nicolo | University of Pennsylvania |
| Mesbahi, Milad | University of Pennsylvania |
| Mendes, Renato | LSTS - +ATLANTIC |
| Sousa, João | Universidade Porto - Faculdade Engenharia |
| Pappas, George J. | University of Pennsylvania |
Keywords: Control Architectures and Programming, Energy and Environment-Aware Automation, Marine Robotics
Abstract: We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the representative Douro river use-case. We propose an energy- and communication-efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents improving both mean squared error (MSE) and operational endurance. In some instances, doubling the number of AUVs more than doubles endurance while maintaining or improving accuracy, underscoring the benefits of coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
|
| |
| 15:00-16:30, Paper ThI2I.288 | Add to My Program |
| Real-To-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions |
|
| Zhang, Kaifeng | Columbia University |
| Sha, Shuo | Columbia University |
| Jiang, Hanxiao | Columbia University |
| Loper, Matthew | Scenix.ai |
| Song, Hyunjong | Scenix |
| Cai, Guangyan | SceniX |
| Xu, Zhuo | Google Deepmind |
| Xiaochen, Hu | SceniX |
| Zheng, Changxi | Columbia University |
| Li, Yunzhu | Columbia University |
Keywords: Perception for Grasping and Manipulation, Simulation and Animation, Imitation Learning
Abstract: Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies. Website: https://real2sim-eval.github.io/
|
| |
| 15:00-16:30, Paper ThI2I.289 | Add to My Program |
| Measurement and Potential Field-Based Patient Modelling for Model-Mediated Tele-Ultrasound |
|
| Yeung, Ryan S. | University of British Columbia |
| Black, David G. | University of British Columbia |
| Salcudean, Septimiu E. | University of British Columbia |
Keywords: Medical Robots and Systems, Telerobotics and Teleoperation, Haptics and Haptic Interfaces
Abstract: Teleoperated ultrasound can improve diagnostic medical imaging access for remote communities. Having accurate force feedback is important for enabling sonographers to apply the appropriate probe contact force to optimize ultrasound image quality. However, large time delays in communication make direct force feedback impractical. Prior work investigated using point cloud-based model-mediated teleoperation and internal potential field models to estimate contact forces and torques. We expand on this by introducing a method to update the internal potential field model of the patient with measured positions, forces and torques for more transparent model-mediated tele-ultrasound. We first generate a point cloud model of the patient's surface and transmit this to the sonographer in a compact data structure. This is converted to a static voxelized volume where each voxel contains a potential field value. These values determine the forces and torques, which are rendered based on overlap between the voxelized volume and a point shell model of the ultrasound transducer. We solve for the potential field using a convex quadratic that combines the spatial Laplace operator with measured forces and torques. This was evaluated on volunteers (n=4) by assessing the accuracy of rendered forces and torques. Results showed the addition of measurements to the model reduced the force magnitude RMSE by an average of 7.42 N, the force vector angle error by an average of 3.71°, and the torque vector angle error by an average of 64.0° compared to using only Laplace's equation.
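The convex quadratic described above can be illustrated with a 1-D toy problem: solve for a potential field that is smooth under the discrete Laplace operator while matching sparse force measurements. The grid size, measurement locations, and weight are assumptions:

```python
import numpy as np

# 1-D toy version of the convex quadratic in the abstract: find potential
# values p on a grid that are smooth (discrete Laplacian ~ 0) while
# matching sparse force measurements.
n = 50
L = np.zeros((n - 2, n))                 # discrete Laplace operator
for i in range(n - 2):
    L[i, i:i + 3] = [1.0, -2.0, 1.0]

meas_idx = np.array([10, 25, 40])        # voxels where forces were measured
meas_val = np.array([0.5, 2.0, 1.2])     # measured force magnitudes
A = np.zeros((len(meas_idx), n))
A[np.arange(len(meas_idx)), meas_idx] = 1.0

lam = 100.0                              # assumed weight on the data term
# minimize ||L p||^2 + lam * ||A p - f||^2  (convex, closed-form solution)
H = L.T @ L + lam * A.T @ A
b = lam * A.T @ meas_val
p = np.linalg.solve(H, b)

print(p[meas_idx])   # close to the measured values, smooth in between
```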
|
| |
| 15:00-16:30, Paper ThI2I.290 | Add to My Program |
| Accelerating Trajectory Optimization by Exploiting B-Spline Gradient Structure |
|
| Doiron, Nikos | Université De Moncton |
| Duquette, Thomas | Université De Moncton |
| Tchane Djogdom, Gilde Vanel | Université De Moncton |
| Gallant, Andre | Université De Moncton |
Keywords: Optimization and Optimal Control, Constrained Motion Planning, Collision Avoidance
Abstract: This work presents a discrete-time trajectory optimization framework that achieves near real-time performance for robotic manipulators. This is achieved by drastically speeding up constraint gradient computations. The approach leverages the analytical and structural properties of B-splines to introduce three key speedups: exploiting gradient sparsity from local control, using a hybrid-analytical method to replace most finite differences with closed-form derivatives, and aggregating constraints per knot-span to reduce the problem size. Validated on a simulated UR5e across 64 tasks in a cluttered workspace, these cumulative speedups reduce computation time by up to 96.3% (a 26.9x speedup) relative to a finite-difference baseline, without compromising trajectory quality, success rate, or fidelity to kinematic, dynamic, and collision constraints.
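The gradient sparsity the abstract exploits follows from the local support of B-spline basis functions: a point on a degree-d curve depends on only d+1 control points, so per-sample constraint gradients are banded. A small sketch of the resulting Jacobian sparsity pattern (knot vector and sampling are illustrative):

```python
import numpy as np

degree = 3
n_ctrl = 12
n_samples = 40

# Uniform clamped knot vector
n_knots = n_ctrl + degree + 1
knots = np.concatenate([np.zeros(degree),
                        np.linspace(0, 1, n_knots - 2 * degree),
                        np.ones(degree)])

def active_ctrl_points(u):
    """Indices of the (degree+1) control points with nonzero basis at u."""
    span = np.searchsorted(knots, u, side='right') - 1
    span = min(max(span, degree), n_ctrl - 1)
    return range(span - degree, span + 1)

# Build the Jacobian sparsity pattern for per-sample constraints.
pattern = np.zeros((n_samples, n_ctrl), dtype=bool)
for i, u in enumerate(np.linspace(0, 1, n_samples)):
    pattern[i, list(active_ctrl_points(u))] = True

dense = n_samples * n_ctrl
print(f"nonzeros: {pattern.sum()} / {dense} "
      f"({100 * pattern.sum() / dense:.0f}% of a dense Jacobian)")
```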
|
| |
| 15:00-16:30, Paper ThI2I.291 | Add to My Program |
| Design and Control of Modular Magnetic Millirobots for Multimodal Locomotion and Shape Reconfiguration |
|
| Garcia Oyono, Erik | Imperial College London |
| Lin, Jialin | Imperial College London |
| Zhang, Dandan | Imperial College London |
Keywords: Cellular and Modular Robots, Micro/Nano Robots, Medical Robots and Systems
Abstract: Modular small-scale robots offer the potential for on-demand assembly and disassembly, enabling task-specific adaptation in dynamic and constrained environments. However, existing modular magnetic platforms often depend on workspace collisions for reconfiguration, employ bulky three-dimensional electromagnetic systems, and lack robust single-module control, which limits their applicability in biomedical settings. In this work, we present a modular magnetic millirobotic platform comprising three cube-shaped modules with embedded permanent magnets, each designed for a distinct functional role: a free module that supports self-assembly and reconfiguration, a fixed module that enables flip-and-walk locomotion, and a gripper module for cargo manipulation. Locomotion and reconfiguration are actuated by programmable combinations of time-varying two-dimensional uniform and gradient magnetic field inputs. Experiments demonstrate closed-loop navigation using real-time vision feedback and A* path planning, establishing robust single-module control capabilities. Beyond locomotion, the system achieves self-assembly, multimodal transformations, and disassembly at low field strengths. Chain-to-gripper transformations succeeded in 90% of trials, while chain-to-square transformations were less consistent, underscoring the role of module geometry in reconfiguration reliability. These results establish a versatile modular robotic platform capable of multimodal behavior and robust control, suggesting a promising pathway toward scalable and adaptive task execution in confined environments.
|
| |
| 15:00-16:30, Paper ThI2I.292 | Add to My Program |
| Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration |
|
| He, Xingyi | Zhejiang University |
| Polavaram, Adhitya | Cornell University |
| Cao, Yunhao | Cornell University |
| Deshmukh, Om | Cornell University |
| Wang, Tianrui | Cornell University |
| Zhou, Xiaowei | Zhejiang University |
| Fang, Kuan | Cornell University |
Keywords: Dexterous Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local–global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
|
| |
| 15:00-16:30, Paper ThI2I.293 | Add to My Program |
| JetsonCompletion: Real-Time Depth Completion on Resource-Constrained Edge Devices |
|
| Wang, Kailin | Nanjing Normal University |
| Zhu, Xiaozhou | Chinese Academy of Military Science |
| Yang, Benyi | Chinese Academy of Military Science |
| Zhang, Tian | Chinese Academy of Military Science |
| Zhang, Haoxin | Sun Yat-Sen University |
| Xie, Fei | Nanjing Normal University |
| Li, Shuaixin | Chinese Academy of Military Science |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, RGB-D Perception
Abstract: Depth completion from sparse LiDAR points and images is a key perception task for autonomous robots, enabling dense 3D understanding in challenging environments. However, most recent research achieves accuracy gains by greatly enlarging network size, making such models unsuitable for real-time deployment on power- and compute-constrained platforms. This paper proposes an ultra-lightweight depth completion framework optimized for embedded systems. Our approach integrates a re-parameterized encoder–decoder with fewer than 5M parameters and a two-stage hybrid distillation strategy. The first stage progressively densifies sparse depth supervision, while the second preserves edge fidelity through a combination of metric and structural losses. A full TensorRT FP16 pipeline further ensures efficient deployment. Extensive experiments on KITTI Depth Completion and NYU-v2 demonstrate that our method achieves competitive accuracy while maintaining high efficiency. On a Jetson Xavier NX, the system runs at over 30 FPS with sub-33 ms latency within a 20 W power envelope, showing strong potential for real-world micro-robotic platforms. We will open-source the code to benefit the community. Our open-source website: https://github.com/2463450186Q/JetsonCompletion.git
|
| |
| 15:00-16:30, Paper ThI2I.294 | Add to My Program |
| Contact-Driven Localization in a Freeform Robotic Self-Assembled Structure |
|
| Rashidioun, Mohammadali | New Jersey Institute of Technology |
| Sosa, Michael | New Jersey Institute of Technology |
| Swissler, Petras | New Jersey Institute of Technology |
Keywords: Localization, Swarm Robotics, Multi-Robot Systems
Abstract: Accurate localization remains a key challenge in swarm robotics, particularly for self-reconfigurable systems that must identify relative positions to form diverse structures. Most existing approaches rely on external tracking infrastructure or high-cost sensors, which limit scalability and deployment in unstructured environments. In this paper, we propose a novel contact-driven localization method for modular robots that leverages only local communication through binary contact information (whether two robots are physically connected or not). To exploit these contact cues, we introduce a virtual-force framework in which robots iteratively refine their poses—attracting toward dock-connected neighbors and repelling from non-connected ones. The method requires no external infrastructure and relies only on minimal onboard sensing. Simulations show effective localization during the assembly of towers and cantilevers, enabling accurate, scalable, free-form self-assembly.
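A minimal 2-D sketch of the virtual-force iteration described above: each module springs toward dock-connected neighbors at the expected unit spacing and repels overlapping non-connected ones. The gains, module spacing, anchoring choice, and iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

true_pos = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
connected = {(0, 1), (1, 2), (1, 3)}     # docked pairs (unit spacing)
est = true_pos + rng.normal(0, 0.3, true_pos.shape)   # noisy initial guesses
est[0] = true_pos[0]                     # one anchor module fixes the frame

for _ in range(500):
    force = np.zeros_like(est)
    for i in range(len(est)):
        for j in range(len(est)):
            if i == j:
                continue
            d = est[j] - est[i]
            dist = np.linalg.norm(d) + 1e-9
            if (min(i, j), max(i, j)) in connected:
                force[i] += 0.1 * (dist - 1.0) * d / dist  # spring to unit gap
            elif dist < 1.0:
                force[i] -= 0.1 * (1.0 - dist) * d / dist  # repel if overlapping
    force[0] = 0.0                       # keep the anchor fixed
    est += force

print(np.round(est, 2))  # positions consistent with the contact graph
```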
|
| |
| 15:00-16:30, Paper ThI2I.295 | Add to My Program |
| A Contact-Driven Framework for Manipulating in the Blind |
|
| Saleem, Muhammad Suhail | Carnegie Mellon University |
| Yuan, Lai | Carnegie Mellon University |
| Likhachev, Maxim | Carnegie Mellon University |
Keywords: Manipulation Planning, Motion and Path Planning
Abstract: Robots often face manipulation tasks in environments where vision is inadequate due to clutter, occlusions, or poor lighting—for example, reaching a shutoff valve at the back of a sink cabinet or locating a light switch above a crowded shelf. In such settings, robots, much like humans, must rely on contact feedback to distinguish free from occupied space and navigate around obstacles. Many of these environments often exhibit strong structural priors—for instance, pipes often span across sink cabinets—that can be exploited to anticipate unseen structure and avoid unnecessary collisions. We present a theoretically complete and empirically efficient framework for manipulation in the blind that integrates contact feedback with structural priors to enable robust operation in unknown environments. The framework comprises three tightly coupled components: (i) a contact detection and localization module that utilizes joint torque sensing with a contact particle filter to detect and localize contacts, (ii) an occupancy estimation module that uses the history of contact observations to build a partial occupancy map of the workspace and extrapolate it into unexplored regions with learned predictors, and (iii) a planning module that accounts for the fact that contact localization estimates and occupancy predictions can be noisy, computing paths that avoid collisions and complete tasks efficiently without eliminating feasible solutions. We evaluate the system in simulation and in the real world on a UR10e manipulator across two domestic tasks—(i) manipulating a valve under a kitchen sink surrounded by pipes and (ii) retrieving a target object from a cluttered shelf. Results show that the framework reliably solves these tasks, achieving up to a 2x reduction in task completion time compared to baselines, with ablations confirming the contribution of each module.
|
| |
| 15:00-16:30, Paper ThI2I.296 | Add to My Program |
| TransDexNet: An End-To-End Motion Retargeting Network with Transformer for Dexterous Hand Teleoperation from RGB Images |
|
| Tan, Jiaying | Sun Yat-Sen University |
| Gao, Qing | Sun Yat-Sen University |
| Lai, Yuanchuan | Sun Yat-Sen University |
Keywords: Telerobotics and Teleoperation, Dexterous Manipulation
Abstract: Dexterous hand teleoperation is becoming increasingly common, yet existing methods rarely provide both efficiency and convenience. The core challenge is to achieve motion retargeting from the human hand to a dexterous hand. To address this, we introduce TransDexNet, an end-to-end vision-based motion retargeting architecture for dexterous hands. Equipped with a Vision Transformer backbone, it takes a single RGB image of a human hand and directly regresses the joint angles of a dexterous hand without any intermediate pose estimation. The architecture employs dual branches bridged by an alignment layer to close the gaps in degrees of freedom (DoFs), geometry, and kinematics between the human and dexterous hands, enabling domain-invariant latent features. To train TransDexNet, we built a dataset named TransDexData, consisting of 91,000 RGB images of human hands paired with the corresponding dexterous hand RGB images and joint angles. In evaluation, the proposed network achieves an average joint angle error of 0.076 rad. Both simulation and real-world experiments demonstrate accurate and efficient performance.
|
| |
| 15:00-16:30, Paper ThI2I.297 | Add to My Program |
| SA-MPPI: Sensitivity-Aware Model Predictive Path Integral Control for Robust and Agile Quadrotor Flight |
|
| Gu, Fuqiang | Chongqing University |
| Lu, Xu | Chongqing University |
| Liu, Huidong | College of Computer Science, Chongqing University |
| Ai, Jiangshan | Chongqing University |
| Long, Xianlei | Chongqing University |
| Jiang, Tao | Chongqing University |
| Chen, Chao | Chongqing University |
| Huang, Zhao | University of Aberdeen |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Robust/Adaptive Control
Abstract: Reliable quadrotor control in dynamic environments remains challenging due to external disturbances and internal uncertainties. While Model Predictive Path Integral (MPPI) control enables agile maneuvers through sampling-based optimization, its performance often degrades under such unmodeled uncertainties, leading to brittle and unsafe behavior. To address this, we propose SA-MPPI, a robust MPC framework that integrates asynchronous guidance with a novel open-loop sensitivity metric. The asynchronous module leverages a slower auxiliary controller to generate an informed sampling distribution, improving convergence without introducing latency. The sensitivity metric penalizes high-variance trajectories under sampled disturbances via nested Monte Carlo rollouts, embedding robustness directly into the optimization. Extensive simulations and real-world quadrotor experiments demonstrate that SA-MPPI outperforms adaptive baselines, reducing tracking errors by up to 47% under significant wind disturbances while achieving over 2× higher computational efficiency. These results highlight SA-MPPI’s ability to deliver low-latency, safe, and predictable control in uncertain, dynamic environments.
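The nested Monte Carlo sensitivity penalty can be sketched on a toy system: each sampled control sequence is re-rolled under several disturbance draws, and the cost variance is added to its MPPI score. The dynamics, gains, and penalty weight below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def rollout(x0, u_seq, w):
    """Toy 1-D double integrator with an additive disturbance w."""
    x = np.array(x0, dtype=float)
    cost = 0.0
    for u in u_seq:
        x[0] += 0.1 * x[1]
        x[1] += 0.1 * (u + w)
        cost += x[0] ** 2 + 0.01 * u ** 2
    return cost

def sa_mppi_weights(x0, candidate_controls, lam=1.0, beta=0.5, n_dist=8):
    """MPPI-style weights with an assumed sensitivity penalty: each control
    sequence is re-rolled under sampled disturbances (nested Monte Carlo),
    and the cost *variance* across disturbances is penalized by beta."""
    scores = []
    for u_seq in candidate_controls:
        costs = [rollout(x0, u_seq, w) for w in rng.normal(0, 0.5, n_dist)]
        scores.append(np.mean(costs) + beta * np.var(costs))
    s = np.array(scores)
    w = np.exp(-(s - s.min()) / lam)
    return w / w.sum()

x0 = [1.0, 0.0]
candidates = [rng.normal(0, 1.0, 20) for _ in range(16)]
weights = sa_mppi_weights(x0, candidates)
u_star = sum(w * u for w, u in zip(weights, candidates))  # weighted control
print("best sample weight:", weights.max().round(3))
```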
|
| |
| 15:00-16:30, Paper ThI2I.298 | Add to My Program |
| A Two-Stage Framework for Ego-Centric Key Object Identification Via Object State Prediction |
|
| Ling, Shihong | University of Pittsburgh |
| Wan, Yue | University of Pittsburgh |
| Jia, Xiaowei | University of Pittsburgh |
| Du, Na | University of Pittsburgh |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: This paper presents a novel framework designed to enhance key object identification in autonomous driving. Existing methods primarily focus on either detecting objects independently or leveraging visual relationships, but they do not explicitly consider the ego vehicle's perspective in determining object importance. To address this gap, we propose a structured approach that integrates a virtual ego-vehicle representation and a modular object state predictor, enabling a more accurate estimation of object behaviors relative to the ego-vehicle. Subsequently, our framework employs spatial-temporal reasoning to refine key object identification, prioritizing objects based on their states and relative spatial information rather than relying solely on visual relationships. Experimental results on real-world driving datasets demonstrate the effectiveness of our approach in accurately detecting critical objects in complex traffic environments.
|
| |
| 15:00-16:30, Paper ThI2I.299 | Add to My Program |
| EnhanceERASOR: Two-Stage Static 3D Point Cloud Mapping in Dynamic Scenes |
|
| Yu, Shuyang | Zhejiang University |
| Wu, Yi | Zhejiang University |
| Guan, Xiaoqing | Zhejiang University |
| Jin, Song | Zhejiang University |
| Liu, Haoxiang | Zhejiang University |
| Wang, You | Zhejiang University |
| Li, Guang | Zhejiang University |
Keywords: Mapping, Range Sensing
Abstract: A clean map of the surrounding environment is essential for autonomous driving systems to ensure reliable localization and safe path planning. However, the existence of dynamic objects introduces ghost traces into the map, significantly degrading its quality. To address this issue, we propose EnhanceERASOR, a two-stage framework for static 3D point cloud mapping, consisting of a lightweight Online-ERASOR stage for real-time static mapping and an Offline-Refinement stage for global optimization. The Online-ERASOR stage utilizes the egocentric ratio of pseudo occupancy between consecutive scans to identify dynamic points, followed by verification and post-processing strategies to suppress false positives and false negatives. The Offline-Refinement stage introduces a submap-to-map consistency check to suppress semi-dynamic and slow-moving objects, and adopts a voxel-guided strategy for dense static mapping. Extensive experiments on diverse datasets with different scenarios and sensors demonstrate the superior performance, robustness, and generalization ability of our proposed method in static map construction.
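An ERASOR-style pseudo-occupancy test, which the abstract builds on, can be illustrated per bin: if the current scan occupies far less height than the accumulated map suggests, the bin likely held a moving object. The threshold is an assumed value, not from the paper:

```python
import numpy as np

def pseudo_occupancy(points_z):
    """ERASOR-style pseudo occupancy of one polar bin: its height span."""
    return points_z.max() - points_z.min() if len(points_z) else 0.0

def is_dynamic_bin(map_z, scan_z, ratio_thresh=0.2):
    """Flag a bin whose current scan occupies far less height than the
    accumulated map; its map points are candidates for removal."""
    h_map, h_scan = pseudo_occupancy(map_z), pseudo_occupancy(scan_z)
    if h_map < 1e-6:
        return False
    return (h_scan / h_map) < ratio_thresh

# A car (z up to ~1.5 m) passed through during mapping but is gone now:
map_bin = np.array([0.0, 0.1, 0.9, 1.4, 1.5])   # ghost trace in the map
scan_bin = np.array([0.0, 0.05, 0.1])           # only ground remains
print(is_dynamic_bin(map_bin, scan_bin))        # True -> erase ghost points
```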
|
| |
| 15:00-16:30, Paper ThI2I.300 | Add to My Program |
| GeoFIK: A Fast and Reliable Geometric Solver for the IK of the Franka Arm Based on Screw Theory Enabling Multiple Redundancy Parameters |
|
| Lopez-Custodio, Pablo | Nottingham Trent University |
| Gong, Yuhe | University of Nottingham |
| Figueredo, Luis | University of Nottingham (UoN) |
Keywords: Kinematics, Redundant Robots, Manipulation Planning
Abstract: Modern robotics applications require an inverse kinematics (IK) solver that is fast, robust and consistent, and that provides all possible solutions. Currently, the Franka robot arm is the most widely used manipulator in robotics research. With 7 DOFs, the IK of this robot is not only complex due to its redundancy, but also due to the link offsets at the wrist and elbow. Due to this complexity, none of the Franka IK solvers available in the literature provide satisfactory results when used in real-world applications. Therefore, in this paper we introduce GeoFIK (Geometric Franka IK), an analytical IK solver that allows the use of different joint variables to resolve the redundancy. The approach uses screw theory to describe the entire geometry of the robot, computing the Jacobian matrix prior to the joint angles. All singularities are handled. As an example of how the geometric elements obtained by the IK can be exploited, a solver with the swivel angle as the free variable is provided. Several experiments are carried out to validate the speed, robustness, and reliability of GeoFIK against three state-of-the-art solvers.
|
| |
| 15:00-16:30, Paper ThI2I.301 | Add to My Program |
| OccLLaMA: A Unified Occupancy-Language-Action World Model for Enhancing Motion Planning Via Multi-Task Learning |
|
| Wei, Julong | Fudan University |
| Yuan, Shanshuai | Fudan University |
| Li, Pengfei | Institute for AI Industry Research (AIR), Tsinghua University |
| Quan, Xinyi | Fudan University |
| Tai, Lei | Hong Kong University of Science and Technology |
| Zhao, Jieru | Shanghai Jiao Tong University |
| Gan, Zhongxue | Fudan University |
| Ding, Wenchao | Fudan University |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Autonomous Vehicle Navigation
Abstract: Scene understanding via multi-modal large language models and scene forecasting with world models have advanced the development of autonomous driving. The former maps visual inputs to driving-specific outputs, neglecting spatial reasoning and world dynamics. The latter captures world dynamics, lacking comprehensive scene understanding. In contrast, human drivers seamlessly integrate understanding, forecasting, and decision-making through multi-modal representations. To this end, we propose OccLLaMA, a unified occupancy-language-action world model to enhance motion planning via multi-task learning. It uses semantic occupancy as a unified 3D visual representation, effectively integrating spatial scene understanding and forecasting. Specifically, we first introduce a tailored scene tokenizer that auto-encodes semantic occupancy into latent tokens for invertible compression. Furthermore, we enhance LLaMA to enable joint learning across both understanding and generation tasks within a unified auto-regressive framework, incorporating multi-task pretraining and motion-planning–oriented fine-tuning. Extensive experiments demonstrate that OccLLaMA not only achieves competitive performance on scene understanding and occupancy forecasting, but also enhances motion planning by integrating multi-task inference, showcasing its effectiveness and potential as a foundation model for autonomous driving. Project page: https://vilonge.github.io/OccLLaMA_Page/
|
| |
| 15:00-16:30, Paper ThI2I.302 | Add to My Program |
| Active Dynamic Load Adaptation for Quadruped Locomotion on Complex Terrain |
|
| Xiao, Yimin | College of Informatics, Huazhong Agricultural University |
| Li, Dianzhong | College of Informatics, Huazhong Agricultural University |
| Huang, Wangjun | College of Informatics, Huazhong Agricultural University |
| Sha, Ying | College of Informatics, Huazhong Agricultural University |
| Qin, Li | College of Informatics, Huazhong Agricultural University |
Keywords: Legged Robots, Reinforcement Learning
Abstract: Quadruped robots show strong potential for load-carrying tasks due to their terrain adaptability, and a unique challenge of these tasks is maintaining quadrupedal stability when the load has active and dynamic characteristics. Their mass and center of mass change dynamically, rather than being integrated as a whole-body component of the quadruped. Unlike traditional load-carrying tasks, where the load is typically passive and its influence on the robot's movement is predictable and static, active dynamic loads can actively alter the robot's balance control in real-time, posing load disturbances to locomotion. These load disturbances, when combined with the fundamental attitude changes induced by complex terrain, create dual dynamic disturbances for the robot. To address these dual disturbances, we propose an active dynamic load modeling approach that captures the active and dynamic characteristics of the load, enabling the robot to adapt to real-time changes in load movement. This approach is integrated into a Reinforcement Learning (RL) framework that leverages dynamic models: an Inverse Dynamic Model (IDM) that learns the dynamic characteristics of the active load, and a Forward Dynamic Model (FDM) that predicts the effects of complex terrain on the robot's motion, enabling synchronous adaptation to both types of dynamic disturbances. Extensive comparative simulations and physical experiments across diverse terrains, with active dynamic loads of varying movements, demonstrate the effectiveness of our method in enhancing balance control and adaptability.
|
| |
| 15:00-16:30, Paper ThI2I.303 | Add to My Program |
| Switch: Learning Agile Skills Switching for Humanoid Robots |
|
| Lau, Yuen-Fui | The Hong Kong University of Science and Technology |
| Zhao, Qihan | The Hong Kong University of Science and Technology |
| Wang, Yinhuai | HKUST |
| Yu, Runyi | Hong Kong University of Science and Technology |
| Tsui, Hok Wai | University |
| Chen, Qifeng | HKUST |
| Tan, Ping | Simon Fraser University |
Keywords: Humanoid and Bipedal Locomotion, Humanoid Robot Systems, Legged Robots
Abstract: Recent advancements in whole-body control through deep reinforcement learning have enabled humanoid robots to achieve remarkable progress in challenging real-world locomotion skills. However, existing approaches often struggle with flexible transitions between distinct skills, creating safety concerns and practical limitations. To address this challenge, we introduce a hierarchical multi-skill system, Switch, enabling seamless skill transitions at any moment. Our approach comprises three key components: (1) a Skill Graph (SG) that establishes potential cross-skill transitions based on kinematic similarity within multi-skill motion data, (2) a whole-body tracking policy trained on this skill graph through deep reinforcement learning, and (3) an online skill scheduler that drives the tracking policy for robust skill execution and smooth transitions. For skill switching or significant tracking deviations, the scheduler performs an online graph search to find the optimal feasible path, which ensures efficient, stable, and real-time execution of diverse locomotion skills. Comprehensive experiments demonstrate that Switch empowers humanoids to execute agile skill transitions with high success rates while maintaining strong motion imitation performance.
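The scheduler's online graph search can be sketched as a shortest-path query over the Skill Graph, with edge weights standing in for kinematic similarity. The graph, skill names, and costs below are illustrative assumptions:

```python
import heapq

# Illustrative skill graph: edges connect skills with feasible transitions,
# weighted by an assumed kinematic-similarity cost.
skill_graph = {
    "walk":   [("run", 1.0), ("stand", 0.5)],
    "run":    [("walk", 1.0), ("jump", 2.0)],
    "stand":  [("walk", 0.5), ("crouch", 0.8)],
    "crouch": [("stand", 0.8), ("jump", 1.5)],
    "jump":   [("run", 2.0), ("crouch", 1.5)],
}

def plan_transition(start, goal):
    """Dijkstra over transition edges; returns a cheapest skill sequence."""
    pq, seen = [(0.0, start, [start])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in skill_graph[node]:
            if nxt not in seen:
                heapq.heappush(pq, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

print(plan_transition("stand", "jump"))  # e.g. stand -> crouch -> jump
```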
|
| |
| 15:00-16:30, Paper ThI2I.304 | Add to My Program |
| GaussTwin: Unified Simulation and Correction with Gaussian Splatting for Robotic Digital Twins |
|
| Cai, Yichen | Technische Universität Darmstadt |
| Jansonnie, Paul | Technische Universität Darmstadt |
| de Farias, Cristiana | Technische Universität Darmstadt |
| Arenz, Oleg | Technische Universität Darmstadt |
| Peters, Jan | Technische Universität Darmstadt |
Keywords: Visual Tracking, RGB-D Perception, Contact Modeling
Abstract: Digital twins promise to enhance robotic manipulation by maintaining a consistent link between real-world perception and simulation. However, most existing systems struggle with the lack of a unified model, complex dynamic interactions, and the real-to-sim gap, which limits downstream applications such as model predictive control. Thus, we propose GaussTwin, a real-time digital twin that combines position-based dynamics with discrete Cosserat rod formulations for physically grounded simulation, and Gaussian splatting for efficient rendering and visual correction. By anchoring Gaussians to physical primitives and enforcing coherent SE(3) updates driven by photometric error and segmentation masks, GaussTwin achieves stable prediction–correction while preserving physical fidelity. Through experiments in both simulation and on a Franka Research 3 platform, we show that GaussTwin consistently improves tracking accuracy and robustness compared to shape-matching and rigid-only baselines, while also enabling downstream tasks such as push-based planning. These results highlight GaussTwin as a step toward unified, physically meaningful digital twins that can support closed-loop robotic interaction and learning.
|
| |
| 15:00-16:30, Paper ThI2I.305 | Add to My Program |
| Tilt-X: Enabling Compliant Aerial Manipulation through a Tiltable-Extensible Continuum Manipulator |
|
| Uthayasooriyan, Anuraj | Queensland University of Technology |
| Digumarti, Krishna Manaswi | Queensland University of Technology |
| Breward, Jack Ronald | Queensland University of Technology |
| Vanegas Alvarez, Fernando | Queensland University of Technology |
| Galvez-Serna, Julian | Research Engineer, Queensland University of Technology (QUT) |
| Gonzalez, Felipe | Queensland University of Technology |
Keywords: Mechanism Design, Aerial Systems: Mechanics and Control, Soft Robot Applications
Abstract: Aerial manipulators extend the reach and manipulation capabilities of uncrewed multirotor aerial vehicles for inspection, agriculture, sampling, and delivery. Continuum arm aerial manipulation systems offer lightweight, dexterous, and compliant interaction opportunities. Existing designs allow manipulation only below the UAV, which restricts their deployability in multiple directions and through clutter. They are also sensitive to propeller downwash. Addressing these limitations, we present Tilt-X, a continuum arm aerial manipulator that integrates a tilting mechanism, a telescopic stage, and a cable-driven continuum section. We present its design and kinematic model and validate it through flight demonstrations. Tilt-X enables a volumetric workspace with up to 75 mm extension and planar orientations between 0° and 90°. Experiments comparing end effector pose with and without downwash quantitatively measure its accuracy, providing critical evidence to guide the design and control of reliable aerial manipulators. Results show stabilisation of end effector pose as the manipulator extends out of the propeller influence zone.
|
| |
| 15:00-16:30, Paper ThI2I.306 | Add to My Program |
| LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios |
|
| Li, Zeyi | Chinese Academy of Sciences |
| Yang, Yushi | Chinese Academy of Sciences |
| Xie, Shawn | Chinese Academy of Sciences |
| Xu, Jingkai | Institute of Automation, Chinese Academy of Sciences, Beijing |
| Chen, Tianxing | The University of Hong Kong |
| Wang, Yuran | Peking University |
| Shen, Zhenhao | Peking University |
| Shen, Yan | Peking University |
| Chen, Yue | Peking University |
| Li, Wenjun | The University of Hong Kong |
| Zheng, Yukun | The University of Hong Kong |
| Zhang, Chaorui | Lightwheel |
| Lin, Siyi | LightWheel |
| Teng, Fei | Lightwheel |
| Yang, Hongjun | Institute of Automation Chinese Academy of Sciences |
| Chen, Ming | Lightwheel |
| Xie, Steve | Lightwheel |
| Wu, Ruihai | Peking University |
Keywords: Simulation and Animation, Deep Learning in Grasping and Manipulation, Domestic Robotics
Abstract: Household environments present one of the most common, impactful yet challenging application domains for robotics. Within household scenarios, manipulating deformable objects is particularly difficult, both in simulation and real-world execution, due to varied categories and shapes, complex dynamics, and diverse material properties, as well as the lack of reliable deformable-object support in existing simulations. We introduce LeHome, a comprehensive simulation environment designed for deformable object manipulation in household scenarios. LeHome covers a wide spectrum of deformable objects, such as garments and food items, offering high-fidelity dynamics and realistic interactions that existing simulators struggle to simulate accurately. Moreover, LeHome supports multiple robotic embodiments and emphasizes low-cost robots as a core focus, enabling end-to-end evaluation of household tasks on resource-constrained hardware. By bridging the gap between realistic deformable object simulation and practical robotic platforms, LeHome provides a scalable testbed for advancing household robotics. Webpage: lehome-web.github.io/.
|
| |
| 15:00-16:30, Paper ThI2I.307 | Add to My Program |
| HiCrowd: Hierarchical Crowd Flow Alignment for Dense Human Environments |
|
| Zhu, Yufei | Örebro University |
| Yang, Shih-Min | Örebro University |
| Magnusson, Martin | Örebro University |
| Wang, Allan | Miraikan |
Keywords: Human-Aware Motion Planning
Abstract: Navigating through dense human crowds remains a significant challenge for mobile robots. A key issue is the freezing robot problem, where the robot struggles to find safe motions and becomes stuck within the crowd. To address this, we propose HiCrowd, a hierarchical framework that integrates reinforcement learning (RL) with model predictive control (MPC). HiCrowd leverages surrounding pedestrian motion as guidance, enabling the robot to align with compatible crowd flows. A high-level RL policy generates a follow point to align the robot with a suitable pedestrian group, while a low-level MPC safely tracks this guidance with short-horizon planning. The method combines long-term, crowd-aware decision making with safe short-term execution. We evaluate HiCrowd against reactive and learning-based baselines in an offline setting (replaying recorded human trajectories) and an online setting (where human trajectories are updated to react to the robot in simulation). Experiments on a real-world dataset and a synthetic crowd dataset show that our method outperforms the baselines in navigation efficiency and safety, while reducing freezing behaviors. We further validate HiCrowd through real-world deployment in a public museum and at Expo 2025 Osaka, where it navigates dense pedestrian flows without retraining, demonstrating robust and socially aware behavior. Our results suggest that leveraging human motion as guidance, rather than treating humans solely as dynamic obstacles, provides a powerful principle for safe and efficient robot navigation in crowds.
|
| |
| 15:00-16:30, Paper ThI2I.308 | Add to My Program |
| Automatic Physically-Based Sim2Real for Tactile Images through Differentiable Path-Tracing Rendering |
|
| Duret, Guillaume | Ecole Centrale De Lyon |
| Samsonenko, Anna | Ecole Centrale De Lyon |
| Zara, Florence | Université Lyon 1 |
| Peters, Jan | Technische Universität Darmstadt |
| Chen, Liming | Ecole Centrale De Lyon |
Keywords: Force and Tactile Sensing, Simulation and Animation, Performance Evaluation and Benchmarking
Abstract: High-fidelity simulation of vision-based tactile sensors is essential for developing data-driven robotic manipulation algorithms. However, a significant sim-to-real gap persists due to the difficulty in modeling complex optical effects, such as refraction through protective glass layers, and in accurately estimating physical parameters like sensor pose and lighting. To bridge this gap, we introduce a novel, fully differentiable pipeline for visual tactile simulation. Leveraging a differentiable path tracer, our method optimizes critical parameters—including camera pose, lighting conditions, and object texture—directly from just three real images. This approach achieves highly realistic simulations with physically accurate light transport and glass refraction. We validate our method through a comprehensive benchmark against real-world data, demonstrating state-of-the-art sim-to-real accuracy. We also enable novel applications, such as mesh reconstruction from a single tactile image via inverse rendering. To overcome the computational cost of path tracing, we further use an image-to-image translation model. This model uses high-fidelity simulated data alongside Normalized Object Coordinate Space (NOCS) maps as input, preserving crucial deformation information while enabling rapid inference. The code is available on https://tacdiffrend.github.io.
|
| |
| 15:00-16:30, Paper ThI2I.309 | Add to My Program |
| T-NACD: A Tactile-Friendly Novel Anomaly Class Discovery Framework for Black Rubber Products |
|
| Xiao, Long | Institute of Automation,Chinese Academy of Sciences |
| Lyu, Kailin | Institute of Automation, Chinese Academy of Sciences |
| Zeng, Jianing | Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences |
| Liu, Xuexin | Institute of automation, Chinese academy of science, Chinese Academy of Sciences |
| Zou, Zhuojun | Institute of Automation, Chinese Academy of Sciences, Beijing, China |
| Shu, Lin | Institute of Automation, Chinese Academy of Sciences |
| Hao, Jie | Institute of Automation, Chinese Academy of Sciences |
|
|
| |
| 15:00-16:30, Paper ThI2I.310 | Add to My Program |
| FW-NKF: Frequency-Weighted Neural Kalman Filters |
|
| Dogan, Adnan Harun | ETH Zurich |
| Demirel, Berken Utku | ETH Zurich |
| Holz, Christian | ETH Zürich |
Keywords: Sensor Fusion, Model Learning for Control, AI-Based Methods
Abstract: Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that both frequency weighting and deep latent-state modeling contribute to overall performance.
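A rough sketch of the spectral-shaping idea on a scalar filter: the innovation is passed through a causal band-stop filter before the Kalman update, attenuating an assumed 50 Hz vibration. In FW-NKF the weighting and transition model are learned; here they are fixed assumptions:

```python
import numpy as np
from scipy.signal import butter, lfilter, lfilter_zi

fs = 1000.0
# Causal band-stop filter around the assumed 50 Hz disturbance band.
b, a = butter(2, [45 / (fs / 2), 55 / (fs / 2)], btype='bandstop')
zi = lfilter_zi(b, a) * 0.0

x, P = 0.0, 1.0      # scalar state estimate and variance
q, r = 1e-4, 0.05    # assumed process / measurement noise
t = np.arange(2000) / fs
truth = np.sin(2 * np.pi * 0.5 * t)                  # slow true signal
meas = truth + 0.5 * np.sin(2 * np.pi * 50 * t) \
             + np.random.default_rng(4).normal(0, 0.05, t.size)

est = []
for z in meas:
    P += q                                   # predict (identity dynamics)
    resid = z - x
    resid_f, zi = lfilter(b, a, [resid], zi=zi)  # causal spectral shaping
    K = P / (P + r)
    x += K * resid_f[0]                      # update with shaped residual
    P *= (1 - K)
    est.append(x)

rmse = np.sqrt(np.mean((np.array(est)[500:] - truth[500:]) ** 2))
print(f"RMSE after warm-up: {rmse:.3f}")
```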
|
| |
| 15:00-16:30, Paper ThI2I.311 | Add to My Program |
| Diffusion Stabilizer Policy for Automated Surgical Robot Manipulations |
|
| Ho, Chon-Lam | Shanghai Jiao Tong University |
| Hu, Jianshu | Shanghai Jiao Tong University |
| Song, Lei | The Chinese University of Hong Kong |
| Wang, Hesheng | Shanghai Jiao Tong University |
| Dou, Qi | The Chinese University of Hong Kong |
| Ban, Yutong | Shanghai Jiao Tong University |
Keywords: Surgical Robotics: Planning, Surgical Robotics: Laparoscopy, Learning from Demonstration
Abstract: Intelligent surgical robots have the potential to revolutionize clinical practice by enabling more precise and automated surgical procedures. However, the automation of such robots for surgical tasks remains under-explored compared to recent advancements in solving household manipulation tasks. These successes have been largely driven by (1) advanced models, such as transformers and diffusion models, and (2) large-scale data utilization. Aiming to extend these successes to the domain of surgical robotics, we propose a diffusion-based policy learning framework, called the Diffusion Stabilizer Policy, which enables training with imperfect, perturbed or even failed trajectories. Our approach consists of two stages: first, we train the diffusion stabilizer policy using only clean data. Then, the policy is continuously updated using a mixture of clean and perturbed data, with filtering based on the prediction error on actions. Comprehensive experiments conducted in both simulation and the real world demonstrate the superior performance of our method under different types of perturbations. Code will be released upon acceptance.
|
| |
| 15:00-16:30, Paper ThI2I.312 | Add to My Program |
| An Adaptive Inspection Planning Approach towards Routine Monitoring in Uncertain Environments |
|
| Kottayam Viswanathan, Vignesh | Luleå University of Technology |
| Bai, Yifan | Luleå University of Technology |
| Fredriksson, Scott | Luleå University of Technology |
| Satpute, Gajanan Sumeet | Luleå University of Technology |
| Kanellakis, Christoforos | LTU |
| Nikolakopoulos, George | Luleå University of Technology |
Keywords: Autonomous Agents, Mining Robotics, Field Robots
Abstract: In this work, we present a hierarchical framework designed to support robotic inspection under environment uncertainty. By leveraging a known environment model, existing methods plan and safely track inspection routes to visit points of interest. However, discrepancies between the model and actual site conditions, caused by either natural or human activities, can alter the surface morphology or introduce path obstructions. To address this challenge, the proposed framework divides the inspection task into: (a) generating an initial global view-plan for regions of interest based on a historical map and (b) local view replanning to adapt to the current morphology of the inspection scene. The proposed hierarchy preserves global coverage objectives while enabling reactive adaptation to the local surface morphology. This enables the local autonomy to remain robust against environment uncertainty and complete the inspection tasks. We validate the approach through deployments in real-world subterranean mines using a quadrupedal robot. Supplementary media highlighting the proposed method can be found at https://youtu.be/6TxK8S_83Lw.
|
| |
| 15:00-16:30, Paper ThI2I.313 | Add to My Program |
| Semantic and Terrain-Aware Trajectory Optimization for Uniform Coverage in Obstacle-Laden Environments |
|
| Fan, Zexuan | Midea Group |
| Yang, Hengye | Midea Group |
| Zhou, Sunchun | Midea Group |
| Cai, Junyi | Midea Group Co., Ltd |
| Sun, Tao | Midea Group |
| Liu, Chang | Peking University |
Keywords: Motion and Path Planning, Collision Avoidance, Robotics and Automation in Agriculture and Forestry
Abstract: Achieving efficient and uniform coverage in obstacle-laden unknown environments is essential for autonomous robots in cleaning, inspection and agricultural operations. Unlike most existing approaches that prioritize path length and time optimality, we propose the SHIFT planner framework, which integrates semantic mapping, adaptive coverage planning, and real-time obstacle avoidance to ensure comprehensive coverage across diverse terrains and semantic features. We first develop an innovative Radiant-Field-Informed Coverage Planning (RFICP) algorithm, which generates trajectories that adapt to terrain variations. A Gaussian diffusion field is employed to adaptively adjust the robot’s speed, ensuring efficient coverage under varying environmental conditions influenced by semantic attributes. Next, we present a novel incremental KD-tree sliding window optimization (IKD-SWOpt) method to effectively handle unforeseen obstacles. IKD-SWOpt leverages an enhanced A* algorithm guided by the IKD-tree distance field to generate initial local avoidance trajectories. Subsequently, it optimizes trajectory segments within and outside waypoint safety zones by evaluating and refining non-compliant segments using an adaptive sliding window. This method not only reduces computational overhead but also guarantees high-quality real-time obstacle avoidance. Extensive experiments were conducted using drones in simulated environments and robotic vacuum cleaners in real-world settings.
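The speed-adaptation idea behind RFICP can be sketched with an assumed Gaussian field over semantically important points: the robot slows down where the field is strong, increasing dwell time there. The kernel form, source points, and speed limits are assumptions:

```python
import numpy as np

def gaussian_field(p, sources, sigma=1.0):
    """Sum of Gaussian kernels centered on semantically important points
    (e.g., heavily soiled regions). Illustrative form, not the paper's."""
    d2 = np.sum((sources - p) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def coverage_speed(p, sources, v_min=0.1, v_max=1.0):
    """Slow down where the field is strong so dwell time (and cleaning
    effort) increases there; move fast over low-importance regions."""
    f = gaussian_field(p, sources)
    return v_max - (v_max - v_min) * min(f, 1.0)

dirty_spots = np.array([[2.0, 2.0], [5.0, 1.0]])
for p in [np.array([2.1, 2.0]), np.array([8.0, 8.0])]:
    print(p, "->", round(coverage_speed(p, dirty_spots), 2), "m/s")
```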
|
| |
| 15:00-16:30, Paper ThI2I.314 | Add to My Program |
| ConTact: Contrastive Tactile Alignment for Sim-To-Real Robotic Manipulation |
|
| Lai, Yanlin | Tencent, Tsinghua University |
| Dong, Yinzhao | Tencent |
| Yuan, Chun | Tsinghua University |
| Zhou, Cheng | Tencent |
Keywords: Reinforcement Learning, Force and Tactile Sensing, Machine Learning for Robot Control
Abstract: Deep reinforcement learning (DRL) has achieved remarkable success in robot control. However, DRL with tactile feedback still faces challenges in contact-rich tasks involving visual occlusion or high-speed dynamics. The challenges stem from two primary sources. First, the complexity and diversity of real-world tactile sensors make them difficult to simulate and transfer to reality. Second, existing high-fidelity simulators are often too computationally intensive for large-scale DRL, forcing a trade-off between accuracy and speed. To address this, we design a high-speed tactile simulation model for tactile arrays, enabling efficient, large-scale DRL training on GPUs. We then propose the Contrastive Tactile (ConTact) framework, which leverages contrastive learning to align tactile features for sim-to-real transfer. ConTact employs a dedicated spatiotemporal encoder that explicitly models temporal changes to capture the dynamic features of contact events. We then validate it on two kinds of manipulation tasks, Single and Composite Object Tracking (SOT/COT), which rely solely on tactile information and proprioception. Moreover, policies trained with ConTact from simulation are directly deployed in the real world without finetuning, achieving zero-shot transfer.
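Contrastive alignment of simulated and real tactile features is typically built on an InfoNCE-style objective; the minimal PyTorch sketch below shows that generic pattern, assuming paired sim/real embeddings of the same contact event. It is not the exact ConTact loss, which additionally uses a dedicated spatiotemporal encoder.

    import torch
    import torch.nn.functional as F

    def alignment_loss(sim_emb, real_emb, temperature=0.07):
        """InfoNCE-style loss aligning simulated and real tactile embeddings.

        sim_emb, real_emb: (N, D) batches where row i of each tensor comes
        from the same contact event (the positive pair).
        """
        sim_emb = F.normalize(sim_emb, dim=1)
        real_emb = F.normalize(real_emb, dim=1)
        logits = sim_emb @ real_emb.t() / temperature  # (N, N) similarities
        targets = torch.arange(sim_emb.size(0), device=sim_emb.device)
        # Matching sim/real pairs are positives; all other pairs are negatives.
        return F.cross_entropy(logits, targets)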
|
| |
| 15:00-16:30, Paper ThI2I.315 | Add to My Program |
| TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics |
|
| Han, Yi | Beihang University |
| Zhou, Enshen | Beihang University |
| Rong, Shanyu | Peking University |
| An, Jingkun | Beihang University |
| Wang, Pengwei | Beijing Academy of Artificial Intelligence |
| Wang, Zhongyuan | Beijing Academy of Artificial Intelligence |
| Chi, Cheng | Beijing Academy of Artificial Intelligence |
| Sheng, Lu | Beihang University |
| Zhang, Shanghang | Peking University |
Keywords: Visual Learning, Deep Learning for Visual Perception
Abstract: Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative assessments and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric information from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR, a comprehensive tool-invocation–oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), TIGeR achieves state-of-the-art performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
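A representative example of the exact computation such a tool-calling model can delegate to code, rather than estimate perceptually, is back-projecting a pixel with metric depth into the robot base frame. The sketch below assumes known intrinsics K and a camera-to-base transform; it is a generic geometry routine, not code drawn from the TIGeR dataset.

    import numpy as np

    def pixel_to_base(u, v, depth, K, T_base_cam):
        """Back-project a pixel with metric depth into the robot base frame.

        K: 3x3 camera intrinsic matrix; T_base_cam: 4x4 camera-to-base
        homogeneous transform; depth in metres.
        """
        fx, fy = K[0, 0], K[1, 1]
        cx, cy = K[0, 2], K[1, 2]
        # Pinhole back-projection into the camera frame (homogeneous point).
        p_cam = np.array([(u - cx) * depth / fx,
                          (v - cy) * depth / fy,
                          depth, 1.0])
        return (T_base_cam @ p_cam)[:3]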
|
| |
| 15:00-16:30, Paper ThI2I.316 | Add to My Program |
| MASt3R-Nav: WayPixel Navigation in Relative 3D Maps |
|
| Garg, Vansh | International Institute of Information Technology, Hyderabad |
| Jayanti, Rohit | Robotics Research Center, IIIT Hyderabad |
| Pandya, Krish | International Institute of Information Technology - Hyderabad |
| Chittawar, Sarthak | International Institute of Information Technology, Hyderabad |
| Tourani, Siddharth | IIIT Hyderabad |
| Khan, Muhammad Haris | Mohamed Bin Zayed University of Artificial Intelligence |
| Garg, Sourav | International Institute of Information Technology Hyderabad (IIITH) |
| Krishna, Madhava | IIIT Hyderabad |
|
|
| |
| 15:00-16:30, Paper ThI2I.317 | Add to My Program |
| Learning Motion Skills with Adaptive Assistive Curriculum Force in Humanoid Robots |
|
| Cao, Zhanxiang | Shanghai Jiao Tong University |
| Zhang, Yang | Shanghai Jiao Tong University |
| Nie, Buqing | Shanghai Jiao Tong University |
| Lin, Huangxuan | Shanghai Jiao Tong University |
| Li, Haoyang | Shanghai Jiao Tong University |
| Chen, Yizhi | Tongji University |
| Yang, Xiaokang | Shanghai Jiao Tong University |
| Gao, Yue | Shanghai JiaoTong University |
Keywords: Humanoid and Bipedal Locomotion, Whole-Body Motion Planning and Control, Reinforcement Learning
Abstract: Learning policies for complex humanoid tasks remains both challenging and compelling. Inspired by how infants and athletes rely on external support—such as parental walkers or coach-applied guidance—to acquire skills like walking, dancing, and performing acrobatic flips, we propose A2CF: Adaptive Assistive Curriculum Force for humanoid motion learning. A2CF trains a dual-agent system, in which a dedicated assistive force agent applies state-dependent forces to guide the robot through difficult initial motions and gradually reduces assistance as the robot's proficiency improves. Across three benchmarks—bipedal walking, choreographed dancing, and backflips—A2CF achieves convergence 30% faster than baseline methods, lowers failure rates by over 40%, and ultimately produces robust, support-free policies. Real-world experiments further demonstrate that adaptively applied assistive forces significantly accelerate the acquisition of complex skills in high-dimensional robotic control.
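One simple way to realize such a curriculum is to scale the assistive force with a smooth schedule driven by the policy's measured proficiency. The logistic schedule and constants below are purely illustrative stand-ins for the paper's learned, state-dependent assistance.

    import math

    def assistive_force_scale(success_rate, k=8.0, threshold=0.5):
        """Scale assistive force down as the policy's success rate improves.

        Returns ~1.0 (full assistance) when success_rate << threshold and
        ~0.0 (support-free) when success_rate >> threshold.
        """
        return 1.0 / (1.0 + math.exp(k * (success_rate - threshold)))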
|
| |
| 15:00-16:30, Paper ThI2I.318 | Add to My Program |
| Open-Vocabulary Spatio-Temporal Scene Graph for Robot Perception and Teleoperation Planning |
|
| Wang, Yi | South China University of Technology |
| Xue, Zeyu | South China University of Technology(SCUT) |
| Liu, Mujie | South China University of Technology |
| Zhang, Tongqin | South China University of Technology |
| Hu, Yan | Chinese Academy of Science |
| Zhao, Zhou | Central China Normal University |
| Yang, Chenguang | Hong Kong Polytechnic University |
| Lu, Zhenyu | South China University of Technology |
Keywords: Semantic Scene Understanding, Telerobotics and Teleoperation, RGB-D Perception
Abstract: Teleoperation via natural language reduces operator workload and enhances safety in high-risk or remote settings. However, in dynamic remote scenes, transmission latency during bidirectional communication creates gaps between remote perceived states and operator intent, leading to command misunderstanding and incorrect execution. To mitigate this, we introduce the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. ST-OVSG leverages LVLMs to construct open-vocabulary 3D object representations, and extends them into the temporal domain via Hungarian assignment with our temporal matching cost, yielding a unified spatio-temporal scene graph. A latency tag is embedded to enable LVLM planners to retrospectively query past scene states, thereby resolving local–remote state mismatches caused by transmission delays. To further reduce redundancy and highlight task-relevant cues, we propose a task-oriented subgraph filtering strategy that produces compact inputs for the planner. ST-OVSG generalizes to novel categories and enhances planning robustness against transmission latency without requiring fine-tuning. Experiments show that our method achieves 74% node accuracy on the Replica benchmark, outperforming ConceptGraph. Notably, in the latency-robustness experiment, the LVLM planner assisted by ST-OVSG achieved a planning success rate of 70.5%. We refer the reader to the project page for code and results.
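The temporal association step can be illustrated with a generic Hungarian matching over descriptor similarity; the cosine-distance cost and threshold below are simplifying assumptions, whereas the paper uses a richer temporal matching cost.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_segments(prev_feats, curr_feats, max_cost=0.5):
        """Associate 3D segments across time with the Hungarian algorithm.

        prev_feats, curr_feats: (M, D) and (N, D) unit-norm descriptors
        (e.g., CLIP vectors) of segments in consecutive frames.
        """
        cost = 1.0 - prev_feats @ curr_feats.T      # (M, N) cosine distance
        rows, cols = linear_sum_assignment(cost)    # optimal 1-to-1 matching
        # Keep only confident matches; unmatched segments start new tracks.
        return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]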
|
| |
| 15:00-16:30, Paper ThI2I.319 | Add to My Program |
| Task and Skill Planning: Hierarchical Robot Planning with Black-Box Skills |
|
| Hedegaard, Benned | Brown University |
| Wei, Yichen | Brown University |
| Yang, Ziyi | Brown University |
| Jaafar, Ahmed | Brown University |
| Tellex, Stefanie | Brown |
| Konidaris, George | Brown University |
| Shah, Naman | Ai2 |
Keywords: Task and Motion Planning, Mobile Manipulation, Manipulation Planning
Abstract: Task and motion planning (TAMP) is a well-established approach for solving long-horizon robot planning problems. Although TAMP methods have historically assumed that each task-level robot action, or skill, can be reduced to kinematic motion planning, recent work has explored integrating closed-loop controllers and learned skills into TAMP-style systems. Our approach integrates pre-existing, heterogeneous robot skills--including learned, force-controlled, and black-box policies--into a hierarchical planner while preserving the object-centric failure reasoning of typical TAMP solvers. We leverage Composable Interaction Primitives (CIPs) to synthesize head and tail motion plans bridging consecutive skills, facilitating both planning-time refinement and execution-time adjustment. We validate our Task and Skill Planning (TASP) approach through real-world experiments on a bimanual manipulator and a mobile manipulator, demonstrating that CIPs enable diverse robots to combine heterogeneous skills to solve complex, long-horizon tasks, including multi-room mobile manipulation problems with non-monotonic task structure.
|
| |
| 15:00-16:30, Paper ThI2I.320 | Add to My Program |
| PERAL: Perception-Aware Motion Control for Passive LiDAR Excitation in Spherical Robots |
|
| Yuan, Shenghai | Nanyang Technological University |
| Yee, Jason Wai Hao | Nanyang Technological University |
| Guo, Weixiang | Nanyang Technological University |
| Liu, Zhongyuan | Nanyang Technological University |
| Nguyen, Thien-Minh | The University of Queensland |
| Xie, Lihua | Nanyang Technological University |
Keywords: Search and Rescue Robots, Education Robotics, Energy and Environment-Aware Automation
Abstract: Autonomous mobile robots increasingly rely on LiDAR–IMU odometry for navigation and mapping, yet horizontally mounted LiDARs (e.g., MID360) capture limited near-ground returns, reducing terrain awareness and degrading performance in feature-scarce environments. Prior solutions, such as static tilt, active rotation, or higher-density sensors, either compromise horizontal perception or introduce extra actuation, cost, weight, and power. We introduce PERAL, a perception-aware motion control framework for spherical robots that provides passive LiDAR excitation without dedicated hardware. By modeling the coupling between the internal differential-drive actuation and sensor attitude, PERAL superimposes bounded, non-periodic oscillations onto nominal goal- or trajectory-tracking commands to increase vertical scan diversity while preserving navigation accuracy. Implemented on a compact spherical robot, PERAL is validated in laboratory, corridor, and tactical environments. Experiments show up to 96% map completeness and a 27% reduction in trajectory tracking error (relative to fixed-horizontal baselines), while improving the observability of near-ground targets in the reconstructed map, at lower weight, power, and cost than static tilt and active rotation. Design and code are available at https://github.com/snakehaihai/PERAL_robot_design.
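The core of passive excitation is superimposing a bounded, non-periodic signal on the nominal command. The sketch below sums two incommensurate sinusoids to obtain a quasi-periodic excitation that never exactly repeats; the frequencies and amplitude are illustrative, not the paper's tuned values.

    import numpy as np

    def excited_yaw_rate(t, nominal_rate, amp=0.3):
        """Superimpose a bounded, non-periodic oscillation on the command.

        Two sinusoids with an irrational frequency ratio give a
        quasi-periodic signal, diversifying the LiDAR's vertical scan
        pattern while staying within +/- amp of the nominal rate.
        """
        excitation = 0.5 * np.sin(2.0 * t) + 0.5 * np.sin(np.sqrt(2.0) * 2.0 * t)
        return nominal_rate + amp * excitation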
|
| |
| 15:00-16:30, Paper ThI2I.321 | Add to My Program |
| IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance |
|
| Park, Jongwoo | Stony Brook University |
| Ranasinghe, Kanchana | Stony Brook University |
| Jang, Jinhyeok | ETRI |
| Mata, Cristina | Stony Brook University |
| Jang, Yoo Sung | Stony Brook University |
| Ryoo, Michael S. | Stony Brook University; Salesforce AI |
Keywords: Deep Learning for Visual Perception, Visual Learning, Recognition
Abstract: Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model’s built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3%→97.1%). Code and visualizations are available at: jongwoopark7978.github.io/IVRA
|
| |
| 15:00-16:30, Paper ThI2I.322 | Add to My Program |
| Contrastive Auditory Knowledge Transfer for Tool-Mediated Robot Interaction with Granular Objects |
|
| Liu, Si | TUFTS |
| Huang, Jindan | Tufts University |
| Huan, Zhengyan | Tufts University |
| Hughes, Michael | Tufts University |
| Sinapov, Jivko | Tufts University |
Keywords: Transfer Learning, Robot Audition, Representation Learning
Abstract: Tool-mediated interactions enable robots to manipulate and explore granular objects, producing informative auditory signals. A central challenge is transferring this perceptual knowledge across different tools and behaviors without costly data collection for each new context. We address this problem in the domain of audio-based recognition of granular and liquid-like objects. In this work, we leverage audio signals from tool-mediated interactions and learn context-agnostic representations for object recognition. We propose two contrastive learning approaches: a shared-object transfer method that performs supervised contrastive learning using audio data, and a zero-shot transfer method that integrates both audio and natural language descriptions of interaction contexts. Experiments on real-world data show that both methods achieve strong object recognition performance in unseen contexts, sometimes matching or exceeding a supervised baseline despite limited target context data. Furthermore, the learned latent spaces exhibit clearly separable clusters by object identity, and the zero-shot method successfully recognizes novel objects, offering a practical solution for robot perception in data-scarce scenarios. The code for this paper is available at: https://github.com/siliu6487/AuditoryKnowledgeTransfer.
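The shared-object transfer method builds on supervised contrastive learning; the sketch below shows the standard SupCon objective under the assumption that audio clips sharing an object label are positives regardless of tool or behavior, which is what encourages context-agnostic features. It is the generic loss, not necessarily the paper's exact variant.

    import torch
    import torch.nn.functional as F

    def supcon_loss(emb, labels, tau=0.1):
        """Supervised contrastive loss: same-object clips attract, others repel.

        emb: (N, D) audio embeddings from different tools/behaviors;
        labels: (N,) integer object identities.
        """
        emb = F.normalize(emb, dim=1)
        sim = emb @ emb.t() / tau
        n = emb.size(0)
        eye = torch.eye(n, dtype=torch.bool, device=emb.device)
        pos = (labels[:, None] == labels[None, :]) & ~eye
        # Exclude self-similarity from the normalizing denominator.
        log_prob = sim - torch.logsumexp(sim.masked_fill(eye, -1e9),
                                         dim=1, keepdim=True)
        # Average log-probability over each anchor's positives.
        loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
        return loss.mean()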
|
| |
| 15:00-16:30, Paper ThI2I.323 | Add to My Program |
| Bio-Inspired Liquid Crystal Elastomer Suction Actuator for Intelligent Robotic Grasping |
|
| Gao, Shen | Shanghai University |
| Luo, Yongzheng | Shanghai University |
| Tang, Mingjun | Shanghai University |
| Wang, Yue | Shanghai University |
| Yue, Tao | Shanghai University |
Keywords: Soft Sensors and Actuators, Biomimetics, Grippers and Other End-Effectors
Abstract: Grasping operations constitute a fundamental mechanism for robotic interaction with the environment and task execution, playing a critical role in logistics, unmanned systems, and complex terrain exploration. Conventional rigid grasping devices are often bulky and exhibit limited adaptability and controllability in unstructured environments. Suction-based grippers offer improved environmental compliance but typically require extensive tubing and vacuum pumps, constraining their integration into lightweight and soft robotic platforms. Inspired by octopus suction cups, recent bioinspired designs have leveraged geometrical optimization and flexible materials to enhance adhesion, yet most still rely on external actuation or complex vacuum systems, failing to replicate the rapid, reversible adhesion achieved through muscular contraction. To address this challenge, we present a bioinspired suction actuator based on a liquid crystal elastomer (LCE), exploiting its reversible anisotropic–isotropic phase transition under thermal stimuli to dynamically modulate the cavity volume and generate controllable negative pressure. The proposed design closely emulates octopus muscle mechanics while significantly simplifying structural complexity, achieving a combination of light weight, compliance, and programmability. Experiments demonstrate stable adhesion of 56 kPa on glass over 300 cycles, with rapid and reliable attachment/detachment under varying conditions, highlighting potential applications in climbing robots, aerial grasping, and underwater exploration.
|
| |
| 15:00-16:30, Paper ThI2I.324 | Add to My Program |
| Distributed Multi-Robot Active-Sensing of a Diffusive Source |
|
| Pagano, Francesca | University of Naples Federico II |
| De Carli, Nicola | KTH |
| Restrepo, Esteban | CNRS |
| Marino, Antonio | University of Cambridge |
| Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Keywords: Cooperating Robots, Distributed Robot Systems, Multi-Robot Systems
Abstract: This paper considers the problem of coordinating a group of mobile robots for distributedly estimating the parameters of a diffusion model that generates a time-varying spatial field. We assume that each robot can measure the local concentration of a substance continuously released in the environment and base the proposed distributed estimation strategy on an Extended Information Consensus Filter (E-ICF) with a forgetting factor. We then develop a decentralized online motion strategy aimed at minimizing a Gramian-based information metric that improves the E-ICF convergence. Additional constraints, among them collision avoidance, are integrated as Control Barrier Functions (CBFs) in a Quadratic Program (QP). Finally, we present statistical comparisons against three baselines which show the improved performance of the proposed method in a range of simulated scenarios, and we also report the results of experiments carried out with quadcopters to demonstrate the actual implementability of the approach and its effectiveness in generating online, collision-free, and informative motions.
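In information form, a consensus filter update with forgetting is compact enough to sketch. The snippet below shows one local measurement update plus one consensus iteration; the consensus gain, forgetting factor, and interfaces are illustrative assumptions rather than the paper's exact E-ICF.

    import numpy as np

    def icf_update(Y, y, H, R_inv, z, neighbors, lam=0.98, eps=0.2):
        """One step of an information consensus filter with forgetting.

        Y, y: local information matrix/vector; H, R_inv, z: measurement
        model, inverse noise covariance, and measurement; neighbors:
        list of (Y_j, y_j) pairs received from other robots.
        """
        # Forgetting discounts old information so time-varying parameters
        # can be tracked.
        Y, y = lam * Y, lam * y
        # Local measurement update in information form.
        Y = Y + H.T @ R_inv @ H
        y = y + H.T @ R_inv @ z
        # Consensus iteration toward the neighbors' information.
        for Y_j, y_j in neighbors:
            Y = Y + eps * (Y_j - Y)
            y = y + eps * (y_j - y)
        return Y, y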
|
| |
| 15:00-16:30, Paper ThI2I.325 | Add to My Program |
| Passive Multi-Task Compliance Control with Strict Priority through Energy Tanks |
|
| Tveter, Erling | Norwegian University of Science and Technology (NTNU) |
| Sæbø, Bjørn Kåre | Norwegian University of Science and Technology (NTNU) |
| Ott, Christian | TU Wien |
| Pettersen, Kristin Y. | Norwegian University of Science and Technology |
| Gravdahl, Jan Tommy | Norwegian University of Science and Technology |
Keywords: Compliance and Impedance Control, Redundant Robots
Abstract: A robot with kinematic redundancy with respect to a main task may perform additional tasks simultaneously with the main one. Often, it is desirable to prioritize the performance of some tasks over that of others. To create a strict priority between the different tasks, meaning the performance of higher-prioritized tasks is unaffected by lower-prioritized tasks, null-space projections are often used. Null-space projections may, however, cause the closed-loop system to lose the desirable passivity property, which is necessary to ensure stable interactions with passive environments. In previous works, an energy tank has therefore been introduced to compensate for the potential activity stemming from the null-space projections. However, if the energy tank becomes empty when using these previous methods, the performance of the lower-prioritized tasks suffers more than when using a classical, non-passive hierarchical control scheme. Thus, a new approach to handling this case is proposed in this work. In the event of the energy tank becoming empty and unable to compensate for any null-space projection-induced activity, the hierarchy is ceded to preserve the passivity of the system, leading to better performance of the lower-prioritized tasks compared to previous passivation schemes. Output strict passivity of the closed-loop system is proven irrespective of the amount of energy available from the energy tank, and the performance of the proposed method is validated and compared to that of a classical hierarchical impedance controller and that of an earlier passivation method through simulations and experiments on redundant robotic manipulators.
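The tank logic itself reduces to simple bookkeeping: null-space compensation drains the tank, and an empty tank triggers the fallback. The scalar sketch below encodes only that switching rule; the threshold and the power accounting are illustrative simplifications.

    def tank_update(E, p_nullspace, dt, E_min=0.1):
        """Energy-tank bookkeeping for null-space compensation (sketch).

        p_nullspace: power injected by the null-space terms that would
        violate passivity; it is drawn from the tank while energy remains.
        Returns the updated energy and whether compensation may continue.
        """
        E_next = E - p_nullspace * dt
        if E_next <= E_min:
            # Tank empty: stop compensating and cede the hierarchy, as the
            # paper proposes, instead of sacrificing lower-priority tasks.
            return E, False
        return E_next, True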
|
| |
| 15:00-16:30, Paper ThI2I.326 | Add to My Program |
| Open-Vocabulary Online Semantic Mapping for SLAM |
|
| Berriel Martins, Tomas | University of Zaragoza |
| Oswald, Martin R. | University of Amsterdam |
| Civera, Javier | Universidad De Zaragoza |
Keywords: Semantic Scene Understanding, Mapping, SLAM
Abstract: This paper presents an Open-Vocabulary Online 3D semantic mapping pipeline, which we denote by its acronym OVO. Given a sequence of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors. These are computed by a novel CLIP merging method from the viewpoints where the segments are observed. Notably, OVO has a significantly lower computational and memory footprint than offline baselines, while also showing better segmentation metrics than both offline and online ones. Along with superior segmentation performance, we also show experimental results of our mapping contributions integrated with two different full SLAM backbones (Gaussian-SLAM and ORB-SLAM2), making ours the first system to use a neural network to merge CLIP descriptors and to demonstrate end-to-end open-vocabulary online 3D mapping with loop closure.
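A common baseline for multi-view descriptor fusion is a confidence-weighted running average, sketched below; OVO instead learns the merge with a neural network, so this is only the baseline it improves upon, with illustrative weighting.

    import torch
    import torch.nn.functional as F

    def merge_clip(desc, weight, new_desc, new_weight):
        """Running, weighted fusion of per-segment CLIP vectors (baseline).

        desc, new_desc: unit-norm CLIP embeddings of the same 3D segment
        seen from different viewpoints; weights could be view counts or
        visibility scores.
        """
        merged = weight * desc + new_weight * new_desc
        return F.normalize(merged, dim=-1), weight + new_weight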
|
| |
| 15:00-16:30, Paper ThI2I.327 | Add to My Program |
| DualGuard MPPI: Safe and Performant Optimal Control by Combining Sampling-Based MPC and Hamilton-Jacobi Reachability |
|
| Borquez, Javier | Universidad De Santiago De Chile |
| Raus, Luke | Olin College of Engineering |
| Ciftci, Yusuf Umut | University of Southern California |
| Bansal, Somil | Stanford University |
Keywords: Robot Safety, Collision Avoidance, Motion and Path Planning
Abstract: Designing controllers that are both safe and performant is inherently challenging. This co-optimization can be formulated as a constrained optimal control problem, where the cost function represents the performance criterion and safety is specified as a constraint. While sampling-based methods, such as Model Predictive Path Integral (MPPI) control, have shown great promise in tackling complex optimal control problems, they often struggle to enforce safety constraints. To address this limitation, we propose DualGuard-MPPI, a novel framework for solving safety-constrained optimal control problems. Our approach integrates Hamilton-Jacobi reachability analysis within the MPPI sampling process to ensure that all generated samples are provably safe for the system. This integration allows DualGuard-MPPI to enforce strict safety constraints while also enabling more effective exploration of the environment with the same number of samples, reducing the effective sampling variance and leading to better performance optimization. Through several simulations and hardware experiments, we demonstrate that the proposed approach achieves much higher performance compared to existing MPPI methods, without compromising safety.
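For context, the vanilla MPPI update weights sampled rollouts by a softmin of their costs, as sketched below; in a DualGuard-style pipeline the samples would already have been constrained to the safe set before this weighting step.

    import numpy as np

    def mppi_weights(costs, lam=1.0):
        """Softmin weighting of sampled rollouts, as in standard MPPI.

        costs: (K,) total cost of each sampled trajectory; lam is the
        temperature of the path-integral weighting.
        """
        beta = costs.min()                      # subtract for numerical stability
        w = np.exp(-(costs - beta) / lam)
        return w / w.sum()

    # The control update is then the weighted average of the sampled
    # perturbations: u = u_nominal + sum_k w[k] * noise[k]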
|
| |
| 15:00-16:30, Paper ThI2I.328 | Add to My Program |
| SwarmGPT: Combining Large Language Models with Safe Motion Planning for Drone Swarm Choreography |
|
| Schuck, Martin | Technical University of Munich |
| Dahanaggamaarachchi, Dinushka Orrin | University of Toronto |
| Sprenger, Ben | ETH Zurich |
| Vyas, Vedant | University of Alberta |
| Zhou, Siqi | Technical University of Munich |
| Schoellig, Angela P. | Technical University of Munich |
Keywords: Art and Entertainment Robotics, Swarm Robotics, AI-Enabled Robotics
Abstract: Drone swarm performances---synchronized, expressive aerial displays set to music---have emerged as a captivating application of modern robotics. Yet designing smooth, safe choreographies remains a complex task requiring expert knowledge. We present SwarmGPT, a language-based choreographer that leverages the reasoning power of large language models (LLMs) to streamline drone performance design. The LLM is augmented by a safety filter that ensures deployability by making minimal corrections when safety or feasibility constraints are violated. By decoupling high-level choreographic design from low-level motion planning, our system enables non-experts to iteratively refine choreographies using natural language without worrying about collisions or actuator limits. We validate our approach through simulations with swarms up to 200 drones and real-world experiments with up to 20 drones performing choreographies to diverse types of songs, demonstrating scalable, synchronized, and safe performances. Beyond entertainment, this work offers a blueprint for integrating foundation models into safety-critical swarm robotics applications.
|
| |
| 15:00-16:30, Paper ThI2I.329 | Add to My Program |
| Efficient 3D Reconstruction in Noisy Agricultural Environments: A Bayesian Optimization Perspective for View Planning |
|
| Bacharis, Athanasios | University of Minnesota |
| Polyzos, Konstantinos | University of California San Diego |
| Nelson, Henry J. | University of Minnesota |
| Giannakis, Georgios B. | University of Minnesota |
| Papanikolopoulos, Nikos | University of Minnesota |
Keywords: Agricultural Automation, Robotics and Automation in Agriculture and Forestry, Computer Vision for Automation
Abstract: 3D reconstruction is a fundamental task in robotics that has gained attention due to its major impact in a wide variety of practical settings, including agriculture, underwater, and urban environments. While this task can be carried out using a large number of arbitrarily taken 2D images, their processing may become laborious, time-consuming, and in some instances may not provide the necessary information about the object of interest. An efficient alternative is the so-called view planning (VP), which aims to optimally place a certain number of cameras in positions that maximize the visual information. Nonetheless, in most real-world settings, existing environmental noise can significantly affect the performance of 3D reconstruction. To that end, this work advocates a novel geometric-based reconstruction quality function for VP, that accounts for the existing noise of the environment, without requiring its closed-form expression. With no analytic expression of the objective function, this work puts forth an adaptive Bayesian optimization algorithm for accurate 3D reconstruction in the presence of noise. Numerical tests on simulated and real noisy agricultural environments showcase the merits of the proposed VP approach for efficient 3D reconstruction with even a small number of available cameras.
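The adaptive loop can be pictured as standard Bayesian optimization over viewpoints: fit a Gaussian process surrogate to the scores of evaluated views and pick the next view by an acquisition rule. The GP-UCB sketch below (using scikit-learn) is a generic stand-in for the paper's noise-aware algorithm.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def next_view(candidates, X_seen, y_seen, kappa=2.0):
        """Pick the next camera pose by a GP upper-confidence-bound rule.

        candidates: (N, d) candidate viewpoint parameterizations;
        X_seen, y_seen: poses already evaluated and their measured
        reconstruction-quality scores.
        """
        gp = GaussianProcessRegressor(normalize_y=True).fit(X_seen, y_seen)
        mu, sigma = gp.predict(candidates, return_std=True)
        # Trade off exploiting high predicted quality vs. exploring
        # uncertain viewpoints.
        return candidates[np.argmax(mu + kappa * sigma)]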
|
| |
| 15:00-16:30, Paper ThI2I.330 | Add to My Program |
| Robust Real-Time Sampling-Based Motion Planner for Autonomous Vehicles in Narrow Environments (I) |
|
| Kim, Minsoo | Seoul National University |
| Esquerre-Pourtère, Arthur | Seoul National University |
| Park, Jaeheung | Seoul National University |
Keywords: Integrated Planning and Learning, Nonholonomic Motion Planning, Motion and Path Planning
Abstract: Real-time sampling-based planners increasingly use learned sampling distributions for faster planning in autonomous vehicles. These planners employ a neural network to predict the optimal path and bias some samples toward it. However, inherent prediction inaccuracies of the network often lead to suboptimal paths, especially in narrow spaces. Learned samples should be used carefully based on their accuracy, as inaccurate samples can degrade planning performance. To address this problem, this paper proposes the Learned Adaptive Anytime TargetTree-RRT* (LA3T*) algorithm. The proposed planner introduces an adaptive biasing ratio: the approach learns to assess the reliability of the learned distribution using the network's confidence, which approximates an appropriate ratio of learned samples to use, thereby adaptively maximizing planning performance while accounting for the level of prediction accuracy. Furthermore, the LA3T* algorithm incorporates the target tree algorithm: the goal pose is replaced with a set (target tree) of pre-defined optimal path segments, reducing computational effort in narrow regions. Experiments in various driving tasks explore the benefits of each component through ablation studies. The proposed algorithm significantly increases the success rate and reduces the path length in simulated and real-world scenarios compared to other sampling-based methods.
|
| |
| 15:00-16:30, Paper ThI2I.331 | Add to My Program |
| SVN-ICP: Uncertainty Estimation of ICP-Based LiDAR Odometry Using Stein Variational Newton |
|
| Ma, Shiping | Technische Universität Berlin |
| Zhang, Haoming | Technical University of Munich |
| Toussaint, Marc | TU Berlin |
Keywords: SLAM, Probabilistic Inference, Sensor Fusion
Abstract: This letter introduces SVN-ICP, a novel Iterative Closest Point (ICP) algorithm with uncertainty estimation that leverages Stein Variational Newton (SVN) on a manifold. Designed specifically for fusing LiDAR odometry in multisensor systems, the proposed method ensures accurate pose estimation and consistent noise parameter inference, even in LiDAR-degraded environments. By approximating the posterior distribution using particles within the Stein Variational Inference framework, SVN-ICP eliminates the need for explicit noise modeling or manual parameter tuning. To evaluate its effectiveness, we integrate SVN-ICP into a simple error-state Kalman filter alongside an IMU and test it across multiple datasets spanning diverse environments and vehicle platforms. Extensive experimental results demonstrate that our approach outperforms best-in-class methods in challenging scenarios while providing reliable uncertainty estimates. We release our code at https://anonymous.4open.science/r/SVN-ICP-5B77.
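For intuition, the plain Stein variational gradient update on a Euclidean particle set is sketched below; SVN additionally uses second-order (Newton) information, and SVN-ICP applies the machinery on the pose manifold, so treat this only as the underlying idea.

    import numpy as np

    def svgd_step(particles, grad_logp, step=0.1, h=1.0):
        """One Stein variational gradient descent step on a particle set.

        particles: (N, D) particle locations; grad_logp: (N, D) gradients
        of the log-posterior evaluated at each particle.
        """
        diff = particles[:, None, :] - particles[None, :, :]   # (N, N, D)
        sq = (diff ** 2).sum(-1)                                # (N, N)
        K = np.exp(-sq / (2 * h))                               # RBF kernel
        # Repulsive term: gradient of the kernel w.r.t. the source particle.
        grad_K = -(K[..., None] * diff / h).sum(0)              # (N, D)
        # Attractive term (kernel-weighted gradients) + repulsion.
        phi = (K @ grad_logp + grad_K) / particles.shape[0]
        return particles + step * phi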
|
| |
| 15:00-16:30, Paper ThI2I.332 | Add to My Program |
| Learning-Based Joint Control with Hierarchical Reinforcement Learning and On-Device Execution |
|
| Yagi, Satoshi | Kyoto University |
| Morimoto, Jun | Kyoto University |
Keywords: Machine Learning for Robot Control, Embedded Systems for Robotic and Automation, Neural and Fuzzy Control
Abstract: In typical robot learning, deep reinforcement learning policies are employed in the upper control layer to generate target joint angles for robot motion, while conventional controllers are used in the fast lower control layer to control each joint motor. This paper presents a fully neural network-based hierarchical reinforcement learning approach for real-time robot joint control. The proposed method divides joint control into two layers: a high-frequency current control policy and a low-frequency position control policy. The current control policy drives the motor to follow the target current while learning the dynamic characteristics of the joint. The position control policy generates the target current to achieve a desired joint angle, allowing learning and inference at a slower frequency. By decoupling motor dynamics from position control, our method improves learning performance and enables policy generalization across joints. Experimental results on a three-joint robotic arm demonstrate the effectiveness of the proposed approach, including posture control using a shared position control policy across joints.
|
| |
| 15:00-16:30, Paper ThI2I.333 | Add to My Program |
| SPIRA: Small Gas Pipeline Inspection Robot with Spiral Leg-Wheel Mechanism and Single Bending Joint |
|
| Kamezaki, Mitsuhiro | The University of Tokyo |
| Yamaguchi, Kaoru | Waseda University |
| Zhao, Wen | Waseda University |
| Yoshida, Kento | Waseda University |
| Koike, Toshitaka | Waseda University |
| Miyake, Shota | Waseda University |
| Sugano, Shigeki | Waseda University |
Keywords: Field Robots, Search and Rescue Robots, Mechanism Design
Abstract: Gas pipelines damaged by aging or earthquakes need a robotic system that can quickly inspect 50-mm-diameter service lines, consisting of horizontal and vertical pipes connected by elbow joints, tees, or sockets, from the inside. However, conventional pipe inspection robots do not target 50-mm pipes and various pipe types, or lack sufficient speed and a reverse function. Thus, this study develops an in-pipe inspection robot, SPIRA, capable of traveling through 50-mm pipelines while meeting the above requirements. SPIRA has three wheels inclined at 30 degrees to the horizontal, arranged around a cylinder. The cylinder is rotated by a motor around the robot's axis, and the wheels move in a spiral motion while pushing the pipe wall, enabling the robot to move stably and quickly. To overcome socket steps of up to 3 mm and diameter changes in elbow joints, the cylinder and each wheel are connected by a leg with a spring pantograph mechanism. SPIRA has two traveling units linked by a bending joint with a servomotor. When the front and rear parts rotate clockwise (counterclockwise), SPIRA moves forward (backward). When the two units rotate in opposite directions, only the central part of SPIRA rotates on the spot, changing the bending direction so that SPIRA can select any travel direction in a tee. We evaluated the travel performance of SPIRA in pipelines with horizontal and vertical sections, elbow joints, tees, and sockets, and found that it could smoothly travel through the pipelines, which is difficult for conventional robot systems to achieve.
|
| |
| 15:00-16:30, Paper ThI2I.334 | Add to My Program |
| HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction |
|
| Zhang, Wei | University of Stuttgart |
| Cheng, Qing | Technical University of Munich |
| Skuddis, David | University of Stuttgart |
| Zeller, Niclas | Karlsruhe University of Applied Sciences |
| Cremers, Daniel | Technical University of Munich |
| Haala, Norbert | University of Stuttgart |
Keywords: SLAM, Mapping, Dense Reconstruction, Visual Learning
Abstract: We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. While existing Neural SLAM and 3DGS-based SLAM methods often trade off between rendering quality and geometry accuracy, our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to enhance geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then using 3D Gaussian splatting as our core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy to maintain improved scale consistency in prior depths for finer depth details. Through extensive experiments on the Replica, ScanNet, Waymo Open, ETH3D SLAM, and ScanNet++ datasets, we demonstrate significant improvements over existing Neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality.
|
| |
| 15:00-16:30, Paper ThI2I.335 | Add to My Program |
| Curvature-Based Continuous Steering of Stiffness-Dominant Concentric Tube Robots (I) |
|
| Xie, Luhao | Southeast University |
| Zhu, Lifeng | Southeast University |
| Jin, Xiaoliang | Southeast University |
| Song, Aiguo | Southeast University |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Surgical Robotics: Planning, Modeling, Control, and Learning for Soft Robots
Abstract: Existing works on controlling a concentric tube robot (CTR) mostly focus on the trajectory of its tip position or pose. In order to safely send CTRs into a confined lumen space, we propose to continuously steer the CTR so that its entire shape always attempts to approximate target curves over time. We focus on stiffness-dominant CTRs. Considering the differential geometry of such CTR shapes, we propose to work in the curvature domain to reduce the computational cost of searching the configuration of the CTRs. With our formulation, we model the curvature control of the CTR to find the optimal translation of each tube and then search for the rotation of the tubes to fit the target shapes. We demonstrate our method using sets of different target paths. The computational time per frame, ranging between 0.1 and 0.3 seconds across all experiments, highlights the efficiency of our approach in aligning the complete shape of the CTR with specified paths. Notably, for time-varying trajectories that could be reproduced by the CTR with its maximum deployment length reaching 150 mm, the root mean square error and median error were 0.98 mm and 0.46 mm, respectively.
|
| |
| 15:00-16:30, Paper ThI2I.336 | Add to My Program |
| MIND - Multi-Feature Implicit Neural Descriptors for Robotic Surface Processing of 3D Objects with Variations in Geometry |
|
| Pratheepkumar, Anish | PROFACTOR GmbH, Vienna University of Technology |
| Hartl-Nesic, Christian | TU Wien |
| Ikeda, Markus | PROFACTOR GmbH |
| Widmoser, Fabian | Profactor Gmbh |
| Pichler, Andreas | Profactor Gmbh |
| Vincze, Markus | Vienna University of Technology |
Keywords: Computer Vision for Automation, Industrial Robots, Representation Learning
Abstract: The recent shift from mass production to mass personalization leads to a production environment in which workpieces have a high degree of geometric variations. The robotic process automation in such high-mix low-volume environments poses significant challenges since predetermined robot programs are not viable anymore. In this letter, we consider the automation of surface processing for category-level objects with significant variations in geometry by operating on point clouds without relying on CAD models. To achieve this, we present a novel multi-feature implicit neural descriptor (MIND) representation which leverages dense correspondence to generalize across diverse objects, enabling a one-shot transfer of process trajectories and associated process knowledge. The quantitative and qualitative evaluation shows that MIND outperforms other state-of-the-art dense correspondence approaches. A real-world application case study of robotic surface processing on geometry-varying basin molds validates the efficacy of the proposed approach.
|
| |
| 15:00-16:30, Paper ThI2I.337 | Add to My Program |
| PartPose: Attentive 6D Pose Estimation by Focusing on Graspable Parts of Multi-Part Deformable Objects |
|
| Okumura, Ryo | Panasonic Connect Co., Ltd |
| Taniguchi, Tadahiro | Kyoto University |
Keywords: Perception for Grasping and Manipulation, AI and Machine Learning in Manufacturing and Logistics Systems, Computer Vision for Automation
Abstract: This study tackles robotic picking of multi-part deformable objects--common in warehouses yet underexplored in the literature--such as cable-attached appliances and pouch drinks, which comprise both rigid and deformable components. Their deformability poses a challenge to model-based 6D pose estimators, such as FoundationPose, that assume rigid bodies. To address this, we present PartPose, which estimates the 6D pose of the multi-part deformable objects by focusing on the rigid components. PartPose uses Bayesian optimization to select an appropriate region of interest (ROI) and then estimates its pose with a render-and-compare pipeline. We evaluate pose-estimation and picking success rates on nine multi-part deformable objects, counting a pose estimate as successful if the translational error is <30 mm and the rotational error is <0.3 radians. PartPose significantly outperforms a FoundationPose baseline, achieving success rates of 98.2% (translational), 96.4% (rotational), and 87.2% (picking), versus 47.9%, 35.9%, and 22.8%, respectively. Moreover, PartPose generalizes category-level semantic knowledge to new instances within the same category without performance degradation when those instances have semantically similar components. This capability is crucial for large logistics centers that handle diverse and novel objects.
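The stated success criterion is easy to make precise; the snippet below computes the translational error as a Euclidean norm and the rotational error as the geodesic angle of the relative rotation, using the paper's 30 mm / 0.3 rad thresholds.

    import numpy as np

    def pose_success(R_est, t_est, R_gt, t_gt, t_tol=0.030, r_tol=0.3):
        """Success check matching the stated thresholds (30 mm, 0.3 rad).

        R_*: 3x3 rotation matrices; t_*: translations in metres.
        """
        t_err = np.linalg.norm(t_est - t_gt)              # metres
        # Geodesic angle of the relative rotation R_est^T * R_gt.
        cos_a = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
        r_err = np.arccos(np.clip(cos_a, -1.0, 1.0))      # radians
        return t_err < t_tol and r_err < r_tol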
|
| |
| 15:00-16:30, Paper ThI2I.338 | Add to My Program |
| XPRESS: X-Band Radar Place Recognition Via Elliptical Scan Shaping |
|
| Jang, Hyesu | Seadronix Corp |
| Yang, Wooseong | Seoul National University |
| Kim, Ayoung | Seoul National University |
| Lee, Dongje | Seadronix Corp |
| Kim, Hanguen | Seadronix Corp |
Keywords: Marine Robotics, Range Sensing, Localization
Abstract: X-band radar serves as the primary sensor on maritime vessels; however, its application in autonomous navigation has been limited due to low sensor resolution and insufficient information content. To enable X-band radar-only autonomous navigation in maritime environments, this paper proposes a place recognition algorithm specifically tailored for X-band radar, incorporating an object density-based rule for efficient candidate selection and intentional degradation of radar detections to achieve robust retrieval performance. The proposed algorithm was evaluated on both public maritime radar datasets and our own collected dataset, and its performance was compared against state-of-the-art radar place recognition methods. An ablation study was conducted to assess the algorithm's performance sensitivity with respect to key parameters.
|
| |
| 15:00-16:30, Paper ThI2I.339 | Add to My Program |
| X-Nav: Learning End-To-End Cross-Embodiment Navigation for Mobile Robots |
|
| Wang, Haitong | University of Toronto |
| Tan, Aaron Hao | University of Toronto |
| Fung, Angus | University of Toronto |
| Nejat, Goldie | University of Toronto |
Keywords: Vision-Based Navigation, Sensorimotor Learning, AI-Enabled Robotics
Abstract: Existing navigation methods are primarily designed for specific robot embodiments, limiting their generalizability across diverse robot platforms. In this paper, we introduce X-Nav, a novel framework for end-to-end cross-embodiment navigation where a single unified policy can be deployed across various embodiments for both wheeled and quadrupedal robots. X-Nav consists of two learning stages: 1) multiple expert policies are trained using deep reinforcement learning with privileged observations on a wide range of randomly generated robot embodiments; and 2) a single general policy is distilled from the expert policies via navigation action chunking with transformer (Nav-ACT). The general policy directly maps visual and proprioceptive observations to low-level control commands, enabling generalization to novel robot embodiments. Simulated experiments demonstrated that X-Nav achieved zero-shot transfer to both unseen embodiments and photorealistic environments. A scalability study showed that the performance of X-Nav improves when trained with an increasing number of randomly generated embodiments. An ablation study confirmed the design choices of X-Nav. Furthermore, real-world experiments were conducted to validate the generalizability of X-Nav in real-world environments.
|
| |
| 15:00-16:30, Paper ThI2I.340 | Add to My Program |
| Spatiotemporal Calibration for Laser Vision Sensor in Hand-Eye System Based on Straight-Line Constraint |
|
| Yang, Peiwen | The Hong Kong Polytechnic University (PolyU) |
| Jiang, Mingquan | Hezhou University |
| Shen, Xinyue | Shanghai University |
| Zhang, Heping | Central China Normal University |
| |
| 15:00-16:30, Paper ThI2I.341 | Add to My Program |
| GENIE: A Generalizable Navigation System for In-The-Wild Environments |
|
| Wang, Jiaming | National University of Singapore |
| Liu, Diwen | National University of Singapore |
| Chen, Jizhuo | National University of Singapore |
| Da, Jiaxuan | National University of Singapore |
| Qian, Nuowen | National University of Singapore |
| Tram, Minh Man | National University of Singapore |
| Soh, Harold | National University of Singapore |
Keywords: Autonomous Vehicle Navigation, Vision-Based Navigation
Abstract: Reliable navigation in unstructured, real-world environments remains a significant challenge for embodied agents, especially when operating across diverse terrains, weather conditions, and sensor configurations. In this paper, we introduce GeNIE (Generalizable Navigation System for In-the-Wild Environments), a robust navigation framework designed for global deployment. GeNIE integrates a generalizable traversability prediction model built on SAM2 with a novel path fusion strategy that enhances planning stability in noisy and ambiguous settings. We deployed GeNIE in the Earth Rover Challenge (ERC) at ICRA 2025, where it was evaluated across six countries spanning three continents. GeNIE took first place and achieved 79% of the maximum possible score, outperforming the second-best team by 17%, and completed the entire competition without a single human intervention. These results set a new benchmark for robust, generalizable outdoor robot navigation. We will release the codebase, pretrained model weights, and newly curated datasets to support future research in real-world navigation.
|
| |
| 15:00-16:30, Paper ThI2I.342 | Add to My Program |
| Equilibrium State for a Tailless Flapping Wing Micro Air Vehicle in Forward Flight |
|
| Sanchez-Laulhe, Ernesto | University of Malaga |
| de Croon, Guido | TU Delft |
| Ollero, Anibal | AICIA. G41099946 |
Keywords: Biologically-Inspired Robots, Aerial Systems: Mechanics and Control, Dynamics
Abstract: Flapping wing Micro Air Vehicles (FWMAVs) hold great potential for real-world applications but are currently still hard to model. In this article, a simplified analysis of the equilibrium state of a tailless FWMAV in forward flight is presented. The definition of the equilibrium state complements previous dynamic and stability analysis, adding new information about the flight behavior of FWMAVs. A new aerodynamic decoupled model has been used for the analysis, considering separately the thrust force generated by the flapping movement and the lift and drag caused by the forward velocity. The aerodynamic forces are included in a dynamic model of the FWMAV, and the equilibrium state is derived. The formulation obtained is explicit in terms of the pitch actuator deflection, thus allowing its use for control corrections, and provides an estimation of the flight velocity. The thrust needed to maintain height is also formulated, demonstrating that forward flight is more efficient than hovering. The results are validated experimentally for the pitch angle, showing good agreement with the analytical results. Then, the dynamics of the FWMAV are simulated, comparing the results with experiments where the FWMAV goes from hovering to a specific pitch reference while maintaining its height. Additional simulations are performed with basic control considerations, showing how considering the equilibrium state for a feed-forward control significantly improves the flight behavior compared to PI and PID controllers, reducing the convergence time.
|
| |
| 15:00-16:30, Paper ThI2I.343 | Add to My Program |
| Unveiling Uncertainty-Aware Autonomous Cooperative Learning Based Planning Strategy |
|
| Zhang, Shiyao | Great Bay University |
| Deng, Liwei | Great Bay University |
| Zhang, Shuyu | The Hong Kong Polytechnic University |
| Yuan, Weijie | Southern University of Science and Technology |
| Zhang, Hong | Southern University of Science and Technology |
Keywords: Collision Avoidance, Motion and Path Planning, Reinforcement Learning
Abstract: In future intelligent transportation systems, autonomous cooperative planning (ACP) becomes a promising technique to increase the effectiveness and security of multi-vehicle interactions. However, existing ACP strategies cannot fully address multiple sources of uncertainty, e.g., in perception, planning, and communication. To address these, a novel deep reinforcement learning-based autonomous cooperative planning (DRLACP) framework is proposed to tackle various uncertainties in cooperative motion planning schemes. Specifically, the soft actor-critic (SAC) with the implementation of gated recurrent units (GRUs) is adopted to learn the deterministic optimal time-varying actions under imperfect state information caused by planning, communication, and perception uncertainties. In addition, the real-time actions of autonomous vehicles (AVs) are demonstrated via the Car Learning to Act (CARLA) simulation platform. Evaluation results show that the proposed DRLACP learns and performs cooperative planning effectively, outperforming other baseline methods under different scenarios with imperfect AV state information.
|
| |
| 15:00-16:30, Paper ThI2I.344 | Add to My Program |
| Continuous Gaussian Process Pre-Optimization for Asynchronous Event-Inertial Odometry |
|
| Wang, Zhixiang | Northwestern Polytechnical University |
| Li, Xudong | Northwestern Polytechnical University |
| Zhang, Yizhai | Northwestern Polytechnical University |
| Zhang, Fan | Northwestern Polytechnical Univeristy |
| Huang, Panfeng | Northwestern Polytechnical University |
Keywords: Sensor Fusion, Visual-Inertial SLAM, Vision-Based Navigation
Abstract: Event cameras, as bio-inspired sensors, are asynchronously triggered with high temporal resolution compared to intensity cameras. Recent work has focused on fusing event measurements with inertial measurements to enable ego-motion estimation in high-speed and HDR environments. However, existing methods predominantly rely on IMU preintegration designed mainly for synchronous sensors and discrete-time frameworks. In this paper, we propose GPO, a continuous-time preintegration framework that can efficiently achieve tightly-coupled fusion of fully asynchronous sensors. Concretely, we model the preintegration as two local Temporal Gaussian Process (TGP) trajectories and leverage a lightweight two-step optimization to infer the continuous preintegration pseudo-measurements. We show that the Jacobians of arbitrarily queried states can be naturally propagated using our framework, which enables GPO to be involved in asynchronous fusion. Our method realizes linear and constant time costs for optimization and query, respectively. To further validate the proposal, we leverage GPO to design an asynchronous event-inertial odometry and compare it with other asynchronous fusion schemes. Experiments conducted on both public and self-collected datasets demonstrate that the proposed GPO offers significant advantages in terms of accuracy and efficiency, outperforming existing approaches in handling asynchronous sensor fusion. Our method will be made open source to benefit the community.
|
| |
| 15:00-16:30, Paper ThI2I.345 | Add to My Program |
| Modular Actuator for Multimodal Proprioceptive and Kinesthetic Feedback of Robotic Hands |
|
| Park, Sungwoo | Korea University, KIST |
| Lim, Myo-Taeg | Korea University |
| Hwang, Donghyun | Korea Institute of Science and Technology |
Keywords: Actuation and Joint Mechanisms, Multifingered Hands, Force and Tactile Sensing
Abstract: This study addresses the challenge of implementing proprioceptive and kinesthetic (PK) feedback in robotic hands, essential for grasping and manipulation tasks in unstructured environments. We developed a compact modular actuator featuring a low-module, high-transmission-ratio multistage gear mechanism that measures 25×10×24 mm, weighs only 10 grams, and maintains moderate backdrivability. The actuator provides multimodal PK feedback, capturing position, velocity, current, and torque data, which are critical for performing various grasping and manipulation tasks. To enable precise motion and force control, we introduced a new adaptive velocity estimator and a simplified Reaction Torque Observer (RTOB). Comprehensive experiments demonstrated the actuator’s ability to accurately detect surface shape, roughness, and stiffness of target objects, eliminating the need for additional sensors or space. Experimental results confirmed the actuator’s precision, achieving measurement errors of 5.8 mrad for position, 0.19 rad/s for velocity, and 0.011 N·m for torque. These findings highlight the actuator’s ability to leverage proprioceptive information, significantly enhancing the functionality and adaptability of robotic hands in diverse and dynamic scenarios.
|
| |
| 15:00-16:30, Paper ThI2I.346 | Add to My Program |
| Squeezing the Last Drop of Accuracy: Hand-Eye Calibration Via Deep Reinforcement Learning-Guided Pose Tuning |
|
| Shin, Seunghui | Kyung Hee University |
| Kim, Daeho | Kyung Hee University |
| Hwang, Hyoseok | Kyung Hee University |
Keywords: Calibration and Identification, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Hand-eye calibration is a fundamental task in robotics, requiring high precision to ensure accurate manipulation. This is especially crucial for recent markerless methods, which depend on precise pose estimation for effective end-effector calibration. In this paper, we propose a novel approach that improves calibration performance by adjusting the end-effector's pose to reduce prediction error. Our method utilizes a reward structure derived from trained pose estimation networks, enabling a Soft Actor-Critic-Discrete agent to learn in a simulated environment how to enhance calibration performance through action selection. Our experiments show that calibration results achieved with our method outperform those from initial poses alone in both markerless and marker-based methods. Real-world experiments further validate the efficacy of our approach in actual robotic systems. These results demonstrate that our proposed method effectively enhances the performance of pose estimation-based hand-eye calibration.
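For background, the downstream solve that consumes the collected pose pairs is the classic AX = XB hand-eye problem, available in OpenCV as shown below; the paper's contribution is learning to choose better end-effector poses to feed such a solver, not the solver itself.

    import cv2

    def solve_hand_eye(R_g2b, t_g2b, R_t2c, t_t2c):
        """Classic AX = XB hand-eye calibration via OpenCV.

        R_g2b, t_g2b: lists of gripper-to-base rotations/translations at
        each robot pose; R_t2c, t_t2c: corresponding target-to-camera
        poses from a pose estimator (marker-based or markerless).
        Returns the camera-to-gripper rotation and translation.
        """
        R, t = cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c,
                                    method=cv2.CALIB_HAND_EYE_TSAI)
        return R, t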
|
| |
| 15:00-16:30, Paper ThI2I.347 | Add to My Program |
| A Practical Multi-Body Model Enabling a Flexible-Wheeled Robot to Learn Blind Stair Climbing |
|
| Yoon, Chan-Young | Kookmin University |
| Cho, Baek-Kyu | Kookmin University |
Keywords: Reinforcement Learning, Wheeled Robots, Flexible Robotics
Abstract: Controlling a flexible wheeled robot for complex tasks such as stair climbing is highly challenging. The nonlinearity inherent in soft materials hinders accurate modeling, creating a trade-off in Reinforcement Learning (RL) between simulation fidelity and learning speed. We propose an RL-friendly, multi-body model that approximates the deformation of the flexible wheel as a Mass-Spring-Damper (MSD) system composed of rigid links and joints. This model enables end-to-end RL within a fast rigid-body simulator, facilitating a blind control policy that relies solely on proprioceptive feedback. To reduce the reality gap and enhance policy robustness, we randomize the main parameters of the MSD system. In real-world experiments, a robot successfully climbed an 18 cm step, corresponding to approximately 51% of the wheel radius—a feat impossible for a rigid-wheeled equivalent. To our knowledge, this is the first successful application of RL-based blind control for stair climbing with a flexible wheeled robot. However, structural limitations in our model and challenges in parameter identification hinder sim-to-real transfer, and improving robustness remains a key issue for future work.
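Each virtual joint in such a model applies a simple spring-damper restoring torque, as sketched below; randomizing the stiffness and damping constants is what the paper uses to bridge the model's approximation error. The values shown are placeholders, not identified parameters.

    def msd_joint_torque(x, x_rest, v, k=800.0, c=5.0):
        """Restoring torque of one virtual spring-damper in the wheel model.

        Each rigid link approximating the flexible rim is pulled back
        toward its rest configuration x_rest; x is the current deflection
        and v its rate. k and c are illustrative placeholder constants.
        """
        return -k * (x - x_rest) - c * v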
|
| |
| 15:00-16:30, Paper ThI2I.348 | Add to My Program |
| A Soft-Rigid Hybrid Robot-Assisted Feeding System with a Tendon-Driven Continuum Robot |
|
| Chen, Jingyi | University of Science and Technology of China |
| Qiu, Quecheng | School of Data Science, USTC, Hefei 230026, China |
| Ji, Jianmin | University of Science and Technology of China |
Keywords: Soft Robot Applications, Imitation Learning, Safety in HRI
Abstract: Active delivery of food to a human mouth in a controlled and safe manner remains a key challenge for robot-assisted feeding systems (RAFSs). Existing RAFS designs struggle to simultaneously achieve efficiency and safety: rigid manipulators offer fast and accurate motion but risk hazardous contact, while soft robots provide passive compliance at the cost of limited speed or workspace. To meet the specific demands of feeding tasks, we design a tendon-driven continuum robot that allows precise orientation control of the utensil while exhibiting strong passive compliance in position. Integrating it with a 6-DoF rigid robot for fast and long-range positioning, we propose a hybrid RAFS architecture that achieves safe, efficient, and accurate food delivery. Controlling a passive-compliant RAFS to acquire various foods is non-trivial: physical modeling struggles with the complex interactions between the soft robot and food, while typical imitation learning methods lead to discontinuous or distorted movements due to the passive deformation. To handle this, we design a pose-torque learning policy that enables the soft and rigid robots to generate coherent and synchronized movements, offering a case-specific solution to the long-standing challenge of soft robot imitation learning. Experiments show that our method achieves a food acquisition success rate of 76.7%, while user tests with 14 volunteers confirm user preference, marking our RAFS as a practical step toward safe and efficient robotic feeding.
|
| |
| 15:00-16:30, Paper ThI2I.349 | Add to My Program |
| Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments |
|
| Xiao, Renxiang | Harbin Institute of Technology, Shenzhen |
| Liu, Wei | Harbin Institute of Technology, Shenzhen |
| Zhang, Yuanfan | Harbin Institute of Technology |
| Chen, YuShuai | Harbin Institute of Technology, Shenzhen |
| Chen, Jinming | Harbin Institute of Technology, Shenzhen |
| Wang, Zilu | Harbin Institute of Technology (Shenzhen) |
| Hu, Liang | Harbin Institute of Technology, Shenzhen |
| |
| 15:00-16:30, Paper ThI2I.350 | Add to My Program |
| Robust Admittance Control of an Electric Underwater Manipulator for Precise Motion and Safe Contact Inspection of Hydraulic Structures |
|
| Wang, Fei | Southeast University |
| Liu, Haixin | Southeast University |
| Cao, Lin | Southeast University |
| Hou, Shitong | Southeast University |
| Song, Guangming | Southeast University |
| Song, Aiguo | Southeast University |
Keywords: Marine Robotics, Force Control, Engineering for Robotic Systems
Abstract: Inspection of hydraulic structures is crucial for ensuring the reliability and safety of infrastructure. Although underwater manipulators are essential tools, existing systems often lack sufficient compliance and safe interaction capabilities. This study develops a novel underwater manipulator system with a robust admittance control framework designed specifically for safe contact inspection tasks. The manipulator integrates a 6-axis force/torque sensor for contact force measurement and an ultrasonic detector for structural inspection. An underwater force estimation algorithm is implemented to ensure accurate force measurement under varying flow conditions. The proposed robust admittance control strategy comprises an inner-loop position controller, enhanced by an unknown system dynamics estimator and super-twisting sliding mode control, to counteract hydrodynamic disturbances and improve trajectory tracking accuracy. An outer-loop variable admittance controller, incorporating a variable damping mechanism and adaptive feedback compensation, ensures compliant interaction and precise force control with minimal overshoot. Extensive experiments, including force measurement, motion and contact force control, and underwater thickness measurement, demonstrate the system's excellent performance, validating its effectiveness for hydraulic structure inspection tasks.
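A minimal sketch of the outer-loop variable-admittance idea, reduced to one degree of freedom; the variable-damping law and all gains are illustrative assumptions, not the paper's tuned values.

```python
# 1-DoF outer-loop admittance update of the general form
#   M*x_ddot + D(t)*x_dot + K*(x - x_ref) = f_ext - f_des.
def admittance_step(x, x_dot, x_ref, f_ext, f_des, dt,
                    M=5.0, K=200.0, d_min=20.0, d_max=120.0):
    # Variable damping: more damping at higher speeds to limit overshoot.
    D = d_min + (d_max - d_min) * min(abs(x_dot) / 0.1, 1.0)
    x_ddot = (f_ext - f_des - D * x_dot - K * (x - x_ref)) / M
    x_dot = x_dot + x_ddot * dt
    x = x + x_dot * dt
    return x, x_dot  # reference passed to the inner-loop position controller

x, v = 0.0, 0.0
for _ in range(1000):  # 2 s at 500 Hz
    x, v = admittance_step(x, v, x_ref=0.0, f_ext=5.0, f_des=3.0, dt=0.002)
print(round(x, 4))  # settles near (f_ext - f_des) / K = 0.01
```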
|
| |
| 15:00-16:30, Paper ThI2I.351 | Add to My Program |
| GSON: A Group-Based Social Navigation Framework with Large Multimodal Model |
|
| Luo, Shangyi | Tsinghua Shenzhen International Graduate School |
| Sun, Peng | Tsinghua Shenzhen International Graduate School |
| Zhu, Ji | Tsinghua Shenzhen International Graduate School |
| Deng, Yuhong | National University of Singapore |
| Yu, Cunjun | National University of Singapore |
| Xiao, Anxing | National University of Singapore |
| Wang, Xueqian | Tsinghua Shenzhen International Graduate School |
Keywords: Vision-Based Navigation, AI-Enabled Robotics, Behavior-Based Systems
Abstract: With the increasing presence of service robots and autonomous vehicles in human environments, navigation systems need to evolve beyond simply reaching a destination to incorporate social awareness. This paper introduces GSON, a novel group-based social navigation framework that leverages Large Multimodal Models (LMMs) to enhance robots' social perception capabilities. Our approach uses visual prompting to enable zero-shot extraction of social relationships among pedestrians and integrates these results with robust pedestrian detection and tracking pipelines to overcome the inherent inference-speed limitations of LMMs. The planning system incorporates a mid-level planner that sits between global path planning and local motion planning, effectively preserving both global context and reactive responsiveness while avoiding disruption of the predicted social groups. We validate GSON through extensive real-world mobile robot navigation experiments involving complex social scenarios such as queuing, conversations, and photo sessions. Comparative results show that our system significantly outperforms existing navigation approaches in minimizing social perturbations while maintaining comparable performance on traditional navigation metrics.
|
| |
| 15:00-16:30, Paper ThI2I.352 | Add to My Program |
| Development of the Bioinspired Tendon-Driven DexHand 021 with Proprioceptive Compliance Control |
|
| Yuan, Jianbo | Shanghai Jiao Tong University |
| Haohua, Zhu | Zhejiang Dexterous Intelligent Technology Co., Ltd |
| Dai, Jing | Shanghai Jiao Tong University |
| Yi, Sheng | Zhejiang Dexterous Intelligent Technology Co., Ltd |
Keywords: Multifingered Hands, Biologically-Inspired Robots
Abstract: The human hand plays a vital role in daily life and industrial applications, yet replicating its multifunctional capabilities, including motion, sensing, and coordinated manipulation, with robotic systems remains a formidable challenge. Developing a dexterous robotic hand requires balancing human-like agility with engineering constraints such as complexity, size-to-weight ratio, durability, and force-sensing performance. This letter presents DexHand 021, a high-performance, cable-driven five-finger robotic hand with 12 active and 7 passive degrees of freedom (DoFs), achieving 19-DoF dexterity in a lightweight 1 kg design. We propose a proprioceptive force-sensing-based admittance control method to enhance manipulation. Experimental results demonstrate its superior performance: a single-finger load capacity exceeding 10 N, fingertip repeatability under 0.001 m, and force estimation errors below 0.2 N. Compared to PID control, joint torques in multi-object grasping are reduced by 31.19%, which significantly improves force-sensing capability while preventing overload during collisions. The hand excels in both power and precision grasps, successfully executing 33 GRASP-taxonomy motions and complex manipulation tasks. This work advances the design of lightweight, industrial-grade dexterous hands and enhances proprioceptive control, contributing to robotic manipulation and intelligent manufacturing.
|
| |
| 15:00-16:30, Paper ThI2I.353 | Add to My Program |
| ILCL: Inverse Logic-Constraint Learning from Temporally Constrained Demonstrations |
|
| Cho, Minwoo | Korea Advanced Institute of Science and Technology(KAIST) |
| Jang, Jaehwi | Georgia Institute of Technology |
| Park, Daehyung | Korea Advanced Institute of Science and Technology, KAIST |
Keywords: Formal Methods in Robotics and Automation, Learning from Demonstration, Reinforcement Learning
Abstract: We aim to solve the problem of temporal-constraint learning from demonstrations to reproduce demonstration-like logic-constrained behaviors. Learning logic constraints is challenging due to the combinatorially large space of possible specifications and the ill-posed nature of non-Markovian constraints. To this end, we introduce inverse logic-constraint learning (ILCL), a novel temporal-constraint learning method formulated as a two-player zero-sum game between 1) a genetic algorithm-based temporal-logic mining (GA-TL-Mining) and 2) logic-constrained reinforcement learning (Logic-CRL). GA-TL-Mining efficiently constructs syntax trees for parameterized truncated linear temporal logic (TLTL) without predefined templates. Subsequently, Logic-CRL finds a policy that maximizes task rewards under the constructed TLTL constraints via a novel constraint redistribution scheme. Our evaluations show ILCL outperforms state-of-the-art baselines in learning and transferring TL constraints on four temporally constrained tasks. We also demonstrate successful transfer to real-world peg-in-shallow-hole tasks.
|
| |
| 15:00-16:30, Paper ThI2I.354 | Add to My Program |
| A Nonlinear Control Allocation Strategy for Dual Half Bridge Power Converters (I) |
|
| Castro, Ricardo | UC Merced |
| Araújo, Rui Esteves | University of Porto, Faculty of Engineering |
| Brembeck, Jonathan | German Aerospace Center (DLR) |
| |
| 15:00-16:30, Paper ThI2I.355 | Add to My Program |
| Safe Autonomous Environmental Contact for Soft Robots Using Control Barrier Functions |
|
| Dickson, Akua | Boston University |
| Pacheco Garcia, Juan | Boston University |
| Anderson, Meredith | Boston University |
| Jing, Ran | Boston University |
| Alizadeh-Shabdiz, Sarah | University |
| Wang, Audrey | Boston University |
| DeLorey, Charles | University of Leeds |
| Patterson, Zachary | Case Western Reserve University |
| Sabelhaus, Andrew | Boston University |
Keywords: Modeling, Control, and Learning for Soft Robots, Robot Safety, Motion Control
Abstract: Robots built from soft materials will inherently apply lower environmental forces than their rigid counterparts, and therefore may be more suitable in sensitive settings with unintended contact. However, these robots' applied forces result from both their design and their control system in closed-loop, and therefore, ensuring bounds on these forces requires controller synthesis for safety as well. This article introduces the first feedback controller for a soft manipulator that formally meets a safety specification with respect to environmental contact. In our proof-of-concept setting, the robot's environment has known geometry and is deformable with a known elastic modulus. Our approach maps a bound on applied forces to a safe set of positions of the robot's tip via predicted deformations of the environment. Then, a quadratic program with Control Barrier Functions in its constraints is used to supervise a nominal feedback signal, verifiably maintaining the robot's tip within this safe set. Hardware experiments on a multi-segment soft pneumatic robot demonstrate that the proposed framework successfully maintains a positive safety margin. This framework represents a fundamental shift in perspective on control and safety for soft robots, implementing a formally verifiable logic specification on their pose and contact forces.
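A toy version of the CBF-QP supervision pattern described above, using cvxpy and a single-integrator stand-in for the tip dynamics; the safe set, dynamics, and gain alpha are simplifying assumptions.

```python
import cvxpy as cp

def cbf_filter(x, u_nom, x_limit=0.05, alpha=5.0):
    # Safe set: h(x) = x_limit - x >= 0, with tip dynamics x_dot = u.
    u = cp.Variable()
    h = x_limit - x
    # CBF condition h_dot + alpha*h >= 0 becomes -u + alpha*h >= 0,
    # used as a constraint supervising the nominal command u_nom.
    prob = cp.Problem(cp.Minimize(cp.square(u - u_nom)),
                      [-u + alpha * h >= 0])
    prob.solve()
    return float(u.value)

print(cbf_filter(x=0.045, u_nom=0.2))  # nominal command attenuated near the boundary
```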
|
| |
| 15:00-16:30, Paper ThI2I.356 | Add to My Program |
| The Challenges of Using Robots to Automate the Recycling of Electronic Devices |
|
| Ude, Ales | Jozef Stefan Institute |
| Simonic, Mihael | Jozef Stefan Institute |
| Kuster, Boris | Jozef Stefan Institute |
| Mavsar, Matija | Jozef Stefan Institute |
| Bem, Martin | Jozef Stefan Institute |
| Ruiz, Sebastian | University of Göttingen |
| Tamosiunaite, Minija | University of Göttingen |
| Catalano, Manuel Giuseppe | Istituto Italiano Di Tecnologia |
| Tincani, Vinicio | University of Pisa |
| Bicchi, Antonio | Istituto Italiano Di Tecnologia |
| Karacan, Kübra | Technical University of Munich |
| Sadeghian, Hamid | Technical University of Munich |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
| Persichini, Riccardo | University of Pisa |
| Fröhlich, Hannes | ELECTROCYCLING |
| Wörgötter, Florentin | University of Göttingen |
Keywords: Disassembly, Soft Robot Applications, Software-Hardware Integration for Robot Systems
Abstract: This paper tackles the challenges of automating battery removal from small electronic devices, such as heat cost allocators and smoke detectors. The process is critical for mitigating fire hazards caused by lithium batteries in recycling facilities and for supporting a circular economy. We focus on advanced methodologies and robotic technologies designed to overcome the significant hurdles posed by the diverse range of device designs, complex battery compartments, and varying states of damage. Our approach integrates Vision-Language Models (VLMs) for real-time, adaptive disassembly planning, computer vision, tactile skills, soft robotics, and reconfigurable robotic workcells to enhance perception, dexterity, and adaptability in handling diverse device designs and damage states. Additionally, a reconfigurable robotic workcell with modular hardware and standardized interfaces enables seamless adaptation to various devices. Laboratory testing demonstrates improved efficiency and reduced manual intervention, highlighting the potential of AI-driven, reconfigurable robotics for scalable and sustainable e-waste recycling.
|
| |
| 15:00-16:30, Paper ThI2I.357 | Add to My Program |
| SDRS: Shape-Differentiable Robot Simulator |
|
| Ye, Xiaohan | The University of Hong Kong |
| Gao, Xifeng | Tencent America |
| Wu, Kui | Tencent |
| Pan, Zherong | Tencent America |
| Komura, Taku | The University of Hong Kong |
Keywords: Simulation and Animation, Contact Modeling, Dynamics, Robot Design
Abstract: Robot simulators are indispensable tools across many fields, and recent research has significantly improved their functionality by incorporating additional gradient information. However, existing differentiable robot simulators suffer from non-differentiable singularities when robots undergo substantial shape changes. To address this, we present the Shape-Differentiable Robot Simulator (SDRS), designed to remain differentiable under significant robot shape changes. The core innovation of SDRS lies in its representation of robot shapes using a set of convex polyhedrons. This approach allows us to generalize smooth, penalty-based contact mechanics for interactions between any pair of convex polyhedrons. Using the separating hyperplane theorem, SDRS introduces a separating plane for each pair of contacting convex polyhedrons. This separating plane functions as a zero-mass auxiliary entity, with its state determined by the principle of least action. This setup ensures global differentiability, even as robot shapes undergo significant geometric and topological changes. To demonstrate the practical value of SDRS, we provide examples of robot co-design scenarios.
|
| |
| 15:00-16:30, Paper ThI2I.358 | Add to My Program |
| Global-State-Free Obstacle Avoidance for Quadrotor Control in Air-Ground Cooperation |
|
| Zhang, Baozhe | The Chinese University of Hong Kong, Shenzhen |
| Chen, Xinwei | Zhejiang University |
| Chen, Qingcheng | Shanghai Institute of Special Equipment Inspection & Technical Research |
| Xu, Chao | Zhejiang University |
| Gao, Fei | Zhejiang University |
| Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
| |
| 15:00-16:30, Paper ThI2I.359 | Add to My Program |
| Gaussian or Plane? Both: Semantic-Driven Voxel Representation for LiDAR–Inertial Odometry |
|
| Wu, Haiyang | University of Twente |
| Vosselman, George | University of Twente |
| Lehtola, Ville | University of Twente |
Keywords: SLAM, Range Sensing, Localization
Abstract: Accurate LiDAR-inertial odometry (LIO) highly depends on the geometric fidelity of the underlying environment representation. We explore the new and interesting research direction of integrating semantic segmentation models into metric odometry algorithms to enrich their representational capacity. Specifically, this letter proposes a semantic-driven hybrid voxel representation in which an off-the-shelf 3D segmentation network assigns every voxel to either a planar or nonplanar class, using planar and Gaussian representations, respectively. Consequently, a hybrid scan matching strategy is presented using class-specific residual models that are tailored to the distinct error statistics of each surface category. The scan matcher is embedded within an Iterated Extended Kalman Filter (IEKF) for odometry and mapping. We evaluate our method on diverse platforms and environments, and show improved localization accuracy across various indoor and outdoor scenarios, while maintaining real-time performance.
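A sketch of the class-specific residuals such a hybrid representation implies: a point-to-plane error for voxels labeled planar and a Mahalanobis distance for nonplanar (Gaussian) voxels. The exact residual weighting inside the paper's IEKF is not reproduced here.

```python
import numpy as np

def planar_residual(p, n, d):
    """Point-to-plane error for a planar voxel, plane: n.x + d = 0."""
    return float(n @ p + d)

def gaussian_residual(p, mu, cov):
    """Mahalanobis distance of a point to a nonplanar voxel's Gaussian."""
    diff = p - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

p = np.array([0.1, 0.2, 1.0])
print(planar_residual(p, n=np.array([0.0, 0.0, 1.0]), d=-1.0))
print(gaussian_residual(p, mu=np.zeros(3), cov=np.diag([0.1, 0.1, 0.5])))
```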
|
| |
| 15:00-16:30, Paper ThI2I.360 | Add to My Program |
| Novel Robotic Fleet for Sample Recovery in Lunar Craters: A Concept of Operations (I) |
|
| Jangale, Rishi | Texas A&M University |
| Pravecek, Derek | Texas A&M University |
| Lam, Sarah | Texas A&M University |
| McDougall, David | Texas A&M University |
| Trevino, Mauricio | Texas A&M University |
| Villanueva, Aaron | Texas A&M University |
| Land, Jonas | Texas A&M University |
| De Leon, Heaven | Texas A&M University |
| Oevermann, Micah | Texas A&M University |
| Ambrose, Robert | Texas A&M University |
Keywords: Multi-Robot Systems, Space Robotics and Automation, Cooperating Robots
Abstract: Exploration of extraterrestrial surfaces, such as the lunar surface, can prove treacherous for humans and robots alike, and requires highly specialized mobility platforms to ensure the success of a mission and the safety of any operators. However, these specialized machines may limit the overall scope of a mission by performing poorly outside a particular environment. Thus, for maximum capability, a team of distinct but complementary specialized robots and vehicles may be used to expand mission capabilities in lunar environments. In this paper, a concept of operations for exploration of a lunar crater utilizing a collaboration between a wheeled rover, represented by the RAD Exploration Vehicle (REV), and a non-traditional spherical robot, represented by RoboBall II, is introduced. These robots are used as analogs for mission-capable robots such as NASA's Chariot rover and the larger RoboBall III. The design of these robots, along with their collaborative features and intended operational environments, is discussed. A controller that enables RoboBall to perform controlled descent on slopes is presented. Further, a ballistic sample return module for collection and ex situ analysis of a sample from the bottom of a lunar crater, along with potential navigational mechanisms to facilitate efficient recovery, is presented. Finally, a mission analog using RoboBall III and the ballistic sample return module, conducted in a former quarry, is demonstrated.
|
| |
| 15:00-16:30, Paper ThI2I.361 | Add to My Program |
| AttBEV: Enhancing Multi-Modal 3D Object Detection with CBAM Attention in BEVFusion for Autonomous Driving |
|
| Zhang, Na | Polytechnic University of Catalonia |
| Guerra Paradas, Edmundo | Polytechnic University of Catalonia |
| Grau Saldes, Antoni | Polytechnic University of Catalonia |
Keywords: Computer Vision for Transportation, Sensor Fusion, Intelligent Transportation Systems
Abstract: Multimodal fusion has important research value in environmental perception for autonomous driving. Among existing approaches, BEVFusion has become one of the mainstream frameworks for LiDAR-camera fusion by unifying multimodal features in the bird's-eye view (BEV) space. However, its performance is limited by inefficient cross-modal interaction and information loss during BEV projection, especially for dynamic objects and edge cases. To address these limitations, we propose AttBEV, an advanced fusion architecture that introduces a Convolutional Block Attention Module (CBAM) at the feature-fusion layer: a lightweight attention mechanism that improves the model's ability to capture key information through dynamic feature calibration along the channel and spatial dimensions. Extensive experiments on the nuScenes dataset demonstrate that AttBEV achieves superior performance compared to BEVFusion on most evaluation metrics. NDS reaches 0.6795, which is 2.63% higher than BEVFusion's 0.6532, and mAP reaches 0.6426, which is 1.79% higher than BEVFusion's 0.6247. In general, AttBEV outperforms existing methods in both model accuracy and generalization ability and significantly improves the performance of 3D object detection in autonomous driving scenarios.
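For reference, a standard CBAM block (Woo et al., 2018) of the kind inserted at the feature-fusion layer; the channel count and reduction ratio below are arbitrary illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Shared MLP for channel attention over pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        # Single conv for spatial attention over [avg, max] channel maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

print(CBAM(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```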
|
| |
| 15:00-16:30, Paper ThI2I.362 | Add to My Program |
| Ensemble-Based Event Camera Place Recognition under Varying Illumination |
|
| Joseph, Therese | Queensland University of Technology |
| Fischer, Tobias | Queensland University of Technology |
| Milford, Michael J | Queensland University of Technology |
Keywords: Localization, Computer Vision for Transportation
Abstract: Compared to conventional cameras, event cameras provide a high dynamic range and low latency, offering greater robustness to rapid motion and challenging lighting conditions. Although the potential of event cameras for visual place recognition (VPR) has been established, the development of robust VPR frameworks under severe illumination changes remains an open research problem. In this paper, we introduce an ensemble-based approach to event camera place recognition that combines sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions. Unlike previous event-based ensemble methods, which are limited to varying the temporal resolution, our broader fusion strategy delivers significantly improved robustness under varied lighting conditions (afternoon, sunset, night, etc.), achieving up to a 77% relative improvement in Recall@1 across day-night transitions. We evaluate our approach on two long-term driving datasets (with 8 km per traverse) without metric subsampling, thereby preserving natural variations in speed and stop duration that influence event density. We also conduct a comprehensive analysis of key design choices, including binning strategies, reconstruction methods, and feature extractors, to identify the components most critical for robust performance. Additionally, we propose a modification to the standard sequence matching framework that enhances performance at longer sequence lengths. To facilitate future research, we will release our codebase and benchmarking framework.
|
| |
| 15:00-16:30, Paper ThI2I.363 | Add to My Program |
| Distributed Bearing-Only Formation Maneuvering Control for Quadrotors without Global Reference Frame |
|
| Li, Shaoshi | Beihang University |
| Zhang, Yuwei | Beihang University |
| Wang, Shaoping | Beihang University |
| Mu, Rui | Beihang University |
| Wang, Xingjian | Beihang University |
Keywords: Multi-Robot Systems, Distributed Robot Systems
Abstract: Most existing bearing-only formation control methods require that the relative bearings among neighboring agents be measured in a global reference frame known to each individual. To remove this constraint, this paper introduces a distributed formation control scheme for quadrotors that uses only bearing measurements in each vehicle's local reference frame. To this end, a prescribed-time quaternion-based orientation estimator is first proposed for each follower to estimate the leader's orientation without knowledge of the global reference frame. Second, a bearing-only formation control law is developed to achieve the desired maneuvering formation using relative bearings in the local reference frame, wherein a finite-time differentiator is incorporated to remove the need for bearing-rate measurements. The convergence is rigorously proven through mathematical derivations. Both comparative simulations and real-world experiments are conducted to validate the effectiveness of the proposed control scheme.
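A minimal sketch of the classic bearing-only formation term this line of work builds on, expressed entirely in the agent's own frame: steer through the projection of the desired bearing onto the orthogonal complement of the measured one. The gain and the 2-D setting are assumptions for illustration.

```python
import numpy as np

def bearing_control(p_i, p_j, g_des, kp=1.0):
    """Velocity command for agent i from one neighbor's relative position."""
    g = (p_j - p_i) / np.linalg.norm(p_j - p_i)  # measured unit bearing
    P = np.eye(2) - np.outer(g, g)               # projector orthogonal to g
    return -kp * P @ g_des                        # drives g toward g_des

print(bearing_control(np.zeros(2), np.array([1.0, 1.0]),
                      g_des=np.array([1.0, 0.0])))
```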
|
| |
| 15:00-16:30, Paper ThI2I.364 | Add to My Program |
| Interactive Robotic Moving Cable Segmentation by Motion Correlation |
|
| Holesovsky, Ondrej | Czech Technical University in Prague |
| Skoviera, Radoslav | Czech Technical University in Prague |
| Hlavac, Vaclav | Czech Technical University in Prague |
Keywords: Object Detection, Segmentation and Categorization, Perception for Grasping and Manipulation, Data Sets for Robotic Vision
Abstract: Manipulating tangled hoses, cables, or ropes can be challenging for both robots and humans. Humans often approach these perceptually demanding tasks by pushing or pulling tangled cables and observing the resulting motions. We follow a similar idea to aid robotic cable manipulation. In this letter, we integrate visual and proprioceptive perception to segment a grasped cable by moving it, even when the robot or the grasped cable sometimes perturbs neighboring cables. We formulate the cable interactive segmentation problem in such a way that our methods do not require robot arm segmentation masks. Furthermore, a novel grasp sampling method can propose new cable grasp points given a partial cable segmentation to improve the segmentation via additional cable-robot interaction. We evaluate the proposed motion correlation (MCor) method on data sequences recorded by our physical robotic setup and show that the method outperforms an earlier motion segmentation (MSeg) baseline.
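A toy illustration of the motion-correlation idea: pixels whose flow time series correlates with the commanded end-effector motion are labeled as the grasped cable. The synthetic flow field and threshold below are assumptions, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 60, 4, 4
ee_motion = np.sin(np.linspace(0, 4 * np.pi, T))   # proprioceptive signal
flow = 0.1 * rng.standard_normal((T, H, W))        # background flow noise
flow[:, 1, 2] += ee_motion                         # one "cable" pixel follows the arm

# Correlate each pixel's flow history with the commanded motion.
corr = np.array([[abs(np.corrcoef(flow[:, i, j], ee_motion)[0, 1])
                  for j in range(W)] for i in range(H)])
print((corr > 0.8).astype(int))                    # binary cable segmentation mask
```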
|
| |
| 15:00-16:30, Paper ThI2I.365 | Add to My Program |
| Sampling-Aware Multi-Rate Combined Control for an Orbital Manipulator |
|
| Vijayan, Ria | German Aerospace Center (DLR) |
| De Stefano, Marco | German Aerospace Center (DLR) |
| Ott, Christian | TU Wien |
Keywords: Space Robotics and Automation, Motion Control, Discrete Event Dynamic Automation Systems
Abstract: In on-orbit servicing missions using robotic manipulators, certain challenging scenarios require combined control, i.e., actuation of both the spacecraft and the manipulator, to meet mission requirements. The low control frequency of the spacecraft compared to that of the manipulator can compromise the stability margin of the combined control. In this paper, we first design a combined control strategy that carefully decouples the high-rate manipulator control from the spacecraft's low-rate control. Second, we design a novel discrete controller accounting for the first-order effects of the servicer's low sampling rate. This is realized by augmenting a classical proportional-derivative (PD) control scheme. The operational bounds of the discrete controller are first benchmarked on a one-DoF system and further investigated for performance using a multi-DoF orbital manipulator. The results shed light on the regions of enhanced performance in terms of stability and impulse utilization as a measure of efficiency. Simulations and hardware-in-the-loop (HIL) experiments are performed to validate the proposed method.
|
| |
| 15:00-16:30, Paper ThI2I.366 | Add to My Program |
| Structure-Preserving Model Order Reduction of Slender Soft Robots Via Autoencoder-Parameterized Strain |
|
| Alkayas, Abdulaziz Y. | MBZUAI |
| Mathew, Anup Teejo | Khalifa University |
| Feliu, Daniel | Delft University of Technology (TU Delft) |
| Zweiri, Yahya | Khalifa University |
| George Thuruthel, Thomas | University College London |
| Renda, Federico | Khalifa University of Science and Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Deep Learning Methods, Dynamics
Abstract: While soft robots offer advantages in adaptability and safe interaction, their modeling remains challenging. This letter presents a novel, data-driven approach for model order reduction of slender soft robots using autoencoder-parameterized strain within the Geometric Variable Strain (GVS) framework. We employ autoencoders (AEs) to learn low-dimensional strain parameterizations from data to construct reduced-order models (ROMs), preserving the Lagrangian structure of the system while significantly reducing the degrees of freedom. Our comparative analysis demonstrates that AE-based ROMs consistently outperform proper orthogonal decomposition (POD) approaches, achieving lower errors for equivalent degrees of freedom across multiple test cases. Additionally, we demonstrate that our proposed approach achieves computational speed-ups over the high-order models (HOMs) in all cases, and outperforms the POD-based ROM in scenarios where accuracy is matched. We highlight the intrinsic dimensionality discovery capabilities of autoencoders, revealing that HOMs often operate on lower-dimensional nonlinear manifolds. Through both simulation and experimental validation on a cable-actuated soft manipulator, we demonstrate the effectiveness of our approach, achieving near-identical behavior with just a single degree of freedom. This structure-preserving method offers significant reductions in the system degrees of freedom and computational effort while maintaining physical model interpretability, offering a promising direction for soft robot modeling and control.
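A minimal PyTorch autoencoder of the kind used to parameterize the strain field with a few latent coordinates; the 60-D discretized strain, layer sizes, and single latent DoF are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StrainAE(nn.Module):
    """Maps a discretized strain field to a low-dimensional latent coordinate."""
    def __init__(self, n_strain=60, n_latent=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_strain, 32), nn.Tanh(),
                                 nn.Linear(32, n_latent))
        self.dec = nn.Sequential(nn.Linear(n_latent, 32), nn.Tanh(),
                                 nn.Linear(32, n_strain))

    def forward(self, q):
        z = self.enc(q)            # reduced coordinates used by the ROM
        return self.dec(z), z

recon, z = StrainAE()(torch.randn(8, 60))
print(recon.shape, z.shape)       # torch.Size([8, 60]) torch.Size([8, 1])
```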
|
| |
| 15:00-16:30, Paper ThI2I.367 | Add to My Program |
| ColorMap-VIO: A Drift-Free Visual-Inertial Odometry in a Prior Colored Point Cloud Map |
|
| Xu, Jie | Nanyang Technological University |
| Zhang, Xuanxuan | Wuhan University |
| Ma, Yongxin | Shandong University |
| Li, Yixuan | Xi'an Jiaotong University |
| Wang, Linji | George Mason University |
| Xu, Xinhang | Nanyang Technological University |
| Yuan, Shenghai | Nanyang Technological University |
| Xie, Lihua | Nanyang Technological University |
| |
| 15:00-16:30, Paper ThI2I.368 | Add to My Program |
| K-ARC: Adaptive Robot Coordination for Multi-Robot Kinodynamic Planning |
|
| Qin, Mike | University of Illinois Urbana-Champaign |
| Solis Vidana, Juan Irving | University of Illinois Urbana-Champaign |
| Motes, James | University of Illinois Urbana-Champaign |
| Morales, Marco | University of Illinois Urbana-Champaign & Instituto Tecnológico Autónomo de México |
| Amato, Nancy | University of Illinois Urbana-Champaign |
| |
| 15:00-16:30, Paper ThI2I.369 | Add to My Program |
| Adaptive Legged Locomotion Via Online Learning for Model Predictive Control |
|
| Zhou, Hongyu | University of Michigan |
| Zhang, Xiaoyu | Georgia Institute of Technology |
| Tzoumas, Vasileios | University of Michigan, Ann Arbor |
Keywords: Legged Robots, Model Learning for Control, Robust/Adaptive Control
Abstract: We provide an algorithm for adaptive legged locomotion via online learning and model predictive control. The algorithm is composed of two interacting modules: model predictive control (MPC) and online learning of residual dynamics. The residual dynamics can represent modeling errors and external disturbances. We are motivated by the future of autonomy where quadrupeds will autonomously perform complex tasks despite real-world unknown uncertainty, such as unknown payloads and uneven terrains. The algorithm uses random Fourier features to approximate the residual dynamics in reproducing kernel Hilbert spaces. Then, it employs MPC based on the current learned model of the residual dynamics. The model is updated online in a self-supervised manner using least squares based on the data collected while controlling the quadruped. The algorithm enjoys sublinear dynamic regret, defined as the suboptimality against an optimal clairvoyant controller that knows the residual dynamics. We validate our algorithm in Gazebo and MuJoCo simulations, where the quadruped aims to track reference trajectories. The Gazebo simulations include constant unknown external forces up to 12g, where g is the gravity vector, in flat terrain, slope terrain with 20-degree inclination, and rough terrain with 0.25 m height variation. The MuJoCo simulations include time-varying unknown disturbances with payloads up to 8 kg and time-varying ground friction coefficients in flat terrain.
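A sketch of the residual-dynamics learner described above: random Fourier features approximating a function in an RKHS, fit by regularized least squares on streaming data. The feature count, kernel bandwidth, and toy residual are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 200                        # state dimension, number of Fourier features
W = rng.normal(scale=1.0, size=(D, d))
b = rng.uniform(0, 2 * np.pi, D)

def phi(x):
    """Random Fourier feature map approximating a Gaussian kernel."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

# Online ridge regression via accumulated normal equations.
A = 1e-3 * np.eye(D)
rhs = np.zeros(D)
for _ in range(500):
    x = rng.normal(size=d)
    r = 0.5 * np.sin(x[0]) - 0.2 * x[1]   # unknown residual dynamics (toy)
    f = phi(x)
    A += np.outer(f, f)
    rhs += f * r
theta = np.linalg.solve(A, rhs)

x_test = rng.normal(size=d)
print(phi(x_test) @ theta,                 # learned residual at a test state
      0.5 * np.sin(x_test[0]) - 0.2 * x_test[1])  # ground-truth residual
```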
|
| |
| 15:00-16:30, Paper ThI2I.370 | Add to My Program |
| Human Perception in Social Tasks: A Comparative Evaluation of Autonomous and Teleoperated Robots |
|
| Hriscu, Lavinia | CSIC-UPC |
| Sanfeliu, Alberto | Universitat Politècnica De Cataluyna |
| Garrell, Anais | UPC-CSIC |
Keywords: Social HRI, Natural Dialog for HRI, Telerobotics and Teleoperation
Abstract: Robots and artificial intelligence technologies are becoming increasingly integrated into our daily lives. The introduction of humanoid robots into everyday settings is a gradual but ongoing process, one that society is already beginning to navigate. Yet this shift raises important questions: Who or what is truly behind these physical agents? And can we, as users, perceive differences in our interactions depending on whether a robot acts autonomously or is teleoperated by a human? In this study, we present the results of an experiment in which participants interacted with a robot under two control conditions, autonomous and teleoperated, while it performed two distinct tasks in both static and dynamic movement scenarios. In our results, human operators outperformed autonomous systems in tasks requiring spatial awareness and contextual reasoning. Conversely, the autonomous robot, powered by a Large Language Model and operating without visual input, was perceived more favorably in tasks that demanded rapid access to broad and diverse information.
|
| |
| 15:00-16:30, Paper ThI2I.371 | Add to My Program |
| Tacser and Action-Conditioned Latent Filter for Generalizable Robotic Surface Perception |
|
| Dutta, Anirvan | Imperial College London |
| Massalim, Yerkebulan | Nazarbayev University |
| Burdet, Etienne | Imperial College London |
| Kaboli, Mohsen | Eindhoven University of Technology ( TU/e) & BMW Group Research |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Representation Learning
Abstract: Perceiving the physical properties of different surfaces/textures via tactile sensing has been a long-standing problem in robotics. Most prior work has been limited to discriminative models that classify textures into a fixed set of categories. However, to enable seamless autonomous manipulation, robots must infer physical properties as structured, continuous variables rather than as discrete class labels. In this work, we present a novel deep state-space model (DSSM) to learn and infer key causal textural properties in an unsupervised manner. Using variational inference to solve the DSSM, our proposed Latent Filter allows robotic systems to perceive textures in a continuous and generalizable manner. In addition, we explore a novel interaction approach: Tacser (Tactile Enhancer), to further enhance tactile sensing through vibrations induced by high-frequency micro-movements and thereby improve perception. We evaluated our approach against state-of-the-art techniques and performed extensive ablation studies to demonstrate its effectiveness. This work advances tactile-based texture perception, providing a generalizable and comprehensive framework for robotics.
|
| |
| 15:00-16:30, Paper ThI2I.372 | Add to My Program |
| Robustness-Guaranteed Observer-Based Control Strategy with Modularity for Cleantech EMLA-Driven Heavy-Duty Robotic Manipulator (I) |
|
| Shahna, Mehdi Heydari | Tampere University |
| Bahari, Mohammad | Tampere University |
| Mattila, Jouni | Tampere University |
Keywords: Robust/Adaptive Control, Motion Control, Actuation and Joint Mechanisms
Abstract: This paper introduces an innovative observer-based modular control strategy in a class of n_a-degree-of-freedom (DoF) fully electrified heavy-duty robotic manipulators (HDRMs) to (1) guarantee robustness in the presence of uncertainties and disturbances, (2) address the complexities arising from several interacting mechanisms, (3) ensure uniformly exponential stability, and (4) enhance overall control performance. To begin, the dynamic model of HDRM actuation systems, which exploits the synergy between cleantech electromechanical linear actuators (EMLAs) and permanent magnet synchronous motors (PMSMs), is investigated. In addition, the reference trajectories of each joint are computed based on direct collocation with B-spline curves to extract the key kinematic and dynamic quantities of HDRMs. To guarantee robust tracking of the computed trajectories by the actual motion states, a novel control methodology, called robust subsystem-based adaptive (RSBA) control, is enhanced through an adaptive state observer. The RSBA control addresses inaccuracies inherent in motion, including modeling errors, non-triangular uncertainties, and both torque and voltage disturbances, to which the EMLA-driven HDRM is susceptible. The proposed RSBA control performance is validated through simulations and experiments of the scrutinized PMSM-powered EMLA-actuated mechanisms.
|
| |
| 15:00-16:30, Paper ThI2I.373 | Add to My Program |
| Design of a Passive Gravity Compensation Mechanism for Wearable Bilateral Lower Limb Exoskeletons |
|
| Chen, Tongshu | Southeast University |
| Shi, Ke | Southeast University |
| Zhang, Maozeng | Southeast University |
| Song, Aiguo | Southeast University |
Keywords: Mechanism Design, Wearable Robotics, Prosthetics and Exoskeletons
Abstract: Gravity compensation has been widely applied to lower-limb exoskeletons to reduce leg loading and alleviate muscle fatigue. Passive compensation offers inherent safety and lightweight advantages; however, existing designs are often limited by their bulk and by the complexity of spring-based mechanisms, which compromise the compactness and reliability of the exoskeleton. To address these limitations, this work presents a passive gravity compensation mechanism for wearable bilateral lower-limb exoskeletons that is compact, simple in design, and robust. A synthetic center-of-mass mapping method is first introduced and then extended to a bilateral configuration through a differential structure. The synthetic center of mass is compensated by an adaptive constant-force mechanism, so the final system maintains structural simplicity and spatial efficiency, making it suitable for wearable applications with limited integration space. An exoskeleton prototype was built based on the proposed mechanism, and its performance was validated through experiments.
|
| |
| 15:00-16:30, Paper ThI2I.374 | Add to My Program |
| RUMI: Rummaging Using Mutual Information |
|
| Zhong, Sheng | University of Michigan |
| Fazeli, Nima | University of Michigan |
| Berenson, Dmitry | University of Michigan |
Keywords: Perception for Grasping and Manipulation, Probability and Statistical Methods, Motion and Path Planning, Interactive Perception
Abstract: This paper presents Rummaging Using Mutual Information (RUMI), a method for online generation of robot action sequences to gather information about the pose of a known movable object in visually-occluded environments. Focusing on contact-rich rummaging, our approach leverages the mutual information between the object pose distribution and the robot trajectory for interactive perception. From an observed partial point cloud, RUMI deduces the compatible object pose distribution and approximates its mutual information with workspace occupancy in real time. Based on this, we develop an information gain cost function and a reachability cost function to keep the object within the robot's reach. These are integrated into a model predictive control (MPC) framework with a stochastic dynamics model, updating the pose distribution in a closed loop. Key contributions include a new belief framework for object pose estimation, an efficient information gain computation strategy, and a robust MPC-based control scheme. RUMI demonstrates superior performance in both simulated and real tasks compared to baseline methods.
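A toy computation of the quantity RUMI maximizes: mutual information between a discrete object-pose distribution and per-cell workspace occupancy. The two-pose, three-cell example is contrived for illustration and is not the paper's belief model.

```python
import numpy as np

def mutual_information(p_pose, occ_given_pose):
    """I(pose; occupancy) summed over cells; occupancy is Bernoulli per cell."""
    def H(p):  # binary entropy in nats
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))
    p_occ = p_pose @ occ_given_pose              # marginal occupancy per cell
    return float(np.sum(H(p_occ) - p_pose @ H(occ_given_pose)))

p_pose = np.array([0.5, 0.5])                    # two equally likely object poses
occ = np.array([[1.0, 0.0, 0.0],                 # cells occupied under pose A
                [0.0, 0.0, 1.0]])                # cells occupied under pose B
print(mutual_information(p_pose, occ))            # ~1.386: two cells each add ln 2
```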
|
| |
| 15:00-16:30, Paper ThI2I.375 | Add to My Program |
| Stability Analysis of a Dual-Rate Haptic System: A New Closed-Form Solution |
|
| Mashayekhi, Ahmad | Vrije Universiteit Brussel |
| Akhif, Oumaima | Vrije Universiteit Brussel |
| Khorasani, Amin | Vrije Universiteit Brussel |
| Verstraten, Tom | Vrije Universiteit Brussel |
Keywords: Haptics and Haptic Interfaces, Virtual Reality and Interfaces, Force Control
Abstract: Haptic devices (HDs) play a vital role in simulating the sense of touch in various virtual environments (VEs). Ensuring stable interaction between the HD and the VE is critical, particularly when simulating stiff virtual objects. One approach to enhancing stability is increasing the sampling rate; however, excessively high rates can compromise velocity information, thereby reducing damping stability. Dual-rate haptic devices address this issue by sampling position at higher rates and velocity at lower rates. This paper presents a novel closed-form equation for predicting the stability boundary of a dual-rate HD without restrictions on time delay or virtual damping. The proposed equation, which depends on the physical parameters of the HD and VE, sampling times, and time delay, is validated through simulations and experiments.
|
| |
| 15:00-16:30, Paper ThI2I.376 | Add to My Program |
| OKVIS2-X: Open Keyframe-Based Visual-Inertial SLAM Configurable with Dense Depth or LiDAR, and GNSS |
|
| Boche, Simon | Technical University of Munich |
| Jung, Jaehyung | Technical University of Munich |
| Barbas Laina, Sebastián | Technical University of Munich |
| Leutenegger, Stefan | ETH Zurich |
Keywords: SLAM, Mapping, Localization, Sensor Fusion
Abstract: To empower mobile robots with usable maps as well as the highest state estimation accuracy and robustness, we present OKVIS2-X: a state-of-the-art multi-sensor Simultaneous Localization and Mapping (SLAM) system building dense volumetric occupancy maps, while scalable to large environments and operating in real time. Our unified SLAM framework seamlessly integrates different sensor modalities: visual, inertial, measured or learned depth, LiDAR, and Global Navigation Satellite System (GNSS) measurements. Unlike most state-of-the-art SLAM systems, we advocate using dense volumetric map representations when leveraging depth or range-sensing capabilities. We employ an efficient submapping strategy that allows our system to scale to large environments, showcased in sequences of up to 9 kilometers. OKVIS2-X enhances its accuracy and robustness by tightly coupling the estimator and submaps through map alignment factors. Our system provides globally consistent maps, directly usable for autonomous navigation. To further improve the accuracy of OKVIS2-X, we also incorporate the option of performing online calibration of camera extrinsics. Our system achieves the highest trajectory accuracy on EuRoC against state-of-the-art alternatives, outperforms all competitors in the Hilti22 VI-only benchmark while also proving competitive in the LiDAR version, and showcases state-of-the-art accuracy on the diverse and large-scale sequences of the VBR dataset.
|
| |
| 15:00-16:30, Paper ThI2I.377 | Add to My Program |
| Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation |
|
| Le, Huy | Karlsruhe Institute of Technology, Bosch Center for Artificial Intelligence |
| Hoang, Tai | Karlsruhe Institute of Technology |
| Gabriel, Miroslav | Bosch Center for Artificial Intelligence |
| Neumann, Gerhard | Karlsruhe Institute of Technology |
| Anh Vien, Ngo | Bosch GmbH |
|
|
| |
| 15:00-16:30, Paper ThI2I.378 | Add to My Program |
| Learning Quadrupedal Locomotion for a Heavy Hydraulic Robot Using an Actuator Model |
|
| Lee, Minho | Korea Advanced Institute of Science and Technology |
| Kim, Hyeonseok | Hyundai-Rotem |
| Kim, Jin Tak | KITECH(Korea Institute of Industrial Technology), |
| Park, Sangshin | Korea Institute of Industrial Technology |
| Lee, Jeong Hyun | Korea Advanced Institute of Science & Technology (KAIST) |
| Cho, Jungsan | KITECH(Korea Institute of Industrial Technology) |
| Hwangbo, Jemin | Korean Advanced Institute of Science and Technology |
Keywords: Hydraulic/Pneumatic Actuators, Legged Robots, Reinforcement Learning
Abstract: The simulation-to-reality (sim-to-real) transfer of large-scale hydraulic robots presents a significant challenge in robotics because of their inherently slow control response and complex fluid dynamics. The complex dynamics arise from the structure of multiple interconnected cylinders and the differences in their fluid flow rates. These characteristics complicate detailed simulation of all joints, making it unsuitable for reinforcement learning (RL) applications. In this work, we propose an analytical actuator model driven by hydraulic dynamics to represent the complicated actuators. The model predicts joint torques for all 12 actuators in under 1 microsecond, allowing rapid processing in RL environments. We compare our model with neural-network-based actuator models and demonstrate the advantages of our model in data-limited scenarios. The locomotion policy trained in RL with our model is deployed on a hydraulic quadruped robot weighing over 300 kg. This work is the first demonstration of a successful transfer of stable and robust command-tracking locomotion with RL on a heavy hydraulic quadruped robot, demonstrating advanced sim-to-real transferability.
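An illustrative stand-in (not the paper's model) for an analytical hydraulic actuator map of this flavor: joint torque computed from chamber pressures via piston areas and an angle-dependent moment arm, cheap enough to evaluate inside an RL training loop. All parameters and the linkage geometry are assumptions.

```python
import numpy as np

def joint_torque(theta, p_a, p_b, A_a=1.2e-3, A_b=0.9e-3, r0=0.06):
    """Closed-form torque from cylinder chamber pressures [Pa]."""
    moment_arm = r0 * np.cos(theta)   # simplified linkage geometry [m]
    f_cyl = p_a * A_a - p_b * A_b     # net cylinder force [N]
    return f_cyl * moment_arm         # joint torque [Nm]

print(joint_torque(theta=0.2, p_a=8e6, p_b=2e6))
```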
|
| |
| 15:00-16:30, Paper ThI2I.379 | Add to My Program |
| ATOM: A Tendon-Driven Aerial Manipulator Achieving High Stiffness, High Torque, and Low Coupling Disturbance |
|
| Xu, Quman | Harbin Institute of Technology |
| Li, Zhan | Harbin Institute of Technology |
| Li, Hai | Institute of Systems Engineering, China Academy of Engineering Physics |
| Yang, Yipeng | Harbin Institute of Technology |
| Yu, Xinghu | Ningbo Institute of Intelligent Equipment Technology Co. Ltd. |
| Chen, Zhang | Tsinghua University |
|
|
| |
| 15:00-16:30, Paper ThI2I.380 | Add to My Program |
| Online Approach to Near Time-Optimal Task-Space Trajectory Planning |
|
| Skuric, Antun | Pollen Robotics |
| Torres Alberto, Nicolas | INRIA Bordeaux Sud-Ouest, Stellantis |
| Joseph, Lucas | INRIA |
| Padois, Vincent | Inria Bordeaux |
| Daney, David | Inria centre at the university of Bordeaux, F-33405 Talence, France |
|
|
| |
| 15:00-16:30, Paper ThI2I.381 | Add to My Program |
| Improving Gecko Adhesive Performance in Robotic Systems through Trajectory Optimization |
|
| Kobo, Dror | Tel Aviv University |
| Gordon, Goren | Tel Aviv University |
| Pinchasik, Bat-El | Tel Aviv University |
Keywords: Grippers and Other End-Effectors, Biomimetics, Motion and Path Planning
Abstract: Gecko-inspired adhesives have attracted considerable attention due to their unique combination of strong, yet reversible adhesion to diverse surfaces. However, their integration into robotic systems remains limited due to sensitivity to contact alignment, typically requiring near-perpendicular engagement. Yet, many robotic tasks involve varying approach and detachment angles, highlighting the need for adhesion systems that operate reliably across different orientations and loading conditions. This study addresses two key questions: Can the adhesion strength of gecko-inspired adhesives, integrated into robotic systems, be optimized using trajectory optimization? And is this optimization surface-dependent? A gecko-inspired adhesive was integrated on a robotic arm's end-effector, which attached to and detached from surfaces along various trajectories. The arm's energy expenditure for each attachment-detachment cycle, along with the corresponding adhesion strength, was measured. An online particle swarm optimization (PSO) algorithm was applied to identify conditions that optimized adhesion strength, either to resist or to ease detachment. Results show that trajectory optimization significantly improves both adhesion strength and detachment efficiency, by up to 17-fold, with surface-specific effectiveness. These findings underscore the importance of considering both the forces generated by gecko-inspired adhesives and the energy required by the robot to attach and detach from surfaces at various angles and positions. By optimizing adhesion strength across surfaces, this study helps overcome current limitations in the use of gecko-inspired adhesives for robotic applications, including grippers and climbers.
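A minimal PSO loop of the kind used to tune attachment/detachment trajectory parameters; the 2-D quadratic objective stands in for the measured (negated) adhesion strength and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim, iters = 20, 2, 50
x = rng.uniform(-1, 1, (n, dim))   # e.g. normalized approach angle, peel speed
v = np.zeros((n, dim))

def cost(p):
    """Stand-in for -adhesion_strength measured on the hardware."""
    return (p[:, 0] - 0.3) ** 2 + (p[:, 1] + 0.5) ** 2

pbest, pbest_c = x.copy(), cost(x)
gbest = pbest[pbest_c.argmin()]
for _ in range(iters):
    r1, r2 = rng.random((2, n, dim))
    # Inertia + cognitive + social velocity update.
    v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
    x = np.clip(x + v, -1, 1)
    c = cost(x)
    better = c < pbest_c
    pbest[better], pbest_c[better] = x[better], c[better]
    gbest = pbest[pbest_c.argmin()]
print(gbest)  # converges near the optimum (0.3, -0.5)
```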
|
| |
| 15:00-16:30, Paper ThI2I.382 | Add to My Program |
| DiffRP: Diffusion-Driven Promising Region Prediction for Sampling-Based Path Planning |
|
| Xie, Zongwu | Harbin Institute of Technology |
| Ji, Yiming | Harbin Instititute of Technology |
| Liu, Yang | Harbin Institute of Technology |
| Xie, Yiqian | MinJiang University |
| Wang, Zhengpu | Harbin Institute of Technology |
| Ma, Boyu | Nanyang Technological University |
| Cao, Baoshi | Harbin Institute of Technology |
Keywords: AI-Based Methods, Motion and Path Planning
Abstract: Utilizing neural networks to predict potential regions containing optimal paths in advance and subsequently biasing the sampling probability towards these promising regions has been proven to effectively enhance the path planning efficiency of sampling-based algorithms. In complex scenarios, uniform sampling often leads to prolonged computation time, whereas the biased information provided by promising regions can guide the algorithm to reduce sampling in irrelevant areas, thereby significantly shortening the computation time. Undoubtedly, the accuracy of the promising regions is of paramount importance. Currently, the generalizability of many CNN- or Transformer-based promising region prediction models remains limited, often performing poorly in unknown environments. Incorrect region predictions may reduce the planning efficiency, sometimes even underperforming uniform sampling. This work aims to leverage diffusion models to generate more accurate promising regions, referred to as the DiffRP (Diffusion-based Region Prediction) model, thereby designing a non-uniform sampler to improve sampling efficiency and reduce computation time. We propose three paradigms for generating promising regions using diffusion models, among which we innovatively introduce a biased noise initialization method for the diffusion process. Specifically, we bias the mean of the noise distribution using obstacle maps and design a map-conditioned denoising model to progressively generate accurate promising regions from the biased noise. Experiments on public datasets demonstrate that our proposed DiffRP method outperforms existing state-of-the-art (SOTA) models by 30% in promising region prediction accuracy. Moreover, the non-uniform sampling alg
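A sketch of the biased noise initialization introduced above: reverse diffusion starts from noise whose mean is shifted by the obstacle map rather than from N(0, I). The shift scale and map encoding are assumptions for illustration.

```python
import numpy as np

def biased_init(obstacle_map, scale=1.0, rng=None):
    """obstacle_map: HxW array with 1 = obstacle, 0 = free. Returns initial noise x_T."""
    rng = rng or np.random.default_rng(0)
    bias = scale * (0.5 - obstacle_map)   # higher mean in free space, lower on obstacles
    return bias + rng.standard_normal(obstacle_map.shape)

grid = np.zeros((8, 8)); grid[2:6, 4] = 1.0   # a wall of obstacle cells
x_T = biased_init(grid)
print(x_T[grid == 0].mean(), x_T[grid == 1].mean())  # free cells higher on average
```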
|
| |
| 15:00-16:30, Paper ThI2I.383 | Add to My Program |
| DexSinGrasp: Learning a Unified Policy for Dexterous Object Singulation and Grasping in Densely Cluttered Environments |
|
| Xu, Lixin | National University of Singapore |
| Liu, Zixuan | National University of Singapore |
| Gui, Zhewei | National University of Singapore |
| Guo, Jingxiang | National University of Singapore |
| Jiang, Zeyu | National University of Singapore |
| Zhang, Tongzhou | National University of Singapore |
| Xu, Zhixuan | National University of Singapore |
| Gao, Chongkai | National University of Singapore |
| Shao, Lin | National University of Singapore |
Keywords: Grasping, Dexterous Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Grasping objects in cluttered environments remains a fundamental yet challenging problem in robotic manipulation. While prior works have explored learning-based synergies between pushing and grasping for two-fingered grippers, few have leveraged the high degrees of freedom (DoF) in dexterous hands to perform efficient singulation for grasping in cluttered settings. In this work, we introduce DexSinGrasp, a unified policy for dexterous object singulation and grasping. DexSinGrasp enables high-dexterity object singulation to facilitate grasping, significantly improving efficiency and effectiveness in cluttered environments. We incorporate clutter arrangement curriculum learning to enhance success rates and generalization across diverse clutter conditions, while policy distillation enables a deployable vision-based grasping strategy. To evaluate our approach, we introduce a set of cluttered grasping tasks with varying object arrangements and occlusion levels. Experimental results show that our method outperforms baselines in both efficiency and grasping success rate, particularly in dense clutter. Codes, appendix, and videos are available on our website https://nus-lins-lab.github.io/dexsingweb/.
|
| |
| 15:00-16:30, Paper ThI2I.384 | Add to My Program |
| Hybrid Dynamics Modeling and Trajectory Planning for Cable-Trailer with Quadruped Robot System |
|
| Zhang, Wentao | Huazhong University of Science and Technology |
| Xu, Shaohang | City University of Hong Kong |
| Zuo, Gewei | Huazhong University of Science and Technology |
| Li, Bolin | Huazhong University of Science and Technology |
| Wang, Jingbo | Shanghai AI Lab |
| Zhu, Lijun | Huazhong University of Science and Technology |
Keywords: Nonholonomic Motion Planning, Collision Avoidance, Legged Robots
Abstract: Inspired by sled-pulling dogs in transportation, we present a cable-trailer system integrated with a quadruped robot. The motion planning of this system faces challenges due to the interactions between the cable's state transitions, the trailer's nonholonomic constraints, and the system's underactuation. To address these, we develop a hybrid dynamics model that captures the cable's taut and slack states. A search algorithm is then introduced to compute a suboptimal trajectory while incorporating mode transitions. Additionally, we propose a novel collision avoidance constraint based on geometric polygons to formulate the trajectory optimization problem for the hybrid system. The proposed method is implemented on a Unitree A1 quadruped robot with a customized cable-trailer and validated through experiments. The real system demonstrates both agile and safe motion with cable mode transitions.
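A toy version of the hybrid cable model: force is transmitted only in the taut mode, while the slack mode transmits nothing. The stiffness and rest length are assumptions, not identified parameters.

```python
import numpy as np

def cable_force(p_robot, p_trailer, rest_len=1.0, k=400.0):
    """Planar cable force on the trailer; zero when slack, elastic when taut."""
    d = p_robot - p_trailer
    dist = np.linalg.norm(d)
    if dist < rest_len:                        # slack mode: no force transmitted
        return np.zeros(2)
    return k * (dist - rest_len) * d / dist    # taut mode: elastic pull along the cable

print(cable_force(np.array([1.2, 0.0]), np.zeros(2)))  # taut
print(cable_force(np.array([0.8, 0.0]), np.zeros(2)))  # slack
```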
|
| |
| 15:00-16:30, Paper ThI2I.385 | Add to My Program |
| A Near-Time-Optimal Trajectory Planning under Torque and Jerk Constraints for Industrial Robots on Fixed Paths (I) |
|
| Zhao, Shize | Harbin Institute of Technology |
| Zheng, Tianjiao | Harbin Institute of Technology |
| Wang, Chengzhi | Harbin Institute of Technology |
| Zhu, Yanhe | Harbin Institute of Technology |
| Zhao, Jie | Harbin Institute of Technology |
Keywords: Industrial Robots, Constrained Motion Planning, Integrated Planning and Control
Abstract: Trajectory planning plays a pivotal role in robotic motion planning, particularly in achieving time-optimal motion under complex dynamic constraints. Although the Time-Optimal Path Parameterization (TOPP) algorithm effectively addresses trajectory generation under joint torque constraints, classical methods often overlook third-order constraints. As a result, the generated trajectories, while torque-feasible, exhibit excessive jerk and poor dynamic stability, which limits their practical applicability. To overcome these limitations, this paper proposes a trajectory planning framework that simultaneously enforces torque and jerk constraints. Building upon torque-constrained TOPP, the method integrates a shooting-based strategy to identify switching points through bidirectional integration under jerk constraints and employs a Sigmoid-based fusion scheme to eliminate integration errors and ensure smooth transitions. The proposed approach is experimentally validated on a six-degree-of-freedom industrial robot. Comparative evaluations with the TOPP-RA algorithm demonstrate that the method significantly reduces both high-frequency vibrations during high-speed execution and residual oscillations after motion termination. Feedback from torque rate measurements, vibration sensors, and laser tracker data confirms faster settling and improved compliance, making the approach well-suited for complex industrial scenarios.
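A sketch of the sigmoid-based fusion step described above: blending forward- and backward-integrated velocity profiles around a switching point to obtain a smooth transition. The toy profiles and sharpness parameter are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def fuse_profiles(s, v_fwd, v_bwd, s_sw, beta=40.0):
    """Sigmoid-weighted blend: follow v_fwd before s_sw, v_bwd after."""
    w = 1.0 / (1.0 + np.exp(-beta * (s - s_sw)))   # 0 before, 1 after the switch
    return (1.0 - w) * v_fwd + w * v_bwd

s = np.linspace(0, 1, 11)        # normalized path parameter
v_f = 2.0 * s + 0.5              # toy forward-integration result
v_b = 2.5 - 1.5 * s              # toy backward-integration result
print(np.round(fuse_profiles(s, v_f, v_b, s_sw=0.5), 3))
```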
|
| |
| 15:00-16:30, Paper ThI2I.387 | Add to My Program |
| Anticipatory Task and Motion Planning: Improved Rearrangement in Persistent Continuous-Space Environments |
|
| Dhakal, Roshan | George Mason University |
| Nguyen, Duc | George Mason University |
| Silver, Tom | Princeton University |
| Xiao, Xuesu | George Mason University |
| Stein, Gregory | George Mason University |
Keywords: Integrated Planning and Learning, Task and Motion Planning
Abstract: We consider a sequential task and motion planning (TAMP) setting in which a robot is assigned continuous-space rearrangement-style tasks one at a time in an environment that persists between them. Lacking advance knowledge of future tasks, existing (myopic) planning strategies unwittingly introduce side effects that impede completion of subsequent tasks: e.g., by blocking future access or manipulation. We present anticipatory task and motion planning, in which estimates of expected future cost from a learned model inform the selection of plans generated by a model-based TAMP planner so as to avoid such side effects, choosing configurations of the environment that both complete the task and reduce overall cost. Simulated many-task deployments in navigation-among-movable-obstacles and cabinet-loading domains yield average per-task cost improvements of 32.7% and 16.7%, respectively. When given time in advance to prepare the environment, our learning-augmented planning approach yields improvements of 83.1% and 22.3%. Finally, we also demonstrate anticipatory TAMP on a real-world Fetch mobile manipulator.
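A minimal sketch of the anticipatory selection rule: among candidate plans that complete the current task, choose the one minimizing immediate cost plus a learned estimate of expected future cost. The candidate plans and cost values below are hypothetical.

```python
def select_plan(candidate_plans, future_cost_model):
    """Pick the plan minimizing immediate cost + estimated future cost."""
    return min(candidate_plans,
               key=lambda p: p["cost"] + future_cost_model(p["end_state"]))

# Hypothetical candidates: plan B costs more now but leaves the doorway clear.
plans = [{"cost": 4.0, "end_state": "blocks_doorway"},
         {"cost": 5.0, "end_state": "doorway_clear"}]
est = {"blocks_doorway": 6.0, "doorway_clear": 1.5}.get  # stand-in learned model
print(select_plan(plans, est)["end_state"])  # -> doorway_clear
```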
|
| |
| 15:00-16:30, Paper ThI2I.388 | Add to My Program |
| Minimum-Length Coverage Path Planning for Grid Environments with Approximation Guarantees |
|
| Ramesh, Megnath | University of Waterloo |
| Imeson, Frank | Avidbots |
| Fidan, Baris | University of Waterloo |
| Smith, Stephen L. | University of Waterloo |
Keywords: Motion and Path Planning, Service Robotics, Optimization and Optimal Control
Abstract: We focus on planning minimum-length robot paths to cover environments using the robot's sensor or coverage (e.g. cleaning) tool. Many algorithms use the following framework: (i) compute a grid decomposition of the environment, (ii) partition the grid to be covered by non-overlapping coverage lines (straight-line paths), and (iii) compute a cost-minimizing tour of the coverage lines to get a coverage path. While this framework aims to minimize turns in the path, it does not yield guarantees on the resulting path length. In this paper, we show that this framework guarantees a coverage path of length (1 + 1.5 gamma) times the optimal, where gamma > 1 is the approximation factor to solve the metric traveling salesman problem (metric-TSP). Following this, we propose the Minimum Length Coverage Approx (MLC-Approx) approach that modifies this framework to achieve an approximation factor of (1.5 + epsilon), where epsilon << 1 depends on the number of coverage lines. Instead of computing a tour of the coverage lines, MLC-Approx merges minimum-length sub-tours of coverage lines while minimizing the turns added by the merges. We also propose a lazy variation of MLC-Approx that achieves the same result with faster empirical runtime. We validate MLC-Approx in simulations using maps of real-world environments and compare against state-of-the-art CPP approaches.
|
| |
| 15:00-16:30, Paper ThI2I.389 | Add to My Program |
| Efficient Alignment of Unconditioned Action Prior for Language-Conditioned Pick and Place in Clutter (I) |
|
| Xu, Kechun | Zhejiang University |
| Xia, Xunlong | Alibaba Cloud |
| Wang, Kaixuan | Zhejiang University |
| Yang, Yifei | Zhejiang University |
| Mao, Yunxuan | Zhejiang University |
| Deng, Bing | Alibaba Cloud |
| Ye, Jieping | Alibaba Cloud |
| Xiong, Rong | Zhejiang University |
| Wang, Yue | Zhejiang University |
Keywords: Deep Learning in Grasping and Manipulation, Grasping, Imitation Learning
Abstract: We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A2, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real-world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions. Videos and codes are available at https://xukechun.github.io/papers/A2.
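The alignment step amounts to a single attention layer. A minimal PyTorch sketch, with all dimensions and token semantics assumed for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class ActionPriorAlignment(nn.Module):
    """One cross-attention layer aligning unconditioned action priors with
    3D vision-language features, in the spirit of A2 (a hedged sketch)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, action_priors, vl_features):
        # Action-prior tokens attend to 3D vision-language tokens.
        aligned, _ = self.attn(query=action_priors, key=vl_features,
                               value=vl_features)
        return aligned

layer = ActionPriorAlignment()
priors = torch.randn(1, 64, 256)      # e.g., action prior tokens (assumed)
vl_feats = torch.randn(1, 1024, 256)  # e.g., 3D vision-language tokens (assumed)
out = layer(priors, vl_feats)
```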
|
| |
| 15:00-16:30, Paper ThI2I.390 | Add to My Program |
| Handling Transitions across Singularities for UR-Like Serial Robots |
|
| Boschi, Ivan | University of Bologna |
| De Toni, Alessandro | Università Di Bologna |
| Di Leva, Roberto | University of Bologna |
| Ida', Edoardo | University of Bologna |
| Carricato, Marco | University of Bologna |
Keywords: Kinematics, Motion and Path Planning
Abstract: The inverse kinematics (IK) of serial manipulators admits multiple solutions, making the selection of the desired one potentially challenging depending on the application at hand. For robots with a non-cuspidal architecture, a suitable strategy is to partition the joint space into independent regions, known as Uniqueness Domains (UDs), which are separated by surfaces defined by singular configurations. Within each UD, a single IK-solution branch corresponds to a specific robot posture. In this case, when an assigned task-space trajectory requires the robot to transition between UDs, a singular configuration must be crossed. Despite the practical importance of this issue, the existing literature lacks straightforward techniques for enabling such transitions. This paper proposes a method that facilitates switching between IK-solution branches when a task-space trajectory requires crossing a singularity, ensuring continuity and differentiability of the joint variables. The proposed method is evaluated against competing ones, both in simulation and experimentally, showing significant advantages.
|
| |
| 15:00-16:30, Paper ThI2I.391 | Add to My Program |
| Recursive and Scalable 3D Coarse to Fine Path Planning |
|
| Lee, Hwajung | Sungkyunkwan University |
| Ko, Daegeol | Sungkyunkwan University |
| Hur, Jaehyuk | Sungkyunkwan University |
| Lee, Junwon | Sungkyunkwan University |
| Ha, Seongbo | Sungkyunkwan University |
| Ko, Jong Hwan | Sungkyunkwan University |
| Yu, Hyeonwoo | SungKyunKwan University |
Keywords: Motion and Path Planning, Vision-Based Navigation, AI-Based Methods
Abstract: Path planning in large-scale, complex 3D environments is fundamentally constrained by a trade-off between path quality and computational speed. This paper presents RUSH (Recursive and Scalable 3D Coarse to Fine Path Planning), a hierarchical framework that resolves this trade-off. RUSH decomposes the long-range planning task into a coarse plan followed by fine-grained, independent subproblems that can be solved in parallel. These subproblems are addressed by a unified, diffusion-based network that refines an initial path estimate by learning its residual relative to an optimal path. This approach allows RUSH to leverage rich geometric information directly from 3D voxel maps without being bottlenecked by the full map’s complexity. We validate our method on large-scale outdoor (KITTI, MulRan) and indoor (HM3D) datasets, each spanning a 200m×200m×6m map. Experimental results demonstrate that RUSH generates feasible, high-quality paths with remarkable efficiency, achieving up to a 12.59× speedup over a hierarchically accelerated A* baseline, while maintaining a path cost within 24% of the optimal solution. This performance gain positions RUSH as a powerful and practical solution for applications requiring rapid global path planning in large-scale 3D maps.
|
| |
| 15:00-16:30, Paper ThI2I.392 | Add to My Program |
| Terramechanics-Based Mobility Failure Compensation and Soil Manipulation (I) |
|
| Pavlov, Catherine | Carnegie Mellon University |
| Rogg, Arno | NASA Ames Research Center |
| Johnson, Aaron M. | Carnegie Mellon University |
Keywords: Space Robotics and Automation, Field Robots, Wheeled Robots
Abstract: In this paper, we enable new mobility and manipulation modes for wheeled planetary exploration rovers through the use of terramechanics modeling and field experiments. Useful modes of wheel-based soil manipulation and examples of rovers driving with degraded mobility systems are first demonstrated in lunar and Martian analog environments. We show a full-scale rover using its wheels to dig trenches up to 10.6 cm deep, dig holes to estimate soil characteristics, and modify terrain to make it accessible to a smaller robot. We also measure the impact of actuator failure on a rover in lunar simulant. Here, we show that slip doubled on moderate slopes for a damaged drive motor, which would exceed the rover's operational limits for slip, motivating the need for driving strategies that mitigate mobility loss. We then develop an optimization framework which uses a recently developed terramechanics model to automatically generate both open- and closed-loop driving strategies for planetary rovers performing terrain manipulation or operating in a degraded state, with no need for hand-tuning of behaviors. Finally, we demonstrate the generated driving strategies for soil manipulation and mobility compensation on a rover in a controlled lab setting, where we show that 1) mobility is maintained while manipulating soil; and 2) mobility is regained while experiencing failure of steer and drive actuators.
|
| |
| 15:00-16:30, Paper ThI2I.393 | Add to My Program |
| Enhancing Motor Synchrony in Rhythmic Dyadic Tasks through Portable Elbow Exoskeletons |
|
| Peperoni, Emanuele | Scuola Superiore Sant'Anna |
| Capitani, Stefano Laszlo | Scuola Superiore Sant'Anna |
| Grazi, Lorenzo | Scuola Superiore Sant'Anna |
| Penna, Michele Francesco | Scuola Superiore Sant'Anna |
| Amato, Lorenzo | Scuola Superiore Sant'Anna |
| Dell'Agnello, Filippo | Scuola Superiore Sant'Anna |
| Baldoni, Andrea | Scuola Superiore Sant'Anna |
| Formica, Domenico | Università Campus Bio-Medico Di Roma |
| Leman, Marc | University of Ghent |
| Vitiello, Nicola | Scuola Superiore Sant'Anna |
| Crea, Simona | Scuola Superiore Sant'Anna |
| Trigili, Emilio | Scuola Superiore Sant'Anna |
Keywords: Wearable Robotics, Physical Human-Robot Interaction, Haptics and Haptic Interfaces
Abstract: Synchrony is a cornerstone for the successful physical interaction between humans while cooperating or competing towards a goal and is achieved by correct and smooth information exchange between subjects. Recently, Human-Robot-Human (HRH) interaction arose as an emerging paradigm for improving motor control in collaborative and dyadic movement tasks. Among the robotic solutions explored for agent coupling, exoskeletons represent powerful tools for exerting torque and force feedback at the joint level. In this work, two identical torque-controlled elbow exoskeletons were used in the context of dyadic interaction, to provide haptic feedback and improve synchrony between two individuals performing a tapping task. Each exoskeleton is lightweight and compact, with a total weight of 0.8 kg on the arm and a volume of 360×80×80 mm³. Bench tests to verify the performance of closed-loop torque control showed a residual torque below 0.2 Nm when the reference torque was set to null, and a bandwidth higher than 6 Hz, thus achieving adequate performance for applications in HRH scenarios. During experiments with human subjects, the root-mean-squared error between the two users’ joint trajectories was 50% lower when users received haptic feedback compared to the condition without feedback; similarly, the relative phase error was lower than 60%. The results of this study suggest that exoskeletons can be used to enhance synchrony in HRH interactions, which could potentially be useful in rehabilitation training, collaborative industrial operations, or sport and music learning.
|
| |
| 15:00-16:30, Paper ThI2I.394 | Add to My Program |
| Optimal Design of Integrated Aerial Platforms with Passive Joints |
|
| Yu, Yushu | Beijing Institute of Technology |
| Wang, Kaidi | Beijing Institute of Technology |
| Meng, Xin | National University of Singapore |
| Du, Jianrui | Beijing Institute of Technology |
| Sun, Jiali | Beijing Institute of Technology |
| Lai, Ganghua | Beijing Institute of Technology |
| Zhang, Yibo | Beijing Institute of Technology |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Methods and Tools for Robot System Design
Abstract: The Integrated Aerial Platform (IAP) uses multiple quadrotor sub-vehicles, acting as independent thrust generators, connected to a central platform via passive joints. This setup allows the sub-vehicles to collectively apply forces and torques to the central platform, achieving full six-degree-of-freedom (6-DoF) motion through coordinated thrust and posture adjustments. The IAP's modular design offers significant advantages in terms of mechanical simplicity, reconfigurability for diverse scenarios, and enhanced mission adaptability. This paper presents a comprehensive framework for IAP modeling and optimal design. We introduce a "design matrix" that encapsulates key architectural parameters, including the number of sub-vehicles, their spatial configuration, and the types of passive joints used. To improve control performance and ensure balanced wrench generation capabilities, we propose an optimized design strategy that minimizes the condition number of this design matrix. Two distinct IAP configurations were optimally designed based on two typical application scenarios. The efficacy of the proposed optimization methodology was subsequently validated through comparative analysis against unoptimized platforms. Moreover, the full actuation capability of the IAP was empirically confirmed via extensive simulations and real-world flight experiments, which also demonstrated its operational performance through direct wrench control experiment.
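The optimization target itself is compact: minimize the condition number of the design matrix over the architectural parameters. A sketch with a stand-in toy matrix; the real design matrix encodes sub-vehicle placement and joint types, which are not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

def condition_number(params, build_design_matrix):
    # Objective: condition number of the candidate design matrix
    # (build_design_matrix is a hypothetical user-supplied model).
    return np.linalg.cond(build_design_matrix(params))

def toy_design_matrix(angles):
    # Stand-in 6x8 allocation-style matrix parameterized by mount angles.
    cols = [np.concatenate([np.array([np.cos(a), np.sin(a), 1.0]),
                            0.1 * np.array([np.sin(a), -np.cos(a), a])])
            for a in angles]
    return np.column_stack(cols)

x0 = np.linspace(0.1, 2.0, 8)
res = minimize(condition_number, x0, args=(toy_design_matrix,),
               method="Nelder-Mead")
```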
|
| |
| 15:00-16:30, Paper ThI2I.395 | Add to My Program |
| Data-Efficient Constrained Robot Learning with Probabilistic Lagrangian Control |
|
| He, Shiming | Hangzhou City University |
| Ding, Yuzhe | Zhejiang University |
Keywords: Probabilistic Inference, Reinforcement Learning, Compliance and Impedance Control
Abstract: We propose a novel framework for data-efficient black-box robot learning under constraints. Our approach integrates probabilistic inference with Lagrangian optimization. With the guide of a learned Gaussian process model, the Lagrange multiplier is controlled by the probability of whether the constraints would be satisfied. This reduces the typical oscillations seen in primal-dual updates and therefore improves both data efficiency and safety during learning. Both synthetic results and robot experiments demonstrate that our method is a scalable and effective solution for constrained robot learning problems.
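One way to picture the multiplier update, assuming the Gaussian process yields a Gaussian predictive distribution over the constraint value g(x) (to be kept at or below zero); the update rule and learning rate below are illustrative, not the paper's exact scheme.

```python
from scipy.stats import norm

def update_multiplier(lmbda, gp_mean, gp_std, lr=1.0):
    # Probability, under the GP posterior, that the constraint is violated.
    p_violation = 1.0 - norm.cdf(0.0, loc=gp_mean, scale=gp_std)
    # Grow the penalty when violation is likely, shrink it when unlikely,
    # which damps the oscillations of plain primal-dual updates.
    return max(0.0, lmbda + lr * (p_violation - 0.5))
```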
|
| |
| 15:00-16:30, Paper ThI2I.396 | Add to My Program |
| Threat-Aware UAV Dodging of Human-Thrown Projectiles with an RGB-D Camera |
|
| Zhang, Yuying | Sun Yat-Sen University |
| Fan, Na | The Hong Kong University of Science and Technology (HKUST) |
| Zheng, Haowen | Sun Yat-Sen University |
| Liang, Junning | Sun Yat-Sen University |
| Pan, Zongliang | Shenzhen ePropulsion Technology Limited |
| Chen, Qifeng | HKUST |
| Lyu, Ximin | Sun Yat-Sen University |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Applications, Aerial Systems: Mechanics and Control
Abstract: Unmanned aerial vehicles (UAVs) performing tasks such as transportation and aerial photography are vulnerable to intentional projectile attacks from humans. Dodging such a sudden and fast projectile poses a significant challenge for UAVs, requiring ultra-low latency responses and agile maneuvers. Drawing inspiration from baseball, in which pitchers' body movements are analyzed to predict the ball's trajectory, we propose a novel real-time dodging system that leverages an RGB-D camera. Our approach integrates human pose estimation with depth information to predict the attacker's motion trajectory and the subsequent projectile trajectory. Additionally, we introduce an uncertainty-aware dodging strategy to enable the UAV to dodge incoming projectiles efficiently. Our perception system achieves high prediction accuracy and outperforms the baseline in effective distance and latency. The dodging strategy addresses temporal and spatial uncertainties to ensure UAV safety. Extensive real-world experiments demonstrate the framework's reliable dodging capabilities against sudden attacks and its outstanding robustness across diverse scenarios.
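Once a release state is estimated from the attacker's pose and depth, the projectile segment reduces to ballistics. A minimal sketch; the full system also propagates uncertainty, which is omitted here, and the numbers are illustrative.

```python
import numpy as np

def predict_projectile(p0, v0, t, g=np.array([0.0, 0.0, -9.81])):
    # Ballistic position prediction: p(t) = p0 + v0*t + 0.5*g*t^2.
    return p0 + v0 * t + 0.5 * g * t**2

# e.g., release estimated at 4 m range, thrown toward the UAV (assumed values)
p0 = np.array([4.0, 0.0, 1.8])
v0 = np.array([-8.0, 0.5, 3.0])
times = np.linspace(0.0, 0.8, 9)
trajectory = np.stack([predict_projectile(p0, v0, t) for t in times])
```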
|
| |
| 15:00-16:30, Paper ThI2I.397 | Add to My Program |
| PoCoDP3: Pose and Contact-Aware Visual-Tactile Policy for Contact-Rich 3D Manipulation |
|
| Yue, Zhaokun | Southeast University |
| Tong, Ling | Southeast University |
| Qian, Kun | Southeast University |
Keywords: Force and Tactile Sensing, Imitation Learning, Manipulation Planning
Abstract: Imitation learning in contact-rich tasks requires both global spatial awareness and fine-grained in-hand interaction understanding. However, vision-only policies based on images or point clouds are often susceptible to occlusion and struggle to capture critical contact details, particularly in visually ambiguous regions or during subtle tactile interactions. In this work, we present PoCoDP3, a pose- and contact-aware visual-tactile policy that integrates 3D point clouds and tactile inputs to generate actions in contact-rich tasks. PoCoDP3 introduces a dual-branch tactile encoder that jointly models contact dynamics and estimates in-hand object pose, enabling structured tactile representations for precise contact-rich manipulation. A contact-driven cross-modal fusion mechanism adaptively prioritizes sensory modalities based on real-time interaction cues, enabling efficient visual-tactile integration. Moreover, a reference-guided diffusion policy leverages reference action offsets to reduce sampling steps, significantly accelerating inference while maintaining action quality. Experiments across simulation and real-world tasks demonstrate that PoCoDP3 consistently outperforms representative 2D and 3D policies in terms of both accuracy and inference efficiency.
|
| |
| 15:00-16:30, Paper ThI2I.398 | Add to My Program |
| Safe and Dynamically-Feasible Motion Planning Using Control Lyapunov and Barrier Functions |
|
| Mestres, Pol | California Institute of Technology |
| Nieto-Granda, Carlos | DEVCOM U.S. Army Research Laboratory |
| Cortes, Jorge | University of California, San Diego |
Keywords: Motion and Path Planning, Robot Safety, Optimization and Optimal Control, Control Barrier Functions
Abstract: This paper considers the problem of designing motion planning algorithms for control-affine systems that generate collision-free paths from an initial to a final destination and can be executed using safe and dynamically-feasible controllers. We introduce the C-CLF-CBF-RRT algorithm, which produces paths with such properties and leverages rapidly exploring random trees (RRTs), control Lyapunov functions (CLFs) and control barrier functions (CBFs). For linear systems with polytopic and ellipsoidal constraints, C-CLF-CBF-RRT requires solving a quadratically constrained quadratic program (QCQP) at every iteration of the algorithm, which can be done efficiently. We prove the probabilistic completeness of C-CLF-CBF-RRT and showcase its performance in simulation and hardware experiments.
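For readers unfamiliar with the ingredients, the sketch below shows a generic pointwise CLF-CBF quadratic program for a single integrator using cvxpy; note that C-CLF-CBF-RRT itself solves a QCQP per iteration, which this simplified QP does not reproduce.

```python
import cvxpy as cp
import numpy as np

def clf_cbf_qp(x, x_goal, obs, r_obs, gamma=1.0, alpha=1.0):
    # Pointwise CLF-CBF QP for single-integrator dynamics x_dot = u.
    u = cp.Variable(2)
    delta = cp.Variable()                      # CLF relaxation slack
    V = np.sum((x - x_goal) ** 2)              # CLF: squared goal distance
    h = np.sum((x - obs) ** 2) - r_obs ** 2    # CBF: squared obstacle clearance
    constraints = [
        2 * (x - x_goal) @ u <= -gamma * V + delta,  # CLF decrease condition
        2 * (x - obs) @ u >= -alpha * h,             # CBF safety condition
    ]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u) + 10.0 * delta ** 2),
                      constraints)
    prob.solve()
    return u.value

u = clf_cbf_qp(x=np.array([0.0, 0.0]), x_goal=np.array([5.0, 0.0]),
               obs=np.array([2.5, 0.2]), r_obs=1.0)
```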
|
| |
| 15:00-16:30, Paper ThI2I.399 | Add to My Program |
| Environment-Driven Online LiDAR-Camera Extrinsic Calibration (I) |
|
| Huang, Zhiwei | Tongji University |
| Li, Jiaqi | Tongji University |
| Zhao, Hongbo | Tongji University |
| Zhong, Ping | Central South University |
| Ma, Xiao | Beijing Institute of Aerospace Control Devices |
| Zhou, Xiao-Hu | Institute of Automation, Chinese Academy of Sciences |
| Ye, Wei | Tongji University |
| Fan, Rui | Tongji University |
Keywords: Sensor Fusion
Abstract: LiDAR-camera extrinsic calibration (LCEC) is crucial for multi-modal data fusion in autonomous robotic systems. Existing methods, whether target-based or target-free, typically rely on customized calibration targets or fixed scene types, which limit their applicability in real-world scenarios. To address these challenges, we present EdO-LCEC, the first environment-driven online calibration approach. Unlike traditional target-free methods, EdO-LCEC employs a generalizable scene discriminator to estimate the feature density of the application environment. Guided by this feature density, EdO-LCEC extracts LiDAR intensity and depth features from varying perspectives to achieve higher calibration accuracy. To overcome the challenges of cross-modal feature matching between LiDAR and camera, we introduce dual-path correspondence matching (DPCM), which leverages both structural and textural consistency for reliable 3D-2D correspondences. Furthermore, we formulate the calibration process as a joint optimization problem that integrates global constraints across multiple views and scenes, thereby enhancing overall accuracy. Extensive experiments on real-world datasets demonstrate that EdO-LCEC outperforms state-of-the-art methods, particularly in scenarios involving sparse point clouds or partially overlapping sensor views.
|
| |
| 15:00-16:30, Paper ThI2I.400 | Add to My Program |
| A Novel Robotic Prototype Simulating Oral and Pharyngeal Swallowing with Passive Epiglottis Actuation |
|
| Zhou, Zizhong | University of Bristol |
| Gambaruto, Alberto | University of Bristol |
| Tzemanaki, Antonia | University of Bristol |
Keywords: Mechanism Design, Medical Robots and Systems, Biomimetics
Abstract: Robots simulating swallowing hold the potential to enhance our understanding of the swallowing process, support the development of safer texture- or viscosity-modified foods and beverages, and act as medical education tools for both patients with dysphagia and healthcare professionals. Although robotic models in the literature offer insightful actuation mechanisms, many tackle only isolated stages of swallowing, have reduced physiological accuracy, and tend to be mechanically complex and costly. This paper addresses these limitations by developing a novel robotic model that replicates the oral and pharyngeal stages of swallowing, featuring a passive system to simulate the protective closure of the epiglottis. This paper presents the design, function, and experimental validation of the robot model. The proposed model can transport thickened fluids from the tongue to the pharynx, preventing aspiration. By enabling passive epiglottis closure, this model advances the physiological fidelity of swallowing robotics, offering insights into actuation mechanisms for future studies.
|
| |
| 15:00-16:30, Paper ThI2I.401 | Add to My Program |
| Load-Bearing Assessment for Safe Locomotion of Quadruped Robots on Collapsing Terrain |
|
| Suzano Medeiros, Vivian | University of São Paulo |
| Dessy, Giovanni Battista | Istituto Italiano Di Tecnologia |
| Boaventura, Thiago | University of São Paulo |
| Becker, Marcelo | University of São Paulo |
| Semini, Claudio | Istituto Italiano Di Tecnologia |
| Barasuol, Victor | Istituto Italiano Di Tecnologia |
Keywords: Legged Robots, Whole-Body Motion Planning and Control, Vision-Based Navigation
Abstract: Collapsing terrains, often present in search and rescue missions or planetary exploration, pose significant challenges for quadruped robots. This paper introduces a robust locomotion framework for safe navigation over unstable surfaces by integrating terrain probing, load-bearing analysis, motion planning, and control strategies. Unlike traditional methods that rely on specialized sensors or external terrain mapping alone, our approach leverages joint measurements to assess terrain stability without hardware modifications. A Model Predictive Control (MPC) system optimizes robot motion, balancing stability and probing constraints, while a state machine coordinates terrain probing actions, enabling the robot to detect collapsible regions and dynamically adjust its footholds. Experimental results on custom-made collapsing platforms and rocky terrains demonstrate the framework's ability to traverse collapsing terrain while maintaining stability and prioritizing safety.
|
| |
| 15:00-16:30, Paper ThI2I.402 | Add to My Program |
| DVMM: A Dual-View Combination Descriptor for Multi-Modal LiDARs Online Place Recognition |
|
| Duan, Xuzhe | Wuhan University |
| Hu, Qingwu | Wuhan University |
| Ai, Mingyao | Wuhan University |
| Zhao, Pengcheng | Wuhan University |
| Li, Jiayuan | Wuhan University |
Keywords: SLAM, Multi-Robot SLAM, Localization
Abstract: Existing place recognition descriptors developed for single-agent SLAM struggle with multi-modal LiDAR differences in collaborative SLAM. To overcome this, we propose an online place recognition method for multi-modal LiDARs. This method introduces a dual-view combination descriptor, termed DVMM, by separately encoding azimuthal and vertical scene information. The place recognition process consists of two stages: loop closure detection and verification. In the detection stage, point clouds are projected onto an adaptive grid and a 1D azimuthal descriptor is generated via Gaussian-weighted column summation. The azimuthal descriptor is utilized to retrieve loop candidates through vector matching. In the verification stage, point clouds within a fixed height range are encoded as a binary occupancy image, which serves as the cross-section descriptor. Accurate loop closures are determined by performing image matching on the cross-section descriptors. We evaluate the proposed method on both public and real-world datasets encompassing a total of seven LiDAR sensors. The results demonstrate that DVMM significantly outperforms state-of-the-art descriptors in handling multi-modal LiDAR data and is compatible with collaborative SLAM systems. The code will be open-sourced upon acceptance.
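A rough sketch of the detection-stage descriptor: project points to azimuth bins and accumulate Gaussian weights per column. The height-centered weighting, bin count, and reference height z0 are assumptions for illustration; the paper's adaptive grid and exact weighting differ.

```python
import numpy as np

def azimuthal_descriptor(points, n_bins=360, z0=0.0, sigma_z=1.0):
    # points: (N, 3) LiDAR point cloud in the sensor frame.
    az = np.arctan2(points[:, 1], points[:, 0])
    cols = ((az + np.pi) / (2.0 * np.pi) * n_bins).astype(int) % n_bins
    # Gaussian weight over height: points near the reference height z0
    # contribute most to their azimuth column (weighting is an assumption).
    w = np.exp(-0.5 * ((points[:, 2] - z0) / sigma_z) ** 2)
    desc = np.bincount(cols, weights=w, minlength=n_bins)
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc
```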
|
| |
| 15:00-16:30, Paper ThI2I.403 | Add to My Program |
| A Differentiable Distance Metric for Robotics through Generalized Alternating Projection |
|
| Gonçalves, Vinicius Mariano | Federal University of Minas Gerais, UFMG, Brazil |
| Wei, Shiqing | New York University |
| Malacarne Soeiro de Souza, Eduardo | UFMG |
| Krishnamurthy, Prashanth | New York University Tandon School of Engineering |
| Tzes, Anthony | New York University Abu Dhabi |
| Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Constrained Motion Planning, Computational Geometry, Motion Control
Abstract: In many robotics applications, it is necessary to compute not only the distance between the robot and the environment, but also its derivative, for example, when using control barrier functions. However, since the traditional Euclidean distance is not guaranteed to be differentiable everywhere, there is a need for alternative distance metrics that possess this property. Recently, a metric with guaranteed differentiability was proposed [1]. This approach has some important drawbacks, which we address in this paper. We provide much simpler and more practical expressions for the smooth projection for general convex polytopes. Additionally, as opposed to [1], we ensure that the distance vanishes as the objects overlap. We show the efficacy of the approach in experimental results. Our proposed distance metric is publicly available through the Python-based simulation package UAIBot.
|
| |
| 15:00-16:30, Paper ThI2I.404 | Add to My Program |
| Cross-Embodiment Imitation: Learning a Unified Latent Space for Multi-Robot Control |
|
| Yan, Yashuai | Vienna University of Technology |
| Lee, Dongheui | Technische Universität Wien (TU Wien) |
Keywords: Multi-Robot Systems, Imitation Learning, Machine Learning for Robot Control
Abstract: We present a scalable framework for cross-embodiment humanoid robot control by learning a shared latent representation that unifies motion across humans and diverse humanoid platforms, including single-arm, dual-arm, and legged humanoid robots. Our method proceeds in two stages: first, we construct a decoupled latent space that captures localized motion patterns across different body parts using contrastive learning, enabling accurate and flexible motion retargeting even across robots with diverse morphologies. To enhance alignment between embodiments, we introduce tailored similarity metrics that combine joint rotation and end-effector positioning for critical segments, such as arms. Then, we train a goal-conditioned control policy directly within this latent space using only human data. Leveraging a conditional variational autoencoder, our policy learns to predict latent space displacements guided by intended goal directions. We show that the trained policy can be directly deployed on multiple robots without any adaptation. Furthermore, our method supports the efficient addition of new robots to the latent space by learning only a lightweight, robot-specific embedding layer. The learned latent policies can also be directly applied to the new robots. Experimental results demonstrate that our approach enables robust, scalable, and embodiment-agnostic robot control across a wide range of humanoid platforms.
|
| |
| 15:00-16:30, Paper ThI2I.405 | Add to My Program |
| Knee-Inspired Hinge Absorbs Longitudinal Impacts to Enhance Robot-Environment Interaction Safety |
|
| Yang, Lianxin | Beihang University |
| Li, Xinyan | Tsinghua University |
| Zhao, Tianyu | Tsinghua University |
| Zhao, Zhihua | Tsinghua University |
Keywords: Biomimetics, Compliant Joint/Mechanism, Physical Human-Robot Interaction, Impact Absorption
Abstract: As robots integrate into human society, safe robot-environment interaction has emerged as a growing priority. A promising solution is introducing compliance to existing robots, akin to musculoskeletal systems, to absorb impacts. However, mimicking longitudinal compliance in biological joints remains a challenge due to its complex architecture. Here, adapting the elastic longitudinal movement structure of the knee, we incorporate mechanical hinges with a compact buffer structure to enable both simple rotation and effective longitudinal impact absorption. Under longitudinal loading, the buffer structure converts the limited compression into amplified deformations of elastic elements, and thus produces resistance. The load-displacement curve is tailored for a high-static-low-dynamic stiffness to improve energy absorption efficiency. Drop tests and walking robot demonstrations confirm that our knee-inspired hinge not only mitigates the acceleration transmitted to the robot body but also reduces ground reaction forces, thus improving robot-environment interaction safety. This work highlights the design paradigm of adapting natural solutions, and holds potential for direct integration into robots.
|
| |
| 15:00-16:30, Paper ThI2I.406 | Add to My Program |
| SilRef: Joint Visual Silhouette and Tactile Pose Optimization for Transparent Object Manipulation |
|
| Weibel, Jean-Baptiste | TU Wien |
| Dubois, Clemence | Université Paris-Saclay, CEA List |
| Layegh Khavidaki, Negar | Technical University of Vienna |
| Aloui, Saifeddine | Université Grenoble Alpes, CEA, Leti |
| Grossard, Mathieu | Université Paris-Saclay, CEA, List |
| Vincze, Markus | Vienna University of Technology |
| Holzinger, Andreas | Human-Centered AI Lab, University of Natural Resources and Life Sciences Vienna |
|
|
| |
| 15:00-16:30, Paper ThI2I.407 | Add to My Program |
| Harmonising Safety Paradigms: Energy-Aware Control of Active Response and Passive Compliance for Safety-Critical Robotic Tasks |
|
| Zhao, Xinyuan | Institute for Infocomm Research, A*STAR |
| Liang, Wenyu | Institute for Infocomm Research, A*STAR |
| Xue, Junyuan | National University of Singapore |
| Wu, Yan | A*STAR Institute for Infocomm Research |
Keywords: Safety in HRI, Motion Control, Compliance and Impedance Control
Abstract: Ensuring safety in robotic manipulation is increasingly critical as robots become integrated into human-shared environments for complex physical interaction tasks. This paper presents an energy-aware control framework that combines active responses with passive compliance for safety-critical robotic manipulation. Specifically, Control Barrier Functions (CBFs) are employed for active collision avoidance with detected obstacles, which are then integrated with fallback safety actions to resolve potential violation of CBF constraints. Complementing this active safety paradigm, a passive safety paradigm is implemented to mitigate post-collision impacts by monitoring energy variance and limiting power exchanges. Furthermore, an energy tank is incorporated to enforce passivity of the robot, which is crucial to address potential instability issues in variable impedance control. To make the tank adaptive to varying energy requirements arising from dynamic environments and unpredictable events, we propose a novel, task-agnostic tank recharging condition without compromising the system's passivity guarantee. The effectiveness of the proposed control framework is validated through experiments on a KUKA iiwa 14 robot.
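The energy-tank mechanism can be sketched in a few lines; this is the generic form of the idea, while the paper's task-agnostic recharging condition is more elaborate and is not reproduced.

```python
def step_energy_tank(tank, P_dissipated, P_extracted, dt,
                     tank_min=0.1, tank_max=10.0):
    # Dissipated power refills the tank; power drawn by the controller
    # depletes it. If the requested extraction would empty the tank below
    # tank_min, scale the action down so passivity is preserved.
    refill = P_dissipated * dt
    budget = tank + refill - tank_min     # energy safely available this step
    demand = P_extracted * dt
    if demand <= max(budget, 0.0):
        scale = 1.0
    else:
        scale = max(0.0, budget) / demand
    tank = min(tank + refill - scale * demand, tank_max)
    return tank, scale
```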
|
| |
| 15:00-16:30, Paper ThI2I.408 | Add to My Program |
| AsterNav: Autonomous Aerial Robot Navigation in Darkness Using Passive Computation |
|
| Singh, Deepak | Worcester Polytechnic Institute |
| Khobragade, Shreyas | Worcester Polytechnic Institute |
| Jagannatha Sanket, Nitin | Worcester Polytechnic Institute |
Keywords: Aerial Systems: Perception and Autonomy, Vision-Based Navigation, Deep Learning for Visual Perception
Abstract: Autonomous aerial navigation in absolute darkness is crucial for post-disaster search and rescue operations, where darkness often results from disaster-zone power outages. Yet, due to resource constraints, tiny aerial robots, perfectly suited for these operations, are unable to navigate in the darkness to find survivors safely. In this paper, we present an autonomous aerial robot for navigation in the dark by combining an Infra-Red (IR) monocular camera with a large-aperture coded lens and structured light, without external infrastructure like GPS or motion capture. Our approach obtains depth-dependent defocus cues (each structured light point appears as a pattern that is depth dependent), which act as a strong prior for our AsterNet deep depth estimation model. The model is trained in simulation by generating data using a simple optical model and transfers directly to the real world without any fine-tuning or retraining. AsterNet runs onboard the robot at 20 Hz on an NVIDIA Jetson Orin Nano. Furthermore, our network is robust to changes in the structured light pattern and the relative placement of the pattern emitter and IR camera, leading to simplified and cost-effective construction. We successfully evaluate and demonstrate our proposed depth-based navigation approach AsterNav, using depth from AsterNet, in many real-world experiments using only onboard sensing and computation, including dark matte obstacles and thin ropes (diameter 6.25 mm), achieving an overall success rate of 95.5% with unknown object shapes, locations, and materials. To the best of our knowledge, this is the first work on monocular, structured-light-based quadrotor navigation in absolute darkness.
|
| |
| 15:00-16:30, Paper ThI2I.409 | Add to My Program |
| Reliable Robotic Task Execution in the Face of Anomalies |
|
| Santhanam, Bharath | Neura Robotics |
| Mitrevski, Alex | Chalmers University of Technology |
| Thoduka, Santosh | Fraunhofer IAIS |
| Houben, Sebastian | University of Applied Sciences Bonn-Rhein-Sieg |
| Hassan, Teena | Bonn-Rhein-Sieg University of Applied Sciences |
Keywords: Learning from Experience, Failure Detection and Recovery, Visual Learning
Abstract: Learned robot policies have consistently been shown to be versatile, but they typically have no built-in mechanism for handling the complexity of open environments, making them prone to execution failures; this implies that deploying policies without the ability to recognise and react to failures may lead to unreliable and unsafe robot behaviour. In this paper, we present a framework that couples a learned policy with a method to detect visual anomalies during policy deployment and to perform recovery behaviours when necessary, thereby aiming to prevent failures. Specifically, we train an anomaly detection model using data collected during nominal executions of a trained policy. This model is then integrated into the online policy execution process, so that deviations from the nominal execution can trigger a three-level sequential recovery process that consists of (i) pausing the execution temporarily, (ii) performing a local perturbation of the robot's state, and (iii) resetting the robot to a safe state by sampling from a learned execution success model. We verify our proposed method in two different scenarios: (i) a door handle reaching task with a Kinova Gen3 arm using a policy trained in simulation and transferred to the real robot, and (ii) an object placing task with a UFactory xArm 6 using a general-purpose policy model. Our results show that integrating policy execution with anomaly detection and recovery increases the execution success rate in environments with various anomalies, such as trajectory deviations and adversarial human interventions.
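The three-level recovery logic reads naturally as a short sequential policy. All interfaces below (robot, success_model) are hypothetical stand-ins for the components described above, not the authors' API.

```python
def recover(robot, threshold, success_model):
    # Level 0: nothing to do while the anomaly score stays nominal.
    if robot.anomaly_score() <= threshold:
        return "nominal"
    robot.pause(seconds=1.0)                   # level 1: pause temporarily
    if robot.anomaly_score() <= threshold:
        return "recovered_by_pausing"
    robot.perturb_state(magnitude=0.02)        # level 2: local perturbation
    if robot.anomaly_score() <= threshold:
        return "recovered_by_perturbation"
    robot.move_to(success_model.sample())      # level 3: reset to a safe state
    return "reset_to_safe_state"
```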
|
| |
| 15:00-16:30, Paper ThI2I.410 | Add to My Program |
| STAGE: STyle-Controllable Action GEneration for Personalized Autonomous Driving |
|
| Liu, Zihao | Northwestern Polytechnical University |
| Liu, Xing | Northwestern Polytechnical University |
| Zhang, Yizhai | Northwestern Polytechnical University |
| Huang, Panfeng | Northwestern Polytechnical University |
Keywords: Autonomous Vehicle Navigation, Human Factors and Human-in-the-Loop, Imitation Learning
Abstract: Driving style refers to the behavioral preferences that drivers maintain during driving, shaped by their diverse experiences, habits, and needs, and is typically reflected in varying levels of aggressiveness. If humans choose to use autonomous driving systems, they would expect the driving style of the systems to closely resemble their own habits. However, this is challenging for current industrial autonomous driving systems. To address this, we developed a style-controllable action generation method, STAGE, for driving tasks. Its training process is based on imitation learning, incorporating both style-value and latent-value action modality encoding. Preference learning is then used to identify the user's driving style as a continuous, monotonic style value. To reduce the cost of human involvement in the preference training process, we also developed a set of rules to compare driving styles in data pairs. Then, during inference, the user inputs the style value to control the generated action patterns, dynamically meeting the user's expectations. Using the STAGE method, we verified that the style-controlled action generation results in several typical road scenarios align closely with human expectations. Furthermore, through comparisons between the STAGE method and various other approaches, we reveal the unique functionalities of STAGE, including its style controllability, style continuity, driving style alignment capability, and driving safety. The code for this work is available at: github.com/CarlDegio/STAGE
|
| |
| 15:00-16:30, Paper ThI2I.411 | Add to My Program |
| Fabric Dynamic Motion Modeling and Collision Avoidance with Oriented Bounding Box |
|
| Li, Letian | The University of Hong Kong |
| Tokuda, Fuyuki | Tohoku University |
| Seino, Akira | Centre for Transformative Garment Production |
| Kobayashi, Akinari | Centre for Transformative Garment Production |
| Tien, Norman | University of Hong Kong |
| Kosuge, Kazuhiro | The University of Hong Kong |
Keywords: Collision Avoidance, Deep Learning Methods, Motion and Path Planning
Abstract: Avoiding collisions between the fabric and obstacles is critical when transporting fabric pieces in a garment factory. If the fabric collides with a sharp-edged obstacle, it can be scratched or contaminated, resulting in poor product quality and increased waste. However, current fabric models are not accurate enough for this real-world application: it is almost impossible to model the dynamic motion of fabric with high accuracy, because its motion and deformation are affected by many hard-to-estimate factors. In this paper, instead of using an accurate fabric model, we propose a new fabric motion modeling method using the proposed OBB-Transformer, which models the fabric motion as a time series of oriented bounding boxes (OBBs). Using the OBB-Transformer, a dynamic collision avoidance method is designed to plan a robot trajectory connecting the start point and the goal point without collision between the fabric and the obstacle. The performance of the fabric dynamic motion modeling is compared between the proposed and conventional methods. Then, the collision avoidance of a piece of fabric using the proposed method is demonstrated on a real robot system in both 2D and 3D scenarios.
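An OBB itself is cheap to fit per frame, for instance via PCA as sketched below; the contribution above lies in predicting the time series of such boxes with the OBB-Transformer, which this sketch does not attempt.

```python
import numpy as np

def oriented_bounding_box(points):
    # points: (N, 3) fabric point cloud. Principal axes come from the SVD
    # of the centered points; extents are taken in the principal frame.
    center = points.mean(axis=0)
    centered = points - center
    _, _, axes = np.linalg.svd(centered, full_matrices=False)
    local = centered @ axes.T
    half_extents = (local.max(axis=0) - local.min(axis=0)) / 2.0
    box_center = center + ((local.max(axis=0) + local.min(axis=0)) / 2.0) @ axes
    return box_center, axes, half_extents
```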
|
| |
| 15:00-16:30, Paper ThI2I.412 | Add to My Program |
| Safe Navigation under State Uncertainty: Online Adaptation for Robust Control Barrier Functions |
|
| Das, Ersin | Illinois Institute of Technology |
| Nanayakkara, Rahal | University of California at Los Angeles |
| Tan, Xiao | Beihang University |
| Bena, Ryan | California Institute of Technology |
| Burdick, Joel | California Institute of Technology |
| Tabuada, Paulo | UCLA |
| Ames, Aaron | California Institute of Technology |
Keywords: Robot Safety, Robust/Adaptive Control, Optimization and Optimal Control
Abstract: Measurements and state estimates are often imperfect in control practice, posing challenges for safety-critical applications, where safety guarantees rely on accurate state information. In the presence of estimation errors, several prior robust control barrier function (R-CBF) formulations have imposed strict conditions on the input. These methods can be overly conservative and can introduce issues such as infeasibility and high control effort. This work proposes a systematic method to improve R-CBFs, and demonstrates its advantages on a tracked vehicle that navigates among multiple obstacles. A primary contribution is a new optimization-based online parameter adaptation scheme that reduces the conservativeness of existing R-CBFs. In order to reduce the complexity of the parameter optimization, we merge several safety constraints into one unified numerical CBF via Poisson’s equation. We further address the dual relative degree issue that typically causes difficulty in vehicle tracking. Experimental trials demonstrate the overall performance improvement of our approach over existing formulations.
|
| |
| 15:00-16:30, Paper ThI2I.413 | Add to My Program |
| Safe and Efficient Quadrupedal Locomotion with a Chambolle-Pock Whole-Body Controller |
|
| Yang, Xu | Tsinghua University |
| Wang, Run | Tsinghua University |
| Lu, Yiwen | Tsinghua University |
| Mo, Yilin | Tsinghua University |
Keywords: Legged Robots, Optimization and Optimal Control, Deep Learning in Robotics and Automation, Reinforcement Learning
Abstract: This paper presents a hierarchical control framework for quadrupedal locomotion that unifies the complementary strengths of model-based optimization and reinforcement learning. We develop a convex Quadratic Programming (QP) solver based on the primal-dual Chambolle-Pock algorithm, enabling both massively parallel policy training and real-time deployment through efficient handling of constrained optimization problems. Our hierarchical framework employs learned policies for robust high-level control to handle real-world perturbations, while ensuring safety and energy efficiency through a low-level whole-body controller powered by the proposed solver. Extensive benchmarks and experimental validation demonstrate quantifiable improvements in energy consumption, constraint satisfaction, and task transferability across simulated and real-world environments.
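As background on the solver, a minimal dense Chambolle-Pock (primal-dual) iteration for an inequality-constrained QP is sketched below; the paper's solver is engineered for massively parallel training and real-time whole-body control, which this toy version does not capture.

```python
import numpy as np

def chambolle_pock_qp(Q, q, A, b, iters=2000):
    # Solves: min 0.5 x'Qx + q'x  subject to  Ax <= b  (dense toy version).
    n, m = Q.shape[0], A.shape[0]
    L = np.linalg.norm(A, 2)
    tau = sigma = 0.9 / L                    # step sizes: tau*sigma*L^2 < 1
    x, x_bar, y = np.zeros(n), np.zeros(n), np.zeros(m)
    M = np.linalg.inv(np.eye(n) + tau * Q)   # prox of the quadratic objective
    for _ in range(iters):
        # Dual step: prox of the conjugate of the indicator of {z <= b}.
        y = np.maximum(0.0, y + sigma * (A @ x_bar - b))
        # Primal step: prox of 0.5 x'Qx + q'x.
        x_new = M @ (x - tau * (A.T @ y + q))
        x_bar = 2.0 * x_new - x              # extrapolation
        x = x_new
    return x

# Toy problem: min 0.5||x||^2 - x1  subject to  x <= 0.5 elementwise.
x = chambolle_pock_qp(Q=np.eye(2), q=np.array([-1.0, 0.0]),
                      A=np.eye(2), b=np.array([0.5, 0.5]))
```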
|
| |
| 15:00-16:30, Paper ThI2I.414 | Add to My Program |
| Mechanistic Analysis of Cable Tension Effects on the Stiffness of Cable-Driven Serpentine Manipulators |
|
| Dai, Yicheng | Harbin Institute of Technology (Shenzhen) |
| Wang, Sheng | Hefei University of Technology |
| Wang, Xin | Harbin Institute of Technology, Shenzhen |
| Yuan, Han | Harbin Institute of Technology |
Keywords: Tendon/Wire Mechanism, Dynamics, Flexible Robotics
Abstract: This paper presents a mechanistic analysis of stiffness in cable-driven serpentine manipulators (CDSMs), incorporating both cable tension and cable stiffness. First, we derive an analytical stiffness model based on robot statics, identifying cable tension and stiffness as the dominant factors governing robot stiffness at a given configuration. Crucially, we characterize a previously overlooked tension-stiffness coupling effect: cable tension induces nonlinear stiffness variations in driving cables, significantly altering overall robot stiffness. Due to this interdependence, quantifying cable tension’s specific influence on stiffness remains a challenging research gap. To address this, simulations and experiments validate the model and quantify their coupled effects on robotic stiffness. Results demonstrate that for CDSMs using multi-strand cables with nonlinear stiffness, robot stiffness increases sharply with rising tension. Conversely, when cable elasticity is constant, robot stiffness decreases with increasing tension. These findings provide critical insights for advancing stiffness control accuracy in CDSMs.
|
| |
| 15:00-16:30, Paper ThI2I.415 | Add to My Program |
| TerrFlat: Physics-Driven Geometry Representation for Structure-Aware Freespace Detection |
|
| Yang, Jingwei | Tongji University |
| Wang, Liuyi | Tongji University |
| Shen, Mengjiao | Tongji University |
| Du, Jiayuan | Tongji University |
| Liu, Chengju | Tongji University |
| Chen, Qijun | Tongji University |
Keywords: Semantic Scene Understanding, Intelligent Transportation Systems
Abstract: Freespace detection in autonomous driving is limited by the lack of explicit geometric modeling, hindering generalization across complex terrains. Existing approaches are predominantly data-driven and neglect the physical structure of drivable surfaces. We propose Terrain Flat (TerrFlat), a physics-driven geometric representation that models road surfaces along three interpretable dimensions: lateral smoothness, longitudinal consistency, and vertical deviation. TerrFlat is constructed through geometric reasoning and projected into pixel-aligned maps via a differentiable projection, ensuring geometric–visual consistency. Building on this representation, we introduce a symmetric feature fusion module (SFFM) to integrate TerrFlat with visual features through bidirectional recalibration, improving semantic discrimination and boundary localization. Together, TerrFlat and SFFM form TerrFlat-Seg, a unified framework for physics-aware freespace perception. Experiments on KITTI-Road, Semantic-KITTI, and ORFD datasets demonstrate consistent improvements over existing baselines. Real-world validation on an automated guided vehicle platform further confirms the robustness of our approach.
|
| |
| 15:00-16:30, Paper ThI2I.416 | Add to My Program |
| Shear-Based Grasp Control for Multi-Fingered Underactuated Tactile Robotic Hands |
|
| Ford, Christopher | University of Bristol |
| Li, Haoran | University of Bristol |
| Catalano, Manuel Giuseppe | Istituto Italiano Di Tecnologia |
| Bianchi, Matteo | University of Pisa |
| Psomopoulou, Efi | University of Bristol |
| Lepora, Nathan | University of Bristol |
Keywords: Force and Tactile Sensing, Underactuated Robots, Dexterous Manipulation, Grippers and Other End-Effectors
Abstract: This paper presents a shear-based control scheme for grasping and manipulating delicate objects with a Pisa/IIT anthropomorphic SoftHand equipped with soft biomimetic tactile sensors on all five fingertips. These 'microTac' tactile sensors are miniature versions of the TacTip vision-based tactile sensor, and can extract precise contact geometry and force information at each fingertip for use as feedback into a controller to modulate the grasp while a held object is manipulated. Using a parallel processing pipeline, we asynchronously capture tactile images and predict contact pose and force from multiple tactile sensors. Consistent pose and force models across all sensors are developed using supervised deep learning with transfer learning techniques. We then develop a grasp control framework that uses contact force feedback from all fingertip sensors simultaneously, allowing the hand to safely handle delicate objects even under external disturbances. This control framework is applied to several grasp-manipulation experiments: first, retaining a flexible cup in a grasp without crushing it under changes in object weight; second, a pouring task where the center of mass of the cup changes.
|
| |
| 15:00-16:30, Paper ThI2I.417 | Add to My Program |
| Unveiling Non-Reproducibility in LiDAR-Inertial Odometry |
|
| Huang, Hongqian | Xi'an Jiaotong University |
| Zhang, Meng | Xi'an Jiaotong University |
| Hu, Jianchen | Xi'an Jiaotong University |
| Guan, Xiaohong | Xi'an Jiaotong University |
Keywords: Localization, SLAM, Software Tools for Benchmarking and Reproducibility
Abstract: This letter presents empirical research on the non-reproducibility of light detection and ranging (LiDAR)-inertial odometry (LIO) systems. Although the LIO community has made commendable efforts toward reproducible localization accuracy, noteworthy non-reproducibility remains, thus hindering a fair evaluation of method effectiveness. To better understand such non-reproducibility, we first define non-reproducibility and introduce a quantitative criterion to identify noteworthy non-reproducibility. We then propose five significant non-deterministic implementations that are included in state-of-the-art LIO systems and present solutions for modifying these non-deterministic implementations into deterministic ones. A general procedure is also introduced to identify and pinpoint non-deterministic implementations, regardless of whether they are covered in this letter. Extensive experiments demonstrate that the non-deterministic implementations are the major or potentially sole causes of non-reproducibility under constant experimental conditions. Additionally, the non-reproducibility is noteworthy in datasets obtained from low-vertical-resolution LiDARs or recorded in geometrically degenerate scenes.
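A tiny example of the kind of non-determinism at issue: floating-point addition is not associative, so reductions whose operand order varies between runs (e.g., multi-threaded accumulation) yield run-to-run differences, and fixing a canonical order is one way to make them deterministic. This example is illustrative and is not one of the five implementations studied in the letter.

```python
import numpy as np

def deterministic_sum(values):
    # Fix a canonical operand order before accumulating, so the result is
    # bit-identical across runs regardless of how the values arrived.
    return float(np.sum(np.sort(values)))
```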
|
| |
| 15:00-16:30, Paper ThI2I.418 | Add to My Program |
| Behavior-Controllable Stable Dynamics Models on Riemannian Configuration Manifolds |
|
| Lee, Byeongho | Samsung Electronics |
| Lee, Yonghyeon | MIT |
| Ha, Junsu | Seoul National University |
| Park, Frank | Seoul National University |
Keywords: Robust/Adaptive Control of Robotic Systems, Motion Control of Manipulators, Learning from Demonstration, Stable Dynamical System Learning
Abstract: Due to their stability and robustness, Stable Dynamical Systems (SDS) have received attention as a means of representing motions in learning-from-demonstration tasks. Designing vector fields that fit complex trajectories while ensuring stability still remains a key challenge; although recent deep learning-based methods have shown progress, their tendency to overfit demonstration trajectories often leads to undesirable behaviors, particularly as tasks deviate from demonstrations. Fundamentally, the only reliable way to address this lack of generalization is to provide supervision in out-of-demonstration regions. Focusing on mimicking and contracting behaviors, we propose a Behavior-Controllable Stable Dynamics Model (BCSDM), a one-parameter family of SDS that allows users to adjust the system's overall behavior depending on user intent. We show how to extend BCSDM to accommodate demonstrations of multiple tasks, and propose a Deep Operator Vector Field (DeepOVec) for memory-efficient encoding of multiple dynamical systems. Experiments on tasks that involve mimicking or contracting behaviors demonstrate the advantages of BCSDMs over existing state-of-the-art methods.
|
| |
| 15:00-16:30, Paper ThI2I.419 | Add to My Program |
| ColonAdapter: Geometry Estimation through Foundation Model Adaptation for Colonoscopy |
|
| Jiang, Zhiyi | Southeast University |
| Wang, Yifu | Tencent |
| Cheng, Xuelian | Monash University |
| Ge, Zongyuan | Monash University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Medical Robotics, Localization
Abstract: Estimating 3D geometry from monocular colonoscopy images is challenging due to non-Lambertian surfaces, moving light sources, and large textureless regions. While recent 3D geometric foundation models eliminate the need for multi-stage pipelines, their performance deteriorates in clinical scenes. These models are primarily trained on natural scene datasets and struggle with specularity and homogeneous textures typical in colonoscopy, leading to inaccurate geometry estimation. In this paper, we present ColonAdapter, a self-supervised fine-tuning framework that adapts geometric foundation models for colonoscopy geometry estimation. Our method leverages pretrained geometric priors while tailoring them to clinical data. To improve performance in low-texture regions and ensure scale consistency, we introduce a Detail Restoration Module (DRM) and a geometry consistency loss. Furthermore, a confidence-weighted photometric loss enhances training stability in clinical environments. Experiments on both synthetic and real datasets demonstrate that our approach achieves state-of-the-art performance in camera pose estimation, monocular depth prediction, and dense 3D point map reconstruction, without requiring ground-truth intrinsic parameters.
|
| |
| 15:00-16:30, Paper ThI2I.420 | Add to My Program |
| Large Pre-Trained Models and Few-Shot Fine-Tuning for Virtual Metrology: A Framework for Uncertainty-Driven Adaptive Process Control in Semiconductor Manufacturing (I) |
|
| Lin, Chin-Yi | University of Texas at El Paso |
| Tseng, Tzu-Liang (Bill) | University of Texas at El Paso |
| Emon, Solayman Hossain | University of Texas at El Paso |
| Tsai, Tsung-Han | National Taipei University of Business |
Keywords: Deep Learning Methods, Reinforcement Learning, Big Data in Robotics and Automation
Abstract: High-precision wafer metrology poses significant cost and throughput challenges in modern semiconductor manufacturing, where frequent process changes and recipe variations demand highly adaptive and scalable solutions. In this paper, we present a Generative-FewShot-Active Virtual Metrology (GFA-VM) framework that unifies large-scale generative modeling, few-shot fine-tuning, and uncertainty-driven active sampling into a single, data-centric system. A foundational generative model, built on a hybrid architecture of Transformer networks and Variational Autoencoders (VAEs), learns diverse sensor characteristics in an offline stage without relying on extensive labeled data. During online inference, the model produces both wafer quality predictions and predictive uncertainties; samples exceeding a dynamic uncertainty threshold are selected for physical measurement and few-shot model recalibration. This selective sampling both reduces measurement costs and adapts rapidly to new process conditions (e.g., novel recipes or equipment upgrades), requiring only a handful of freshly labeled wafers. The paper further addresses the long-term stability of the system through a self-updating mechanism that adjusts the uncertainty threshold when distributional shifts occur. Empirical evaluations confirm that our GFA-VM approach achieves state-of-the-art accuracy while significantly reducing metrology overhead compared to conventional virtual metrology methods.
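The online loop can be summarized in a few lines, with all interfaces (model, measure, fine_tune) as hypothetical stand-ins for the components described above.

```python
def gfa_vm_step(model, wafer, threshold, measure, fine_tune):
    # Predict quality and predictive uncertainty for the incoming wafer;
    # only wafers whose uncertainty exceeds the (dynamic) threshold are
    # physically measured and used for few-shot recalibration.
    prediction, uncertainty = model.predict(wafer)
    if uncertainty > threshold:
        label = measure(wafer)            # costly physical metrology
        fine_tune(model, wafer, label)    # few-shot model recalibration
    return prediction
```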
|
| |
| 15:00-16:30, Paper ThI2I.421 | Add to My Program |
| Density-Aware Point Cloud Upsampling Via Relational Graph Flow Matching |
|
| Deng, Yuzhong | University of Electronic Science and Technology of China |
| Liu, Dongzhen | University of Electronic Science and Technology of China |
| Wu, Di | University of Electronic Science and Technology of China |
| Song, Junlin | Shenzhen Institute of Artificial Intelligence and Robotics for Society |
| Zou, Jianxiao | UESTC |
| Fan, Shicai | University of Electronic Science and Technology of China |
|
|
| |
| 15:00-16:30, Paper ThI2I.422 | Add to My Program |
| A Cooperation Control Framework Based on Admittance Control and Time-Varying Passive Velocity Field Control for Human-Robot Co-Carrying Tasks (I) |
|
| Dang, Van Trong | Nara Institute of Science and Technology |
| Kotake, Hiroki | Nara Institute of Science and Technology |
| Honji, Sumitaka | Nara Institute of Science and Technology |
| Wada, Takahiro | Nara Institute of Science and Technology |
Keywords: Human-Robot Collaboration, Safety in HRI, Intention Recognition
Abstract: Human-robot co-carrying tasks reveal their potential in both industrial and everyday applications by leveraging the strengths of both parties. However, such collaborative tasks pose numerous challenges due to varied human intentions under time-varying workspaces, leading to human-robot conflicts. In this paper, we develop a cooperation control framework for human-robot co-carrying tasks, composed of a reference generator and a low-level controller, that aims to achieve safe interaction and synchronized human-robot movement. Firstly, the human motion predictions are corrected in the event of prediction errors based on the conflicts measured by the interaction forces through admittance control, thereby mitigating conflict levels. The low-level controller, using an energy-compensation passive velocity field control approach, encodes the corrected motion to produce control torques for the robot. In this manner, the closed-loop robotic system is passive when the energy level exceeds the predetermined threshold, and non-passive otherwise. Furthermore, the passivity, stability, energy-compensation rate, and power flow regulation are analyzed from theoretical viewpoints. Human-in-the-loop experiments involving 18 participants have demonstrated that the proposed method significantly enhances task performance and reduces human workload, as evidenced by both objective metrics and subjective evaluations.
|
| |
| 15:00-16:30, Paper ThI2I.423 | Add to My Program |
| ROGIBOT: Development of a Dual-Arm Autonomous Container Unloading Robot System for Various Logistics Packages |
|
| Auh, Eugene | Sungkyunkwan University |
| Oh, Ilho | Sungkyunkwan University |
| Lee, JangHoon | Sungkyunkwan University, RISE Lab |
| Pico Rosas, Nabih Andres | Sungkyunkwan University |
| Park, Yeong-Jae | Sungkyunkwan University |
| Jang, Jae Hyuck | Sungkyunkwan University |
| Lee, Haneol | Sungkyunkwan University |
| Coutinho, Altair | Istituto Italiano Di Tecnologia |
| Park, Jongsam | STC Engineering |
| Jung, Byungtaek | STC Engineering |
| Koo, Seongyong | University of Bonn |
| Kim, Kyung-Hoon | CJ Logistics |
| Rodrigue, Hugo | Sungkyunkwan University |
| Moon, Hyungpil | Sungkyunkwan University |
Keywords: Logistics, Service Robotics, Autonomous Agents
Abstract: This article presents ROGIBOT, an autonomous robot designed for high-speed unloading inside standard containers. Previous unloading robots have typically been limited to handling only box-shaped packages and a small number of items at a time. These limitations hinder their applicability in general logistics environments, which involve a wide variety of package types and require high throughput. To address these challenges, we propose a novel dual-arm robot equipped with multi-functional end effectors, including a vacuum array, a two-finger gripper, and an on-hand sliding surface. The components are designed to handle the most common types of logistics packages, such as boxes, plastic pouches, and bundle sacks. In particular, the on-hand sliding surface can draw packages efficiently using the frictional force generated by a rotating elastic conveyor belt. This enables the robot to unload multiple packages consecutively at high speed by sweeping the end effectors across the pile. In addition to the hardware, we introduce a package recognition method and a finite-state-machine-based task planner that detects logistics packages and selects actions to maximize operational efficiency. In our evaluation, ROGIBOT was tested in a mock container and achieved a throughput of 2,030 pieces per hour, surpassing both state-of-the-art robots and typical human performance.
|
| |
| 15:00-16:30, Paper ThI2I.424 | Add to My Program |
| Adaptive Robust Control for Rotation Tracking of a Soft Rotary Actuator with Hysteresis Compensation |
|
| Lee, Young Min | SKKU |
| Yun, Yeoil | Sungkyunkwan Univ |
| Moon, Hyungpil | Sungkyunkwan University |
| Choi, Hyouk Ryeol | Sungkyunkwan University |
| Ihn, Yong Seok | Korea Institute of Science and Technology |
| Koo, Ja Choon | Sungkyunkwan University |
Keywords: Soft Sensors and Actuators, Robust/Adaptive Control, Hydraulic/Pneumatic Actuators
Abstract: Precise control of soft pneumatic actuators is impeded by significant nonlinearities, particularly large internal volume variations during actuation, a factor often overlooked in conventional modeling. This paper proposes an adaptive robust control (ARC) framework designed for high-performance, energy-efficient control of soft actuators with non-negligible volume dynamics. The framework integrates a Modified Prandtl-Ishlinskii (MPI) model for hysteresis compensation with a real-time volume estimator using an internal Time-of-Flight (ToF) sensor. The ARC law then systematically handles uncertainties from both valve parameter variations and the volume estimation process. Experimental validation, through direct comparison with a conventional fixed-volume model, demonstrates that this volume-aware approach achieves robust trajectory tracking with significantly reduced control effort and energy consumption. This work establishes that explicitly modeling internal volume dynamics is crucial for developing high-performance control systems for a broad class of soft pneumatic actuators.
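For context, the classical Prandtl-Ishlinskii construction underlying the paper's Modified PI compensator is a weighted superposition of play (backlash) operators. The radii and weights below are arbitrary, and the paper's modifications are not reproduced:

```python
import numpy as np

def play_operator(u, r, y0=0.0):
    """Backlash (play) operator with threshold r: the core element
    of Prandtl-Ishlinskii hysteresis models."""
    y = np.empty_like(u)
    prev = y0
    for i, ui in enumerate(u):
        # Clamp the previous output into the band [u - r, u + r].
        prev = min(max(prev, ui - r), ui + r)
        y[i] = prev
    return y

def pi_hysteresis(u, radii, weights):
    """Classical PI model: weighted superposition of play operators."""
    return sum(w * play_operator(u, r) for w, r in zip(weights, radii))

u = np.sin(np.linspace(0, 4 * np.pi, 400))   # input (e.g., pressure)
radii = np.array([0.05, 0.15, 0.30])
weights = np.array([0.5, 0.3, 0.2])
y = pi_hysteresis(u, radii, weights)         # hysteretic output
```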
|
| |
| 15:00-16:30, Paper ThI2I.425 | Add to My Program |
| Concurrent-Learning Based Relative Localization in Shape Formation of Robot Swarms (I) |
|
| Lv, Jinhu | Beihang University |
| Ze, Kunrui | Beihang University |
| Yue, Shuoyu | University of Cambridge |
| Liu, Kexin | Beihang University |
| Wang, Wei | Beihang University |
| Sun, Guibin | Beihang University |
Keywords: Swarm Robotics, Localization, Distributed Robot Systems
Abstract: In this article, we address the shape formation problem for massive robot swarms in environments where external localization systems are unavailable. Achieving this task effectively with solely onboard measurements is still scarcely explored and faces practical challenges. To solve this challenging problem, we propose the following novel results. Firstly, to estimate the relative positions among neighboring robots, a concurrent-learning based estimator is proposed. It relaxes the persistent excitation condition required by classical estimators such as the least-squares estimator. Secondly, we introduce a finite-time agreement protocol to determine the shape location. This is achieved by estimating the relative position between each robot and a randomly assigned seed robot. The initial position of the seed robot marks the shape location. Thirdly, based on the theoretical results of the relative localization, a novel behavior-based control strategy is devised. This strategy not only enables the adaptive shape formation of large groups of robots but also enhances the observability of inter-robot relative localization. Numerical simulation results are provided to verify the performance of our proposed strategy compared to state-of-the-art ones. Additionally, outdoor experiments on real robots further demonstrate the practical effectiveness and robustness of our methods.
|
| |
| 15:00-16:30, Paper ThI2I.426 | Add to My Program |
| Vision-Based Policy Learning for High-Speed Autonomous Racing |
|
| Xu, Haoran | Zhejiang University |
| Chen, Xianwei | Zhejiang University |
| Lang, Yilin | Zhejiang University |
| Ren, Qinyuan | Zhejiang University |
Keywords: Machine Learning for Robot Control, Motion Control, Reinforcement Learning
Abstract: Motion planning for autonomous vision-based car racing is a challenging task in robotics. Classical racing systems divide the task into numerous submodules, undermining computational efficiency and leading to error propagation. Previous studies have demonstrated impressive reinforcement learning (RL) results for end-to-end autonomous driving. However, RL exhibits poor scalability on high-dimensional data, such as images, and it is challenging to learn optimal racing behaviors due to a lack of global information about the environment. To address these issues, a two-phase learning paradigm is proposed in this work to train a vision-based racing policy. First, RL trains a teacher policy that integrates progress maximization with collision avoidance in the reward function and utilizes privileged information about the racetrack to achieve high-performance racing. Then, a student policy, relying only on an ego-centric depth camera for perception, is trained by distilling racing knowledge from the teacher policy. The student policy achieves high-speed driving, a high success rate, and smooth control in vision-based racing. The proposed approach is validated in simulation and on a real-world 1/10-scale race car, showing that it outperforms previous model-based and learning-based baselines.
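The second phase, distilling a privileged teacher into a depth-only student, reduces to a supervised regression step. The network shapes, input dimensions, and MSE loss below are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

# Minimal privileged-teacher / vision-student distillation step.
teacher = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
student = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128),
                        nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

privileged = torch.randn(16, 64)       # track layout, exact state, ...
depth = torch.randn(16, 1, 32, 32)     # ego-centric depth images

with torch.no_grad():
    target_actions = teacher(privileged)   # RL-trained teacher acts

loss = nn.functional.mse_loss(student(depth), target_actions)
opt.zero_grad()
loss.backward()
opt.step()   # the student imitates the teacher from vision alone
```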
|
| |
| 15:00-16:30, Paper ThI2I.427 | Add to My Program |
| A Proximity-Based Framework for Human-Robot Seamless Close Interactions |
|
| Bertoni, Liana | Italian Institute of Technology |
| Baccelliere, Lorenzo | Istituto Italiano Di Tecnologia |
| Muratore, Luca | Istituto Italiano Di Tecnologia |
| Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Keywords: Human-Robot Collaboration, Collision Avoidance, Human-Aware Motion Planning
Abstract: The administration and monitoring of shared workspaces are crucial for seamlessly integrating robots to operate in close interactions with humans. Adaptive, versatile, and reliable robot movements are key to achieving effective and successful human-robot synergy. In situations involving unexpected or unintended collisions, robots must react appropriately to minimize risks to humans while still staying focused on their primary tasks or safely resuming them. Although collision detection and identification algorithms are well-established, more advanced robot reactions beyond basic stop-and-wait reactions have not yet been widely adopted and understood. This limitation highlights the need for more sophisticated robot responses to better handle complex collision scenarios, ensuring both safety and task continuity. This letter introduces a novel complete robotic system that leverages the potential of on-board proximity sensor equipment to seamlessly furnish compatible robot reactions while operating in close interactions. With on-board distributed proximity sensors, the robot gains a continuous close workspace awareness, facilitating a transparent negotiation of potential collisions while executing tasks. The proposed system and framework are validated in a collaborative industrial task scenario composed of sub-tasks allocated to the human and the robot and performed within shared regions of the workspace, demonstrating the efficacy of the approach.
|
| |
| 15:00-16:30, Paper ThI2I.428 | Add to My Program |
| Confidence-Based Intent Prediction for Teleoperation in Bimanual Robotic Suturing |
|
| Hu, Zhaoyang Jacopo | Imperial College London |
| Xu, Haozheng | Imperial College London |
| Kim, Sion | Imperial College London |
| Li, Yanan | University of Sussex |
| Rodriguez y Baena, Ferdinando | Imperial College, London, UK |
| Burdet, Etienne | Imperial College London |
Keywords: Human-Robot Collaboration, Medical Robots and Systems
Abstract: Robotic-assisted procedures offer enhanced precision, but fully autonomous systems are limited by incomplete task knowledge, difficulties in modeling unstructured environments, and weak generalisation abilities, while fully manual teleoperated systems face challenges such as delay, stability issues, and reduced sensory information. To address these, we developed an interactive control strategy that assists the human operator by predicting their motion plan at both high and low levels. At the high level, a surgeme recognition system is employed through a Transformer-based real-time gesture classification model to dynamically adapt to the operator's actions, while at the low level, a Confidence-based Intention Assimilation Controller adjusts robot actions based on user intent and shared control paradigms. The system is built around a robotic suturing task, supported by sensors that capture the kinematics of the robot and task dynamics. Experiments across users with varying skill levels demonstrated the effectiveness of the proposed approach, showing statistically significant improvements in task completion time and user satisfaction compared to traditional teleoperation.
|
| |
| 15:00-16:30, Paper ThI2I.429 | Add to My Program |
| Optimized Design and Calibration of a Human-Eye-Sized Active Binocular Vision System Based on Spherical Parallel Mechanism |
|
| Wang, Kaifang | CAS |
| Yang, Dongdong | Shanghai Eyevolution Technology |
| Zhang, Li | Anhui Eyevolution Technology |
| Liu, Jun | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
| Zhang, Xiaolin | Shanghai Institute of Microsystem And Information Technology,Chinese Academy of Science |
|
|
| |
| 15:00-16:30, Paper ThI2I.430 | Add to My Program |
| A Real-Time 6-DoF Posture Estimation Method for High-Speed 6-Axis Industrial Manipulator Control Using a 2D Laser Profiler |
|
| Chen, Tao | National Taiwan University |
| Lin, Pei-Chun | National Taiwan University |
Keywords: Range Sensing, Localization, Sensor-based Control
Abstract: 6-DoF posture estimation is a critical technique in robotics. However, a significant gap exists between the two primary approaches—camera-based methods and laser tracker systems—in terms of cost and performance. To bridge this gap, this work proposes a method to calculate the 6-DoF posture of a manipulator’s end-effector using the 2D profile of a custom-designed metallic gauge. The core principle relies on a one-to-one correspondence between the measured profile and the manipulator's posture. A mathematical model was developed to derive a closed-form solution, which is further refined via maximum likelihood estimation to enhance robustness and accuracy. Simulation studies assessed the influence of geometric parameters on estimation accuracy and noise robustness. Real-world experiments demonstrated that the refined solution significantly outperforms the closed-form solution alone. Furthermore, a speed benchmark against a camera-based system highlighted the proposed method's advantage in operating frequency. Finally, the method was successfully integrated into a real-time position control task for a 6-axis industrial manipulator, verifying its practical applicability in real-time robotic control.
|
| |
| 15:00-16:30, Paper ThI2I.431 | Add to My Program |
| HoLoArm: Deformable Arms for Collision-Tolerant Quadrotor Flight |
|
| Pham, Ngoc Quang | Japan Advanced Institute of Science and Technology |
| Eschmann, Jonas | University of California Berkeley |
| Zhou, Yang | New York University |
| Ojeda Olarte, Alejandro | New York University |
| Loianno, Giuseppe | UC Berkeley |
| Ho, Van | Japan Advanced Institute of Science and Technology |
Keywords: Mechanism Design, Soft Robot Applications, Aerial Systems: Mechanics and Control
Abstract: The increasing use of drones in human-centric applications highlights the need for designs that can survive collisions and recover rapidly, minimizing risks to both humans and the environment. We present HoLoArm, a quadrotor with compliant arms inspired by the nodus structure of dragonfly wings. This design provides natural flexibility and resilience while preserving flight stability, which is further reinforced by the integration of a Reinforcement Learning (RL) control policy that enhances both recovery and hovering performance. Experimental results demonstrate that HoLoArm can passively deform in any direction, including axial one, and recover within 0.3-0.6~s depending on the direction and level of the impact. The drone can survive collisions at speeds up to 7.6 m/s and carry a 540 g payload while maintaining stable flight. This work contributes to the morphological design of soft aerial robots with high agility and reliable safety, enabling operation in cluttered and human-shared environments, and lays the groundwork for future fully soft drones that integrate compliant structures with intelligent control.
|
| |
| 15:00-16:30, Paper ThI2I.432 | Add to My Program |
| Low-Latency Event-Based Velocimetry for Quadrotor Control in a Narrow Pipe |
|
| Bauersfeld, Leonard | University of Zurich (UZH) |
| Scaramuzza, Davide | University of Zurich |
Keywords: Aerial Systems: Mechanics and Control, Computer Vision for Other Robotic Applications, Sensor-based Control, Fluid Mechanics
Abstract: Autonomous quadrotor flight in confined spaces such as pipes and tunnels presents significant challenges due to unsteady, self-induced aerodynamic disturbances. Very recent advances have enabled flight in such conditions, but they either rely on constant motion through the pipe to mitigate airflow recirculation effects or suffer from limited stability during hovering. In this work, we present the first closed-loop control system for quadrotor hovering in narrow pipes that leverages real-time flow field measurements. We develop a low-latency, event-based smoke velocimetry method that estimates local airflow at high temporal resolution. This flow information is used by a disturbance estimator based on a recurrent convolutional neural network, which infers force and torque disturbances in real time. The estimated disturbances are integrated into a learning-based controller trained via reinforcement learning. To the best of our knowledge, this work represents the first demonstration of an aerial robot with closed-loop control informed by real-time flow field measurements.
|
| |
| 15:00-16:30, Paper ThI2I.433 | Add to My Program |
| Hybrid Soft-Rigid Elbow Exosuit: Theory, Mechatronic Design, and Experimental Assessment |
|
| KhalilianMotamed Bonab, Ali | Scuola Superiore Sant'Anna |
| Camardella, Cristian | Scuola Superiore Sant'Anna |
| Frisoli, Antonio | Scuola Superiore Sant'Anna |
| Chiaradia, Domenico | Scuola Superiore Sant'Anna, Institute of Intelligent Mechanics, Pisa |
|
|
| |
| 15:00-16:30, Paper ThI2I.434 | Add to My Program |
| Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking |
|
| Pätzold, Bastian | University of Bonn |
| Nogga, Jan | University of Bonn |
| Behnke, Sven | University of Bonn |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, Visual Tracking
Abstract: Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking leverages their strengths while mitigating these drawbacks. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing segmentation masks and tracking. Once initialized, this model directly extracts segmentation masks, processing image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments.
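The described pipeline can be summarized as a skeleton in which a slow VLM stage initializes labels, an open-vocabulary detector grounds them, and a video segmentation model carries tracking in real time. Everything below (vlm_describe, detect_boxes, segment_and_track, the Track type) is a hypothetical stand-in for the actual models:

```python
from dataclasses import dataclass

@dataclass
class Track:
    label: str
    box: tuple
    mask: object = None

def vlm_describe(frame):
    # A VLM would return structured descriptions of visible instances.
    return [{"label": "mug", "attributes": ["blue", "graspable"]}]

def detect_boxes(frame, labels):
    # An open-vocabulary detector grounds the labels as bounding boxes.
    return [Track(label=l, box=(10, 10, 50, 50)) for l in labels]

def segment_and_track(stream, tracks):
    # A video segmentation model propagates masks frame-to-frame.
    for frame in stream:
        for t in tracks:
            t.mask = f"mask({t.label})"   # placeholder mask
        yield tracks

frames = [f"frame{i}" for i in range(3)]
descriptions = vlm_describe(frames[0])            # slow, runs sparsely
tracks = detect_boxes(frames[0], [d["label"] for d in descriptions])
for live_tracks in segment_and_track(frames, tracks):
    pass  # masks now update in real time; re-describe only as needed
```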
|
| |
| ThI2LB Late Breaking Results Session, Hall C |
Add to My Program |
| Late Breaking Results 6 |
|
| |
| |
| 15:00-16:30, Paper ThI2LB.1 | Add to My Program |
| A Display-Based Testbed for Bridging the Gap between Drone Simulation and Real-World Testing |
|
| Cai, Kai | Osaka Metropolitan University |
| Hara, Renma | Osaka Metropolitan University |
Keywords: Education Robotics, Discrete Event Dynamic Automation Systems, Task Planning
Abstract: Real-world drone experiments are essential but often constrained by safety, regulations, and space, while simulations fail to capture key physical effects. This paper presents a compact indoor testbed that bridges this gap by integrating physical drone dynamics with display-based virtual environments. A real drone interacts with displayed scenes on screens, enabling closed-loop image-based control. By combining physical motion with virtual environment shifting, the system supports flexible and controllable experiments. The effectiveness of the proposed testbed is demonstrated through three scenarios: drone delivery, farmland monitoring, and disaster response.
|
| |
| 15:00-16:30, Paper ThI2LB.2 | Add to My Program |
| Real-Time 3D Needle-To-Retina Pose Sensing Using Triple OCT Fibers for Tremor-Compensated Subretinal Injection |
|
| Cho, Gichan | DGIST |
| Im, Jintaek | Massachusetts General Hospital |
| Song, Cheol | DGIST |
| |
| 15:00-16:30, Paper ThI2LB.3 | Add to My Program |
| Uncertainty-Guided Proactive Adaptation for Visual-Inertial SLAM |
|
| Khan, Ehsan Ullah | Chungbuk National University |
| Kim, Gon-Woo | Chungbuk National University |
Keywords: Visual-Inertial SLAM, SLAM, Planning under Uncertainty
Abstract: Visual-inertial SLAM systems often fail in feature-poor environments such as corridors and textureless walls, leading to catastrophic tracking loss. Existing methods detect degradation reactively after failure occurs, leaving little opportunity for corrective action. We propose a proactive framework that predicts feature degradation 1-2 seconds in advance and adapts sensor fusion weights through uncertainty-guided decisions. Through a systematic comparison of eight temporal architectures across 15,233 sequences, including real robot data, we identify LSTM as the most robust predictor (26.77 MAE). We incorporate uncertainty estimation using Monte Carlo Dropout to enable confidence-aware adaptation thresholds that prevent false adjustments. Our approach provides a foundation for proactive SLAM failure prevention through principled sensor fusion and real-time system adaptation.
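Monte Carlo Dropout, as used here for confidence-aware thresholds, can be sketched generically: keep dropout active at inference and read uncertainty from the spread of repeated stochastic passes. The toy LSTM below is illustrative and is not the paper's 26.77-MAE predictor:

```python
import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    """Toy LSTM regressor predicting future feature counts from a
    window of past counts (architecture choices are illustrative)."""
    def __init__(self, hidden=32, p=0.2):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.drop = nn.Dropout(p)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(self.drop(out[:, -1]))

def mc_dropout_predict(model, x, n_samples=30):
    """Monte Carlo Dropout: dropout stays active at inference; the
    spread of stochastic passes serves as predictive uncertainty."""
    model.train()  # keeps dropout enabled
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)

x = torch.randn(1, 20, 1)            # 20-step window of feature counts
mean, std = mc_dropout_predict(FeaturePredictor(), x)
# A confidence-aware threshold would only adapt fusion weights when
# 'std' is low enough to trust the predicted degradation.
```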
|
| |
| 15:00-16:30, Paper ThI2LB.4 | Add to My Program |
| A MATLAB/Simulink-Based Sim2Real Control Framework for the Unitree G1 Using ROS 2 and MuJoCo |
|
| Trochez, Leffer | Universidad De Los Andes |
| Quijano, Nicanor | Universidad De Los Andes |
| Lopez-Jimenez, Jorge | Universidad De Los Andes |
| Rodriguez Herrera, Carlos Francisco | Universidad De Los Andes |
Keywords: Humanoid Robot Systems, Software, Middleware and Programming Environments, Hardware-Software Integration in Robotics
Abstract: Control research on humanoid robots requires environments where control strategies can be designed, tuned, debugged, validated, and transferred to real hardware with minimal friction. For commercial platforms such as the Unitree G1, this process is often fragmented across separate tools for modeling, simulation, communication, visualization, and deployment, slowing down controller development and experimental iteration. This work presents a MATLAB/Simulink-based Sim2Real framework for the Unitree G1 that integrates MuJoCo and ROS 2 into a unified workflow for control-oriented research. The framework is organized around a modular Variant Subsystem that switches between a MuJoCo simulation backend and a ROS 2 real-robot backend while preserving compatible interfaces, enabling reuse of high-level control logic, monitoring, and system instrumentation across both domains. This is especially valuable for robotics control, where rapid prototyping, closed-loop debugging, structured block-based design, and repeatable validation are critical for moving from controller concept to hardware testing. As a representative example, a standard ankle motion command task was validated both in simulation and on the real robot. The proposed framework establishes a practical model-based environment for implementing and evaluating control strategies on the Unitree G1, while providing an extensible basis for future estimation, stabilization, perception, and humanoid control modules.
|
| |
| 15:00-16:30, Paper ThI2LB.5 | Add to My Program |
| Lateral Reciprocal Collision Avoidance: A Probabilistic Social-Norm-Inspired Strategy for Deadlock-Free Multi-Robot Navigation |
|
| Lu, Siyi | Beihang University |
| Ruan, Sipu | Beihang University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Collision Avoidance, Distributed Robot Systems
Abstract: Safe and efficient obstacle avoidance in multi-robot navigation is a challenging problem, with deadlock being a key issue due to strict collision-free constraints. This paper proposes a novel Lateral Reciprocal Collision Avoidance (LRCA) strategy based on velocity obstacle theory to mitigate deadlock among multiple robots. Inspired by pedestrian collision avoidance, LRCA incorporates symmetric lateral displacement to resolve deadlock. Unlike the Optimal Reciprocal Collision Avoidance (ORCA) algorithm, which computes velocity constraints based on the minimal adjustment across the full velocity obstacle, our proposed method restricts the velocity changes of the agents to one randomly selected side of the relative velocity. This randomized directional selection strategy effectively prevents deadlock while preserving collision avoidance, since it avoids conflicting velocity changes that could lead to mutual trapping. Theoretical analysis, based on quadratic programming, Lagrangian functions, and KKT conditions, shows how ORCA causes deadlock and how LRCA effectively prevents it. Extensive simulations across four benchmark multi-robot navigation scenarios show that LRCA outperforms existing algorithms in success rate, time to goal, path length, and computational efficiency.
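A rough sketch of the lateral idea, under loose assumptions: agents shift their velocities to one randomly chosen side of the relative velocity instead of taking ORCA's minimal adjustment. The gain profile below is invented for illustration and omits the paper's half-plane constraint construction:

```python
import numpy as np

rng = np.random.default_rng(1)

def lateral_avoidance_velocity(v_pref, p_rel, v_rel):
    """Illustrative core of a lateral avoidance rule (not the paper's
    exact constraints): displace the preferred velocity to one randomly
    chosen side of the relative velocity, scaled by proximity."""
    heading = v_rel / (np.linalg.norm(v_rel) + 1e-9)
    lateral = np.array([-heading[1], heading[0]])   # perpendicular
    side = rng.choice([-1.0, 1.0])                  # random side pick
    gain = 1.0 / (np.linalg.norm(p_rel) + 1e-9)     # stronger when close
    return v_pref + side * gain * lateral

v_new = lateral_avoidance_velocity(
    v_pref=np.array([1.0, 0.0]),   # head-on encounter example
    p_rel=np.array([2.0, 0.0]),
    v_rel=np.array([2.0, 0.0]),
)
```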
|
| |
| 15:00-16:30, Paper ThI2LB.6 | Add to My Program |
| State-Space Time Surfaces for Event-Based Zero-Shot Robotic Grasping and Scene Reconstruction |
|
| Gong, Gu | Hong Kong Polytechnic University |
| Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Sensor-based Control
Abstract: Event cameras report per-pixel brightness changes asynchronously with microsecond latency, but their output is incompatible with vision foundation models trained on conventional images. We propose State-Space Time Surfaces (S3TS), a training-free representation that recasts exponential-decay time surfaces as a diagonal state-space model with multi-scale temporal channels and Mamba-inspired selective decay. The resulting pseudo-RGB image is fed directly to a frozen OWLv2 detector for zero-shot, text-prompted object detection from events alone. We demonstrate two applications on a 6-DOF manipulator: event-only grasping with near-nadir refinement, and dense 3D scene reconstruction via multi-view TSDF fusion with neuromorphic surface descriptors. S3TS detects over twice as many objects as single-channel event representations and produces faithful 3D workspace meshes.
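The core recurrence, exponential-decay time surfaces viewed as a diagonal state-space model with one channel per time constant, can be written compactly. The selective, Mamba-inspired decay the paper adds on top is not reproduced here, and the time constants are arbitrary:

```python
import numpy as np

def multiscale_time_surfaces(events, shape, taus=(0.01, 0.05, 0.2)):
    """Exponential-decay time surfaces as a diagonal linear state-space
    recurrence, one channel per decay constant (illustrative sketch).

    events: iterable of (t, x, y) with t in seconds, sorted by time.
    """
    H, W = shape
    state = np.zeros((len(taus), H, W))
    last_t = 0.0
    for t, x, y in events:
        dt = t - last_t
        for c, tau in enumerate(taus):
            state[c] *= np.exp(-dt / tau)   # diagonal state transition
            state[c, y, x] = 1.0            # impulse input at event pixel
        last_t = t
    return state  # stack the channels as a pseudo-RGB image

events = [(0.001, 5, 5), (0.004, 6, 5), (0.010, 7, 6)]
surfaces = multiscale_time_surfaces(events, shape=(32, 32))
```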
|
| |
| 15:00-16:30, Paper ThI2LB.7 | Add to My Program |
| Time-Division Multimodal Tactile Perception for Physical AI and Robotic Hands |
|
| Kim, Dohyung | Seoul National University |
| Kim, Kyun Kyu | Stanford University |
| Bang, Junhyuk | California Institute of Technology |
| Ko, Seung Hwan | Seoul National University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Biologically-Inspired Robots
Abstract: Equipping robotic end-effectors with human-like tactile perception is crucial for dexterous manipulation, requiring simultaneous thermal and mechanical sensing at the contact interface. Conventional multimodal sensors often rely on stacked or patterned layers, which increase device thickness, reduce conformability on curved robotic fingers, and introduce response delays. To address this, we present a time-division tactile perception platform tailored for robotic applications that utilizes memristive Ag-Cu2O core-sheath nanowire networks. This ultrathin artificial skin alternates between thermal and mechanical modalities at 16 Hz via memristive transitions, mirroring the processing of biological mechanoreceptors. In the SET state, sparse silver filaments form a mechanically sensitive network. During RESET, the semiconducting Cu2O sheath provides high thermal sensitivity. Lacking reactive components, the sensor achieves sub-microsecond mechanical and millisecond thermal responses, ideal for real-time robotic feedback. A deep learning pipeline processing these time-division signals improved object classification accuracy to 95%. Using a wireless module, 20 household objects were recognized with 83% accuracy. This single-layer architecture enables direct, seamless integration onto robotic hands, laying the groundwork for multimodal tactile intelligence in physical AI.
|
| |
| 15:00-16:30, Paper ThI2LB.8 | Add to My Program |
| Low-Latency VR Telepresence for Remote Inspection in Fence-Free Collaborative Manufacturing |
|
| Svediroh, Stanislav | Brno University of Technology |
Keywords: Telerobotics and Teleoperation, Virtual Reality and Interfaces, Human-Robot Collaboration
Abstract: Fence-free collaborative manufacturing lets workers and machines share space, but autonomous safety monitoring cannot handle every situation alone. When an anomaly is flagged - unauthorized access, ambiguous sensor data, or unexpected worker behavior - a human operator must visually assess the scene. Our open-source framework deploys a mobile inspection robot controlled through an immersive VR headset. The operator sees through the robot's cameras with low-latency head-coupled video, navigates to the scene, and assesses the situation remotely - closing the loop between autonomous detection and human decision-making.
|
| |
| 15:00-16:30, Paper ThI2LB.9 | Add to My Program |
| Cascading Velocity Modulation for Multi-Agent Path Finding Execution |
|
| Park, SeungHyun | Tech University of Korea |
| Shim, Jae Hong | Tech University of Korea |
| Eoh, Gyuho | Tech University of Korea |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Collision Avoidance
Abstract: Multi-Agent Path Finding (MAPF) plans are increasingly deployed on real multi-robot fleets, where communication dropouts, actuator faults, and sensor noise routinely cause individual robots to deviate from the planned trajectory. We propose Cascading Velocity Modulation (CVM), a continuous execution controller that maps the temporal margin on each dependency edge into a proportional velocity command and propagates an exponentially attenuated damping signal along the dependency chain. CVM runs a three-step control loop: self-recovery, direct cushioning, and cascade propagation. CVM reduces the makespan by about 25 percent on average compared to a binary baseline, over ten randomized scenarios with 5 to 8 agents, each containing a malfunctioning agent that suffers an unexpected delay. An experiment with eight e-puck2 robots reproduces about a 35 percent reduction under two simultaneously malfunctioning agents.
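The cascading idea can be illustrated with a toy controller: each robot's speed is scaled by the temporal margin on its incoming dependency edge, and a damping signal decays exponentially down the dependency chain. The mapping, parameters, and update rule below are assumptions, not CVM's actual control law:

```python
import numpy as np

def cascading_velocity_modulation(margins, chain, v_nominal=1.0,
                                  margin_ref=2.0, attenuation=0.5):
    """Toy sketch of cascading modulation along a dependency chain.

    margins: temporal margin (s) on each robot's incoming edge.
    chain:   robot ids in downstream dependency order.
    """
    commands = {}
    damping = 0.0
    for robot in chain:
        m = margins.get(robot, margin_ref)
        # Proportional mapping: shrink speed as the margin shrinks.
        scale = np.clip(m / margin_ref, 0.0, 1.0)
        scale *= (1.0 - damping)             # cushion from upstream delay
        commands[robot] = v_nominal * scale
        # Propagate an exponentially attenuated damping signal.
        damping = attenuation * (damping + (1.0 - scale))
    return commands

# r1 is delayed; r2 and r3 cushion with progressively weaker slowdowns.
cmds = cascading_velocity_modulation(
    margins={"r1": 0.5, "r2": 1.8, "r3": 2.0},
    chain=["r1", "r2", "r3"],
)
```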
|
| |
| 15:00-16:30, Paper ThI2LB.10 | Add to My Program |
| Diffusion Policy for Robot-Assisted Dressing with Moving Human Arms |
|
| Sun, Haoxiang | The Hong Kong Polytechnic University |
| Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Human-Centered Robotics, Imitation Learning, Physical Human-Robot Interaction
Abstract: Robot-assisted dressing remains challenging due to the close physical human–robot interaction and the highly deformable nature of garments. This work presents a purely vision-based approach that transfers human-mastered dressing skills to robots while accommodating dynamic human arm movements. The proposed method adopts a hierarchical structure. At the high level, a diffusion model serves as the policy to learn action distributions conditioned on point cloud observations. During execution, a diffused scalar field is constructed to infer an object-centric axial distribution of the human arm from cluttered points. Local point cloud registration across consecutive frames further captures arm motion, enabling real-time adaptation of robot actions to user dynamics. Comprehensive evaluations have been conducted in both simulation and real-world dressing scenarios using a UR10e robot with human participants of diverse genders and body types.
|
| |
| 15:00-16:30, Paper ThI2LB.11 | Add to My Program |
| Robust Unknown Object Detection and Tracking for Vision-Language-Action Models on Edge Devices |
|
| Joo, Subin | Korea Institute of Machinery and Metals |
| Jeung, Deokgi | Korea Institute of Machinery and Materials |
Keywords: Humanoid Robot Systems, Vision-Based Navigation, Computer Vision for Manufacturing
Abstract: This study proposes a Stepwise Vision-Language-Action (VLA) framework for the robust detection and tracking of unknown objects in edge device environments (NVIDIA Jetson AGX Orin). Conventional end-to-end VLA models face challenges such as massive memory requirements and a "black-box" nature that complicates debugging. To address these issues, we adopt a modular architecture, specifically integrating Depth-Guided Gaussian Sampling with MobileSAM in the vision module. This approach achieves over 99% detection success for unlearned objects. Furthermore, we demonstrate real-time 6-DOF pose tracking at over 30 FPS through ORB feature matching and ROI-based localization following the initialization phase.
|
| |
| 15:00-16:30, Paper ThI2LB.12 | Add to My Program |
| Heterogeneous Skill Learning for Asynchronous Multi-Robot Relay Pushing in Complex Environments |
|
| Zhi, Hui | The Hong Kong Polytechnic University |
| Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Mobile Manipulation, Deep Learning in Grasping and Manipulation, Multi-Robot Systems
Abstract: This paper presents a heterogeneous skill learning framework for asynchronous multi-robot relay pushing in complex and cluttered environments. To support cooperative relay transportation, we construct a skill library comprising room-robot pushing, corridor-helper pushing, and standby behaviors. We further propose a geometry-aware pushing strategy that enables contact-rich manipulation without relying on external force sensors. For the room robot, curriculum learning is adopted to decompose training into an approach-to-parcel phase and a parcel-to-target pushing phase, thereby improving training stability and task progression. For long-horizon transportation in constrained corridors, an affordance network is introduced to model the local feasibility of pushing actions, providing structured guidance that improves policy learning efficiency. The overall framework combines Soft Actor-Critic (SAC) with Dijkstra-based reachability maps to coordinate the "Room Robot Pushing" and "Corridor Helper Pushing" skills. Experimental results demonstrate high success rates across progressive curriculum lessons, suggesting that the proposed framework provides an effective skill primitive for cooperative multi-robot transportation.
|
| |
| 15:00-16:30, Paper ThI2LB.13 | Add to My Program |
| Enhancing Null Space Exploitation within TDPA through Online Optimization Based on Maximum Constraints Capabilities |
|
| Celli, Camilla | Scuola Superiore Sant'Anna |
| Filippeschi, Alessandro | Scuola Superiore Sant'Anna |
| Porcini, Francesco | PERCRO Laboratory, TeCIP Institute, Sant’Anna School of Advanced Studies, Pisa |
| Frisoli, Antonio | TeCIP Institute, Scuola Superiore Sant'Anna |
| |
| 15:00-16:30, Paper ThI2LB.14 | Add to My Program |
| Inverse Reachability Map Guided Motion Planning of Mobile Manipulator |
|
| Choi, JungHyun | University of Seoul |
| Lee, Taegyeom | University of Seoul |
| Hwang, Myun Joong | University of Seoul |
Keywords: Mobile Manipulation, Motion Control, Whole-Body Motion Planning and Control
Abstract: Mobile manipulators must coordinate end-effector (EE) tracking and mobile base motion to perform manipulation tasks robustly. However, even when the same EE trajectory is feasible, different base poses can lead to substantially different manipulator configurations, manipulability levels, and proximity to singularities. Thus, accurate EE tracking does not guarantee kinematically suitable whole-body behavior. To address this issue, a hierarchical framework is proposed that combines 1) a manipulator controller for EE tracking that accounts for base motion, 2) an inverse reachability map (IRM) that encodes kinematically feasible base regions for the current and predicted EE states, and 3) a model predictive controller (MPC) that optimizes base velocity using the IRM as a soft cost. In the proposed architecture, the manipulator executes the task, the IRM evaluates which base regions are more reachable for the task, and the MPC generates base motion accordingly. Simulation results demonstrate that the proposed method improves manipulability while maintaining accurate EE tracking, highlighting the importance of reachability-aware base behavior in mobile manipulation.
|
| |
| 15:00-16:30, Paper ThI2LB.15 | Add to My Program |
| Mitigating Dissipative Artifacts in Long-Delay Bilateral Teleoperation through Optimal Prioritized Dissipation under Actuator Constraints |
|
| Celli, Camilla | Scuola Superiore Sant'Anna |
| Filippeschi, Alessandro | Scuola Superiore Sant'Anna |
| Porcini, Francesco | PERCRO Laboratory, TeCIP Institute, Sant’Anna School of Advanced Studies, Pisa |
| Frisoli, Antonio | TeCIP Institute, Scuola Superiore Sant'Anna |
| |
| 15:00-16:30, Paper ThI2LB.16 | Add to My Program |
| High-Stiffness Capacitive Torque Sensor Based on a Hybrid Scott-Russell and Parallelogram Mechanism |
|
| Sim, Jae Yoon | Sungkyunkwan University, AIDIN ROBOTICS |
| Lee, Seung Yeon | Sungkyunkwan University |
| Seok, Dong-Yeop | AIDIN ROBOTICS Inc |
| Kim, Yong Bum | Sungkyunkwan University |
| Choi, Hyouk Ryeol | Sungkyunkwan University |
Keywords: Force and Tactile Sensing, Cooperating Robots
Abstract: While joint torque sensors enable precise robot interactions, insufficient structural stiffness significantly limits control bandwidth and accuracy by reducing overall system rigidity. This study proposes a high-stiffness torque sensor based on a hybrid Scott-Russell (SR) and parallelogram (PL) flexure mechanism. The SR structure performs mechanical displacement amplification, ensuring high sensitivity even within a rigid design. By integrating the PL mechanism, the inherent parasitic rotation typically observed in conventional SR structures is effectively suppressed, ensuring pure translational motion between the capacitive electrodes. This hybrid flexure maximizes the capacitance change and achieves high sensing sensitivity while maintaining the high structural stiffness required for robust robotic joints. The proposed mechanism is validated through simulation, demonstrating its potential to ensure both system-level rigidity and high-resolution torque sensing.
|
| |
| 15:00-16:30, Paper ThI2LB.17 | Add to My Program |
| Compact Four-Degree-Of-Freedom Fingertip Feedback Device with Large Range of Motion |
|
| Byeon, Seongju | Seoul National University |
| Kwon, Jayhyun | Seoul National University |
| Han, Amy Kyungwon | Seoul National University |
Keywords: Haptics and Haptic Interfaces, Mechanism Design
Abstract: This work presents a compact, four-degree-of-freedom fingertip cutaneous feedback device capable of rendering normal force, shear, and rotational cues over a large range of motion. Based on a tendon-driven truss mechanism, the device achieved high repeatability in rotational and shear motions, with consistent positional errors that could be addressed through feedforward compensation. A proof-of-concept user study with additional z-axis actuation demonstrated reliable perception of all three tactile cues, achieving accuracies of 98.3% for normal force, 77.5% for rotation, and 91.2% for shear. These results support the feasibility of the proposed mechanism as a compact multi-modal tactile interface for a future handheld haptic device.
|
| |
| 15:00-16:30, Paper ThI2LB.18 | Add to My Program |
| From DMs to Drones: Weaponizing Discord As a Covert Command-And-Control RAT for UAV Hijacking |
|
| Nhlabatsi, Armstrong | Qatar University |
| Sleiman, Hadi Hassan | Qatar University |
| |
| 15:00-16:30, Paper ThI2LB.19 | Add to My Program |
| Cable-Driven Parallel Robot-Based Needle Steering for Imaging-Compatible Interventional Procedures |
|
| Son, Seongho | Chonnam National University |
| Hong, Ayoung | Chonnam National University |
Keywords: Surgical Robotics: Steerable Catheters/Needles
Abstract: Accurate needle steering using bevel-tip needles remains challenging due to nonlinear needle-tissue interactions and structural limitations of conventional robotic insertion systems in imaging-guided environments. This paper presents a cable-driven parallel robot (CDPR)-based needle steering framework that enables curvature-induced steering through coordinated control of the needle base pose. The proposed system provides 6-DoF needle orientation control using eight cables and an additional Bowden cable mechanism for axial rotation. Phantom insertion experiments demonstrate that steering direction can be regulated by bevel-tip orientation and that obstacle-avoidance insertion toward a desired target location is achievable. These results confirm the feasibility of CDPR-based needle steering for imaging-compatible minimally invasive intervention scenarios.
|
| |
| 15:00-16:30, Paper ThI2LB.20 | Add to My Program |
| Rapidly Deployable Nasal Surgery Tools Using Magnetic Soft Continuum Robots |
|
| Smith, Griffin | McMaster University |
| Gleason, Colton | McMaster University |
| Khalife, Sarah | McMaster University |
| van der Woerd, Benjamin | McMaster University |
| Onaizah, Onaizah | McMaster University |
| |
| 15:00-16:30, Paper ThI2LB.21 | Add to My Program |
| Passive Torsional Compliance for Dynamic Stability Improvement of a Curved-Spoke Tri-Wheel |
|
| Jeong, Sunbeom | Busan National University |
| Kim, Youngsoo | Pusan National University |
Keywords: Compliant Joints and Mechanisms, Mechanism Design
Abstract: The curved-spoke tri-wheel (CSTW) has been proposed as a simple mechanism for overcoming stair-like unstructured obstacles with fast, pure rolling motion. However, when the contact point transitions from one spoke to the next during climbing, the discontinuity in the radius of curvature between adjacent spokes causes a sudden drop in the effective rotational radius. As a result, the robot’s linear speed drops abruptly, which can induce dynamic instability such as payload disturbance and slip, making stable climbing difficult to maintain. To mitigate these issues, we propose a passive Compliant Spiral Torsional Suspension (C-STS) placed between the motor and the wheel’s drive axis to reduce the transition-induced deceleration. Using camera-based marker tracking, we obtain wheel center-of-rotation (COR) velocities under low, medium, and high torsional stiffness and speed conditions, and quantify dynamic stability using the deceleration associated with the velocity drop at each contact transition. Comparing cases with and without the C-STS, the results reveal that the proposed C-STS effectively reduces deceleration under appropriate stiffness-speed combinations, owing to its torsional compliance and the release of spring energy during the transition. Although the proposed C-STS was effective only under a limited range of conditions, it shows potential for further refinement through hybrid dynamic modeling and for application to other reconfigurable wheel systems with similar instabilities.
|
| |
| 15:00-16:30, Paper ThI2LB.22 | Add to My Program |
| A Multifunctional Capsule and Magnetic Navigation Platform for Controlled Actuation and Task Execution in GI Environments |
|
| Abu-Shaera, Razan | McMaster University |
| Gupta, Shivam | McMaster University |
| Palanichamy, Veerash | McMaster University |
| Onaizah, Onaizah | McMaster University |
Keywords: Embedded Systems for Robotic and Automation, Medical Robots and Systems, Actuation and Joint Mechanisms
Abstract: Wireless capsule endoscopy provides a minimally invasive method for examining the gastrointestinal (GI) tract; however, most existing systems are limited to passive operation and single functions, restricting control and functionality. This work presents the design, fabrication, and experimental evaluation of a multifunctional magnetically actuated capsule for drug delivery, sampling, and cargo transport. The capsule incorporates a novel spring–magnet mechanism that enables controlled, repeatable opening and closing under external magnetic fields using a single actuation input. In parallel, a large-workspace magnetic actuation platform is developed to support autonomous navigation and task execution. Iterative capsule designs improved fabrication and sealing performance, guided by analytical modeling. Experimental results demonstrate a substantial reduction in the required magnetic field for actuation (from 38.3 ± 7.7 mT to 12.7 ± 2.5 mT), alongside an approximately 4-fold reduction in leakage (6.19% vs. 23.59%). The actuation platform achieved accurate path tracking with a mean deviation of 2.63 mm across multiple trajectories and enabled navigation in a stomach phantom. These results demonstrate the feasibility of a multifunctional capsule platform with integrated actuation for minimally invasive GI interventions.
|
| |
| ThBT1 Regular Session, Hall A2 |
Add to My Program |
| Human-Robot Interaction |
|
| |
| |
| 16:45-16:55, Paper ThBT1.1 | Add to My Program |
| Human-Centered Development of Guide Dog Robots: Quiet and Stable Locomotion Control |
|
| Yu, Shangqun | University of Massachusetts Amherst |
| Hwang, Hochul | University of Massachusetts Amherst |
| Dang, Trung | University of Massachusetts Amherst |
| Biswas, Joydeep | The University of Texas at Austin |
| Giudice, Nicholas | University of Maine |
| Lee, Sunghoon Ivan | UMass Amherst |
| Kim, Donghyun | University of Massachusetts Amherst |
Keywords: Human-Centered Robotics, Legged Robots, Design and Human Factors
Abstract: A quadruped robot is a promising system that can offer assistance comparable to that of guide dogs due to its similar form factor. However, various challenges remain in making these robots a reliable option for blind and low-vision (BLV) individuals. Among these challenges, noise and jerky motion during walking are critical drawbacks of existing quadruped robots. While these issues have largely been overlooked in guide dog robot research, our interviews with guide dog handlers and trainers revealed that acoustic and physical disturbances can be particularly disruptive for BLV individuals, who rely heavily on environmental sounds for navigation. To address these issues, we developed a novel walking controller for slow stepping and smooth foot swing/contact while maintaining human walking speed, as well as robust and stable balance control. The controller integrates with a perception system to facilitate locomotion over non-flat terrains, such as stairs. Our controller was extensively tested on the Unitree Go1 robot and, when compared with other control methods, demonstrated significant noise reduction, producing half the noise of the default locomotion controller. To evaluate the usability, workload, and perceived noise of the developed system from a user’s perspective, we conducted indoor walking experiments in which participants compared our controller with the robot’s default controller. The results demonstrated higher user acceptance of our controller, highlighting its potential to improve the overall user experience of robotic guide dogs.
|
| |
| 16:55-17:05, Paper ThBT1.2 | Add to My Program |
| A Cyclic Adaptation-Generalization Framework with Uncertainty-Guided Self-Paced Learning for Long-Term Brain-Machine Interfaces |
|
| Wei, Jiyu | Zhejiang University |
| Hong, Di | Zhejiang University |
| Zhang, Zhanjie | Zhejiang University |
| Rong, Dazhong | Zhejiang University |
| He, Qinming | Zhejiang University |
| Wang, Yueming | Zhejiang University |
Keywords: Brain-Machine Interfaces, Rehabilitation Robotics, Transfer Learning
Abstract: Brain-Machine Interfaces (BMIs), which link the brain to external devices, hold great potential in rehabilitation, human performance augmentation, and human-centered robotics. However, invasive BMIs face a critical challenge for long-term deployment due to neural drift, which degrades decoding performance over time and necessitates frequent recalibration. Existing methods designed to mitigate neural drift typically rely on either domain adaptation (DA) or domain generalization (DG) alone and often fail to capture fine-grained distribution shifts across neural subdomains, resulting in limited performance. To overcome these limitations, we propose Uncertainty-guided Self-paced Cycling (UnSPC), a robust framework that synergizes DA and DG for target-domain refinement under an Uncertainty-guided Self-paced Pseudo-labeling (UnSPL) mechanism. To handle subdomain neural drift across domains, UnSPL iteratively mines reliable pseudo-labeled samples with a noise-robust ranking strategy for further fine-tuning. Leveraging these high-quality samples, we introduce a novel Cycling Adaptation and Generalization (CycAG) strategy, which integrates DA and DG within an iterative cycle to progressively mitigate both global and subdomain drift. This cyclic process enables effective alignment to evolving target distributions while preserving robust and transferable representations, thereby mitigating performance degradation under long-term neural drift. Extensive experiments on multiple neural decoding datasets demonstrate the effectiveness and robustness of UnSPC. To our knowledge, UnSPC is the first method to cyclically integrate DA and DG with pseudo-labeling, paving the way toward stable long-term BMI control.
|
| |
| 17:05-17:15, Paper ThBT1.3 | Add to My Program |
| Cooperation or Collaboration? on a Human-Inspired Impedance Strategy in a Human-Robot Co-Manipulation Task |
|
| Vianello, Lorenzo | Shirley Ryan Ability Lab |
| Gomes, Waldez | Université Paris-Saclay |
| Aubry, Alexis | Université De Lorraine |
| Maurice, Pauline | Cnrs - Loria |
| Ivaldi, Serena | INRIA |
Keywords: Physical Human-Robot Interaction, Human-Robot Collaboration, Human Factors and Human-in-the-Loop
Abstract: Robotic manipulators have the capability to engage in physical interaction with human operators, sharing not only the same workspace but also offering physical assistance to alleviate the human physical workload. In this study, we explore whether a robot should act as a collaborator or a cooperator in a co-manipulation task with a human partner, and investigate different collaboration strategies. In a previous study, we addressed the same question for a human–human dyad and found that collaboration is preferable to make fewer errors at the expense of increased arm stiffness for the humans. In our current investigation, a human physically interacts with a Franka robot in various co-manipulation conditions. In the cooperation conditions, the robot is either a leader or a follower, exhibiting fixed impedance strategies. In the collaborative conditions, the robot exhibits either reciprocal or mirrored adaptive impedance strategies that vary according to an online EMG-based function of the human arm stiffness. Our findings indicate that, for co-manipulation tasks, a robot collaborator is preferable to a robot cooperator (leader or follower), similarly to human dyads. However, unlike the behavior observed within human dyads, the reciprocal strategy for impedance adjustment appears to be the most effective for human–robot collaboration.
|
| |
| 17:15-17:25, Paper ThBT1.4 | Add to My Program |
| Flying Together: Human-Guided Immersive Shared Control for Aerial Robot Teams in Unknown Environments |
|
| De Bel-Air, Lou | École Polytechnique Fédérale De Lausanne (EPFL) |
| Morando, Luca | New York University |
| Chen, Ruitao | New York University |
| Wang, Keru | New York University |
| Jarvis, Benjamin | École Polytechnique Fédérale De Lausanne |
| Toumieh, Charbel | EPFL |
| Zhou, Yang | New York University |
| Perlin, Ken | New York University |
| Floreano, Dario | Ecole Polytechnique Fédérale De Lausanne (EPFL) |
| Loianno, Giuseppe | UC Berkeley |
Keywords: Aerial Systems: Applications
Abstract: While autonomous multi-robots can achieve safe and coordinated navigation, they often struggle to adapt to unforeseen conditions and to capture operator-driven objectives in unstructured environments. We present a Virtual Reality (VR)-based shared control framework for teams of drones operating in constrained and unknown environments, enabling real-time, user-guided exploration. Our approach integrates a novel user-guided motion-primitive-based planner with an admittance controller, generating dynamically feasible, collision-free trajectories while allowing the operator to flexibly influence team behavior. By leveraging user input, the framework enables the robot team to explore regions of interest that autonomous planners may overlook. The system supports mixed-reality operations with both physical and simulated drones, and implements a bilateral VR-based interface, allowing the operator to guide the robot team via migration points while receiving immediate visual feedback of the team state. Experimental results show that shared control improves obstacle avoidance, maintains inter-agent spacing, and reduces operator effort, demonstrating the feasibility and advantages of immersive, human-in-the-loop swarm navigation.
|
| |
| 17:25-17:35, Paper ThBT1.5 | Add to My Program |
| Energy-Based Auto-Tuning of Velocity Flow Controller for Exoskeleton-User Speed Synchronization |
|
| Tang, Lyndon | University of Waterloo |
| Goswami, Bhavya Giri | University of Waterloo, Canada |
| Ghorbani Siavashani, Atusa | University of Waterloo |
| McPhee, John | University of Waterloo |
| Nasiri, Rezvan | University of Waterloo |
| Arami, Arash | University of Waterloo |
Keywords: Prosthetics and Exoskeletons, Rehabilitation Robotics, Wearable Robotics
Abstract: The Velocity Flow Field (VFF) lower-limb exoskeleton controller is widely applicable for gait rehabilitation because it provides the user with considerable agency over their gait; however, previous studies reported the feeling of "walking through water", and resistance to the user's efforts. In this work, a mathematical explanation for the viscous damping behavior when users deviate from the reference trajectory is presented. The controller was corrected and an adaptation law is proposed that synchronizes the speed gain with the user's current walking speed by minimizing the average mechanical work transferred between the user and exoskeleton per step. Experiments comparing a fixed and adaptive controller with 12 participants walking at 0.4 +/- 0.1 body length/s on a treadmill showed that the adaptive controller tracks changes in walking speed, while reducing the energy absorbed by 0.589 +/- 0.126 J/step compared to the fixed controller at the fastest walking speed. Analysis of changes in muscle effort and interaction torques with a human-exoskeleton interaction portrait showed that for most participants, the adaptive controller at medium and fast speeds substantially reduced user-controller disagreement and increased user agency over the walking motion. These positive results suggest that optimizing the energy supplied per step can serve as an effective coordination mechanism, enabling personalized and real-time adjustments of walking speed between the user and the exoskeleton.
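A sketch of the energy-based adaptation idea, under stated assumptions: estimate the mechanical work exchanged per step from interaction torque and joint velocity, then nudge the speed gain to drive that work toward zero. The update rule, sign convention, and learning rate are illustrative, not the paper's adaptation law:

```python
import numpy as np

def update_speed_gain(gain, tau_int, q_dot, dt, lr=0.05):
    """One gain update per gait step (illustrative sketch).

    tau_int: interaction torque samples over the step (N*m)
    q_dot:   joint velocity samples over the step (rad/s)
    """
    # Mechanical work exchanged during the step: integral of
    # interaction torque times joint velocity.
    work = float(np.sum(tau_int * q_dot) * dt)
    # If the exoskeleton absorbs energy (work < 0), it lags the user:
    # raise the speed gain. If it injects energy, lower the gain.
    return gain - lr * work

gain = 1.0
tau_int = np.full(500, -0.4)        # user pushing against the device
q_dot = np.full(500, 1.2)
gain = update_speed_gain(gain, tau_int, q_dot, dt=0.002)
```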
|
| |
| 17:35-17:45, Paper ThBT1.6 | Add to My Program |
| A Multi-Layer Sim-To-Real Framework for Gaze-Driven Assistive Neck Exoskeletons |
|
| Rubow, Colin | University of Utah |
| Brewer, Eric | University of Utah |
| Bales, Ian | University of Utah |
| Zhang, Haohan | University of Utah |
| Brown, Daniel | University of Utah |
Keywords: Physically Assistive Devices, Virtual Reality and Interfaces, Wearable Robotics
Abstract: Dropped head syndrome, caused by neck muscle weakness from neurological diseases, severely impairs an individual’s ability to support and move their head, causing pain and making everyday tasks challenging. Our long-term goal is to develop an assistive powered neck exoskeleton that restores natural movement. However, predicting a user’s intended head movement remains a key challenge. We leverage virtual reality (VR) to collect coupled eye and head movement data from healthy individuals to train models capable of predicting head movement based solely on eye gaze. We also propose a novel multi-layer controller selection framework, where head control strategies are evaluated across decreasing levels of abstraction—from simulation and VR to a physical neck exoskeleton. This pipeline effectively rejects poor-performing controllers early, identifying two novel gaze-driven models that achieve strong performance when deployed on the physical exoskeleton. Our results reveal that no single controller is universally preferred, highlighting the necessity for personalization in gaze-driven assistive control. Our work demonstrates the utility of VR-based evaluation for accelerating the development of intuitive, safe, and personalized assistive robots.
|
| |
| 17:45-17:55, Paper ThBT1.7 | Add to My Program |
| ProVox: Personalization and Proactive Planning for Situated Human-Robot Collaboration |
|
| Grannen, Jennifer | Stanford University |
| Karamcheti, Siddharth | Stanford University |
| Wulfe, Blake | Stanford University |
| Sadigh, Dorsa | Stanford University |
Keywords: Human-Robot Collaboration, Long term Interaction, Human-Robot Teaming
Abstract: Collaborative robots must quickly adapt to their partner’s intent and preferences to proactively identify helpful actions. This is especially true in situated settings where human partners can continually teach robots new high-level behaviors, visual concepts, and physical skills (e.g., through demonstration), growing the robot’s capabilities as the human-robot pair work together to accomplish diverse tasks. In this work, we argue that robots should be able to infer their partner’s goals from early interactions and use this information to proactively plan behaviors ahead of explicit instructions from the user. Building from the strong commonsense priors and steerability of large language models, we introduce ProVox (“Proactive Voice”), a novel framework that enables robots to efficiently personalize and adapt to individual collaborators. We design a meta-prompting protocol that empowers users to communicate their distinct preferences, intent, and expected robot behaviors ahead of starting a physical interaction. ProVox then uses the personalized prompt to condition a proactive language model task planner that anticipates a user’s intent from the current interaction context and robot capabilities to suggest helpful actions; in doing so, we alleviate user burden, minimizing the amount of time partners spend explicitly instructing and supervising the robot. We evaluate ProVox through user studies grounded in household manipulation tasks (e.g., assembling lunch bags) that measure the efficiency of the collaboration, as well as features such as perceived helpfulness, ease of use, and reliability. Our analysis suggests that both meta-prompting and proactivity are critical, resulting in 38.7% faster task completion times and 31.9% less user burden relative to non-active baselines. Videos and Supplementary Material: https://provox-2025.github.io
|
| |
| 17:55-18:05, Paper ThBT1.8 | Add to My Program |
| Learning-Based Safety-Aware Task Scheduling for Efficient Human-Robot Collaboration |
|
| Faroni, Marco | Politecnico Di Milano |
| Spanò, Alessio | Politecnico Di Milano |
| Zanchettin, Andrea Maria | Politecnico Di Milano |
| Rocco, Paolo | Politecnico Di Milano |
Keywords: Human-Robot Collaboration, Human-Aware Motion Planning, Safety in HRI
Abstract: Ensuring human safety in collaborative robotics can compromise efficiency because traditional safety measures increase robot cycle time when human interaction is frequent. This paper proposes a safety-aware approach to mitigate efficiency losses without assuming prior knowledge of safety logic. Using a deep-learning model, the robot learns the relationship between system state and safety-induced speed reductions based on execution data. Our framework does not explicitly predict human motions but directly models the interaction effects on robot speed, simplifying implementation and enhancing generalizability to different safety logics. At runtime, the learned model optimizes task selection to minimize cycle time while adhering to safety requirements. Experiments on a pick-and-packaging scenario demonstrated significant reductions in cycle times.
|
| |
| 18:05-18:15, Paper ThBT1.9 | Add to My Program |
| A Kinesthetic Teaching Framework for Tasks with Contact Transitions and Time-Optimized Execution |
|
| Thelenberg, Nikolas | TU Wien |
| Ott, Christian | TU Wien |
Keywords: Physical Human-Robot Interaction, Compliance and Impedance Control, Motion and Path Planning
Abstract: In kinesthetic teaching, a robot is manually guided by a human operator to demonstrate a task. Most methods focus on replaying the recorded motion, but are agnostic to contact transitions, which can be critical when interacting with rigid environments. To overcome this limitation, we propose a framework that allows motions to be taught in free space as well as in contact, while preventing fast unintended contact transitions. This is accomplished by exploiting a projection-based unilateral damping force that increases close to contact. We derive an explicit analytical expression for the damping characteristics to ensure a safe stop before contact when no further forces act on the robot. Furthermore, after teaching, the recorded motion data is utilized to generate a time-optimized trajectory based on convex optimization, in which the contact transitions are explicitly considered. We validated our framework in experiments with a torque-controlled manipulator.
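The damping behavior described above can be illustrated in one dimension. The sketch below assumes an illustrative damping law d(x) = d0 / x (the paper derives its own characteristic) and shows the guided mass stopping before contact once external forces vanish.

```python
import numpy as np

# 1-D sketch of projection-based unilateral damping: a guided mass approaches
# a contact at x = 0; damping acts only against approach motion and grows as
# the gap shrinks. The law d(x) = d0 / x is an illustrative assumption.
m, d0, dt = 1.0, 0.5, 1e-3
x, v = 0.10, -0.3          # 10 cm gap, approaching at 0.3 m/s, no human force

for _ in range(20000):
    gap = max(x, 1e-4)
    f_damp = -(d0 / gap) * v if v < 0 else 0.0  # unilateral: only on approach
    v += (f_damp / m) * dt
    x += v * dt
    if abs(v) < 1e-5:
        break

print(f"stopped at gap {x * 1000:.2f} mm with velocity {v:.2e} m/s")
```

With these constants the analytical stop point is x0 * exp(-m * |v0| / d0), roughly a 55 mm gap, which the simulation reproduces: the mass halts before reaching the contact.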
|
| |
| ThBT2 Regular Session, Hall A3 |
Add to My Program |
| Robot Learning II |
|
| |
| Chair: Loquercio, Antonio | University of Pennsylvania |
| Co-Chair: Morimoto, Jun | Kyoto University |
| |
| 16:45-16:55, Paper ThBT2.1 | Add to My Program |
| SVR-GS: Spatially Variant Regularization for Probabilistic Masks in 3D Gaussian Splatting |
|
| Taghipour, Ashkan | University of Western Australia |
| Naghshin, Vahid | Dolby Laboratories |
| Southwell, Benjamin John | Dolby Laboratories |
| Boussaid, Farid | University of Western Australia |
| Laga, Hamid | Murdoch University |
| Bennamoun, Mohammed | University of Western Australia |
Keywords: Deep Learning Methods, Deep Learning for Visual Perception, RGB-D Perception
Abstract: 3D Gaussian Splatting (3DGS) enables fast, high-quality novel view synthesis but relies on densification followed by pruning to optimize the number of Gaussians. Existing mask-based pruning, such as MaskGS, regularizes the global mean of the mask, which is misaligned with the local per-pixel (per-ray) reconstruction loss that determines image quality along individual camera rays. This paper introduces SVR-GS, a spatially variant regularizer that renders a per-pixel spatial mask from each Gaussian’s effective contribution along the ray, thereby applying sparsity pressure where it matters: on low-importance Gaussians. We explore three spatial-mask aggregation strategies, implement them in CUDA, and conduct a gradient analysis to motivate our final design. Extensive experiments on Tanks&Temples, Deep Blending, and Mip-NeRF360 datasets demonstrate that, on average across the three datasets, the proposed SVR-GS reduces the number of Gaussians by 1.79× compared to MaskGS and 5.63× compared to 3DGS, while incurring only 0.50 dB and 0.40 dB PSNR drops, respectively. These gains translate into significantly smaller, faster, and more memory-efficient models, making them well-suited for real-time applications such as robotics, AR/VR, and mobile perception. Additional materials are available on our project page: https://ashkantaghipour.github.io/svrgs/.
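To make the contrast with global mask regularization concrete, here is a minimal sketch: the per-Gaussian contribution matrix, the aggregation rule, and all sizes are illustrative assumptions, not the paper's CUDA implementation.

```python
import torch

# Sketch of a spatially variant mask regularizer: instead of penalizing the
# global mean of per-Gaussian masks (MaskGS-style), weight each mask by that
# Gaussian's contribution along each ray, so the penalty is resolved per
# pixel. The contribution matrix and aggregation are illustrative stand-ins.
torch.manual_seed(0)
R, G = 1024, 500                         # rays (pixels), Gaussians
contrib = torch.rand(R, G)               # stand-in per-ray blending weights
contrib = contrib / contrib.sum(dim=1, keepdim=True)
mask_logits = torch.randn(G, requires_grad=True)

m = torch.sigmoid(mask_logits)           # soft keep-probability per Gaussian

global_reg = m.mean()                              # global-mean term
svr_reg = (contrib * m[None, :]).sum(1).mean()     # per-ray weighted term

svr_reg.backward()
# Each mask's gradient now depends on its aggregate ray contribution rather
# than being uniform across all Gaussians.
print(global_reg.item(), svr_reg.item(), mask_logits.grad.abs().mean().item())
```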
|
| |
| 16:55-17:05, Paper ThBT2.2 | Add to My Program |
| Dense-Jump Flow Matching with Non-Uniform Time Scheduling for Robotic Policies: Mitigating Multi-Step Inference Degradation |
|
| Chen, Zidong | Imperial College London |
| Guo, Zhihao | Manchester Metropolitan University |
| Wang, Peng | University of Surrey |
| Egbe, ThankGod Itua | The Manchester Metropolitan University |
| Lyu, Yan | Southeast University |
| Qian, Chenghao | University of Leeds |
Keywords: AI-Enabled Robotics, Model Learning for Control, Deep Learning in Grasping and Manipulation
Abstract: Flow matching has emerged as a competitive framework for learning high-quality generative policies in robotics; however, we find that generalisation arises and saturates early along the flow trajectory, in accordance with recent findings in the literature. We further observe that increasing the number of Euler integration steps during inference counter-intuitively and universally degrades policy performance. We attribute this to (i) additional, uniformly spaced integration steps oversampling the late-time region, thereby constraining actions towards the training trajectories and reducing generalisation; and (ii) the learned velocity field becoming non-Lipschitz as integration time approaches 1, causing instability. To address these issues, we propose a novel policy that utilises non-uniform time scheduling (e.g., U-shaped) during training, which emphasises both early and late temporal stages to regularise policy training, and a dense-jump integration schedule at inference, which uses a single-step integration to replace the multi-step integration beyond a jump point, to avoid unstable areas around 1. Essentially, our policy is an efficient one-step learner that still pushes forward performance through multi-step integration, yielding up to 23.7% performance gains over state-of-the-art baselines across diverse robotic tasks. Code is anonymously open-sourced at https://github.com/DenseJumpFM/DenseJump_FlowMatching
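Both ingredients can be sketched on a toy flow-matching problem. In the sketch below, Beta(0.5, 0.5) stands in for a U-shaped training-time schedule, and inference runs dense Euler steps up to a jump point followed by a single step to t = 1; the network size, jump point, and data are assumptions.

```python
import torch, torch.nn as nn

# Toy 2-D flow matching with (1) U-shaped time sampling at training, here via
# Beta(0.5, 0.5), and (2) "dense-jump" inference: dense Euler steps up to a
# jump point, then one single step to t = 1 to skip the unstable late region.
torch.manual_seed(0)
vel = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(vel.parameters(), 1e-3)
beta = torch.distributions.Beta(0.5, 0.5)        # U-shaped density on [0, 1]

for _ in range(500):                             # training loop
    x1 = torch.randn(256, 2) * 0.1 + torch.tensor([2.0, 0.0])  # toy targets
    x0 = torch.randn(256, 2)                     # noise source
    t = beta.sample((256, 1))
    xt = (1 - t) * x0 + t * x1                   # linear interpolation path
    loss = ((vel(torch.cat([xt, t], 1)) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def sample(n_steps=16, t_jump=0.8):              # dense-jump integration
    x, t = torch.randn(128, 2), 0.0
    dt = t_jump / n_steps
    for _ in range(n_steps):                     # dense phase on [0, t_jump]
        x = x + dt * vel(torch.cat([x, torch.full((128, 1), t)], 1))
        t += dt
    v = vel(torch.cat([x, torch.full((128, 1), t)], 1))
    return x + (1.0 - t) * v                     # single jump to t = 1

print(sample().mean(0))                          # should approach (2, 0)
```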
|
| |
| 17:05-17:15, Paper ThBT2.3 | Add to My Program |
| SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning |
|
| Zhang, Yu | The University of Texas at Austin |
| Xie, Yuqi | NVIDIA |
| Liu, Huihan | The University of Texas at Austin |
| Shah, Rutav | The University of Texas at Austin |
| Wan, Michael | NVIDIA |
| Fan, Linxi | NVIDIA |
| Zhu, Yuke | The University of Texas at Austin |
Keywords: Big Data in Robotics and Automation, Imitation Learning, Learning from Demonstration
Abstract: Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations. However, large-scale datasets used for policy training often introduce substantial variability in quality, which can negatively impact performance. As a result, automatically curating datasets by filtering low-quality samples to improve quality becomes essential. Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity, such as the dataset or trajectory level, failing to account for the quality of individual state-action pairs. To address this, we introduce SCIZOR, the first self-supervised transition-level curation framework that requires no annotations and scales to large-scale datasets to improve the performance of imitation learning policies and modern Vision-Language-Action (VLA) models. SCIZOR targets two complementary sources of low-quality data: suboptimal data, which hinders learning with undesirable actions, and redundant data, which dilutes training with repetitive patterns. SCIZOR leverages a self-supervised task progress predictor for suboptimal data to remove samples lacking task progression, and a deduplication module operating on joint state-action representation for samples with redundant patterns. Empirically, we show that SCIZOR enables imitation learning policies and modern VLA models to achieve higher performance with less data, yielding an average improvement of 15.4% across multiple benchmarks. More information is available at: https://scizor-icra2026.github.io
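A minimal sketch of transition-level curation, with random stand-ins for the progress predictor and the state-action embeddings, might look as follows; the thresholds are illustrative.

```python
import numpy as np

# Sketch of transition-level curation in the spirit of SCIZOR: drop
# transitions whose predicted task progress does not advance (suboptimal) and
# near-duplicate state-action embeddings (redundant). Scores, embeddings, and
# thresholds are illustrative stand-ins for the learned modules.
rng = np.random.default_rng(0)
N = 1000
progress = rng.random(N)                      # stand-in predictor output
next_progress = progress + rng.normal(0.02, 0.05, N)
emb = rng.normal(size=(N, 32))                # stand-in joint embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

keep = next_progress - progress > 0.0         # suboptimality filter

kept_idx, bank = [], []
for i in np.flatnonzero(keep):                # greedy deduplication
    if bank and np.max(np.asarray(bank) @ emb[i]) > 0.95:
        continue                              # too similar to a kept sample
    bank.append(emb[i]); kept_idx.append(i)

print(f"kept {len(kept_idx)} / {N} transitions")
```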
|
| |
| 17:15-17:25, Paper ThBT2.4 | Add to My Program |
| Deep Sensorimotor Control by Imitating Predictive Models of Human Motion |
|
| Singh, Himanshu Gaurav | University of California Berkeley |
| Abbeel, Pieter | UC Berkeley |
| Malik, Jitendra | UC Berkeley |
| Loquercio, Antonio | University of Pennsylvania |
Keywords: Reinforcement Learning, Dexterous Manipulation, Deep Learning in Grasping and Manipulation
Abstract: As the embodiment gap between a robot and a human narrows, new opportunities arise to leverage datasets of humans interacting with their surroundings for robot learning. We propose a novel technique for training sensorimotor policies with reinforcement learning by imitating predictive models of human motions. Our key insight is that the motion of keypoints on human-inspired robot end-effectors closely mirrors the motion of corresponding human body keypoints. This enables us to use a model trained to predict future motion on human data zero-shot on robot data. We train sensorimotor policies to track the predictions of such a model, conditioned on a history of past robot states, while optimizing a relatively sparse task reward. This approach entirely bypasses gradient-based kinematic retargeting and adversarial losses, which limit existing methods from fully leveraging the scale and diversity of modern human-scene interaction datasets. Empirically, we find that our approach can work across robots and tasks, outperforming existing baselines by a large margin. In addition, we find that tracking a human motion model can substitute for carefully designed dense rewards and curricula in manipulation tasks. Code, data and qualitative results available at https://dynamicsprediction.space
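The reward structure this implies can be written compactly. The sketch below combines a sparse task reward with a dense keypoint-tracking term; the exponential kernel and weights are assumptions, not the paper's exact shaping.

```python
import numpy as np

# Sketch of the reward structure: a sparse task reward plus a dense term for
# tracking the keypoint motion predicted by a human-motion model evaluated
# zero-shot on robot keypoints. Kernel and weights are illustrative.
def step_reward(robot_kpts, predicted_kpts, task_success,
                w_track=1.0, w_task=10.0, sigma=0.05):
    err = np.linalg.norm(robot_kpts - predicted_kpts, axis=-1).mean()
    r_track = np.exp(-(err / sigma) ** 2)   # 1 when the prediction is tracked
    return w_track * r_track + w_task * float(task_success)

kpts = np.zeros((5, 3))                     # 5 end-effector keypoints
pred = kpts + 0.01                          # prediction 1 cm away per axis
print(step_reward(kpts, pred, task_success=False))
```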
|
| |
| 17:25-17:35, Paper ThBT2.5 | Add to My Program |
| ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning |
|
| Tian, Yufeng | Shanghai Qi Zhi Institute |
| Cheng, Shuiqi | The University of Hong Kong |
| Wei, Tianming | Tsinghua University |
| Zhou, Tianxing | Tsinghua University |
| Zhang, Yuanhang | Carnegie Mellon University |
| Liu, Zixian | Tsinghua University |
| Han, Qianwei | Shanghai Qi Zhi Institute |
| Yuan, Zhecheng | Tsinghua University |
| Xu, Huazhe | Tsinghua University |
Keywords: Representation Learning, Machine Learning for Robot Control
Abstract: Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on the alignment of visual and tactile features, and the integration mechanism tends to be direct concatenation. Consequently, they struggle to cope effectively with occluded scenarios, because the inherent complementary nature of the two modalities is neglected and their alignment is not fully exploited, limiting the potential for real-world deployment. In this paper, we present ViTaS, a simple yet effective framework that incorporates both visual and tactile information to guide the behavior of an agent. We introduce Soft Fusion Contrastive Learning, an advanced version of the conventional contrastive learning method, together with a CVAE module, to exploit the alignment and complementarity within visuo-tactile representations. We demonstrate the effectiveness of our method in 12 simulated and 3 real-world environments, and our experiments show that ViTaS significantly outperforms existing baselines. Project page: https://skyrainwind.github.io/ViTaS/index.html.
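For intuition, the sketch below contrasts a standard one-hot InfoNCE objective with a softened variant over visuo-tactile embedding pairs; the specific softening rule is an illustrative assumption, not the exact ViTaS loss.

```python
import torch, torch.nn.functional as F

# Sketch of a "soft" visuo-tactile contrastive objective: standard InfoNCE
# aligns visual embedding i only with tactile embedding i (one-hot targets);
# a softened variant spreads the target mass over similar pairs. The mixing
# rule below is an illustrative assumption.
torch.manual_seed(0)
B, D = 32, 64
z_vis = F.normalize(torch.randn(B, D), dim=1)   # visual embeddings
z_tac = F.normalize(torch.randn(B, D), dim=1)   # tactile embeddings

logits = z_vis @ z_tac.T / 0.07                 # temperature-scaled sims

hard_targets = torch.eye(B)
soft_targets = 0.9 * hard_targets + 0.1 * F.softmax(z_tac @ z_tac.T / 0.07, 1)

log_probs = F.log_softmax(logits, dim=1)
loss_hard = -(hard_targets * log_probs).sum(1).mean()
loss_soft = -(soft_targets * log_probs).sum(1).mean()
print(loss_hard.item(), loss_soft.item())
```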
|
| |
| 17:35-17:45, Paper ThBT2.6 | Add to My Program |
| TADPO: Reinforcement Learning Goes Off-Road |
|
| Wu, Zhouchonghao | Carnegie Mellon University |
| Song, Raymond | Carnegie Mellon University |
| Mundheda, Vedant | Carnegie Mellon University |
| Navarro-Serment, Luis E. | Carnegie Mellon University |
| Schoenborn, Christof | Carnegie Mellon University |
| Schneider, Jeff | Carnegie Mellon University |
Keywords: Reinforcement Learning, Field Robots, Machine Learning for Robot Control
Abstract: Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.
|
| |
| 17:45-17:55, Paper ThBT2.7 | Add to My Program |
| Learning Problem Decomposition for Efficient Sequential Multi-Object Manipulation Planning |
|
| Zhang, Yan | Idiap Research Institute; EPFL |
| Xue, Teng | Idiap Research Institute and EPFL |
| Razmjoo, Amirreza | Idiap Research Institute |
| Calinon, Sylvain | Idiap Research Institute |
Keywords: Task and Motion Planning, Learning from Demonstration
Abstract: We present an efficient task and motion replanning approach for sequential multi-object manipulation in dynamic environments. Conventional Task And Motion Planning (TAMP) solvers experience an exponential increase in planning time as the planning horizon and number of objects grow, limiting their applicability in real-world scenarios. To address this, we propose learning problem decompositions from demonstrations to accelerate TAMP solvers. Our approach consists of three key components: goal decomposition learning, computational distance learning, and object reduction. Goal decomposition identifies the necessary sequences of states that the system must pass through before reaching the final goal, treating them as subgoal sequences. Computational distance learning predicts the computational complexity between two states, enabling the system to identify the temporally closest subgoal from a disturbed state. Object reduction minimizes the set of active objects considered during replanning, further improving efficiency. We evaluate our approach on three benchmarks, demonstrating its effectiveness in improving replanning efficiency for sequential multi-object manipulation tasks in dynamic environments.
|
| |
| 17:55-18:05, Paper ThBT2.8 | Add to My Program |
| Motion Generation for Modular Robots Using Hierarchical Policies |
|
| Minamikawa, Kenjiro | Graduate School of Informatics, Kyoto University |
| Yamamori, Satoshi | Kyoto University |
| Yagi, Satoshi | Kyoto University |
| Takeda, Sho | Kyoto University |
| Yoshida, Kazuya | Tohoku University |
| Morimoto, Jun | Kyoto University |
Keywords: Reinforcement Learning, Legged Robots, Cellular and Modular Robots
Abstract: Modular robots can be reconfigured into multiple morphologies, offering high adaptability for diverse tasks. However, reinforcement learning (RL)-based motion generation typically requires separate policy training for each morphology, and end-to-end training often fails to exploit module-specific roles. This paper proposes a hierarchical policy framework that explicitly separates control at the module level, learning reusable motion skills for each module and coordinating them with an upper-level policy for whole-body control. A single lower-level reaching policy, shared across all arm modules, is trained once and reused across morphologies, ensuring that module-specific functions are preserved even as complexity increases. The method is evaluated on the modular robot MoonBot in simulation, demonstrating scalable control of diverse morphologies and improved learning efficiency and interpretability over non-hierarchical baselines.
|
| |
| 18:05-18:15, Paper ThBT2.9 | Add to My Program |
| Factorizing Diffusion Policies for Observation Modality Prioritization |
|
| Patil, Omkar Deepak | Arizona State University |
| Rath, Prabin Kumar | Arizona State University |
| Pangaonkar, Kartikay Milind | Arizona State University |
| Rosen, Eric | Robotics and AI Institute |
| Gopalan, Nakul | Arizona State University |
Keywords: Learning from Demonstration, Probability and Statistical Methods, Imitation Learning
Abstract: Diffusion models have been extensively leveraged for learning robot skills from demonstrations. These policies are conditioned on several observational modalities such as proprioception, vision and tactile sensing. However, observational modalities have varying levels of influence for different tasks, which standard diffusion policies fail to capture. In this work, we propose 'Factorized Diffusion Policies', abbreviated as FDP, a novel policy formulation that enables observational modalities to have differing influence on the action diffusion process by design. This results in learning policies where certain observation modalities can be prioritized over others, such as vision > tactile or proprioception > vision. FDP achieves modality prioritization by factorizing the observational conditioning of the diffusion process, resulting in more performant and robust policies. Our factored approach shows strong performance improvements in low-data regimes, with a 15% absolute improvement in success rate on several simulated benchmarks when compared to a standard diffusion policy that jointly conditions on all input modalities. Moreover, our benchmark and real-world experiments show that factored policies are naturally more robust, with a 40% higher absolute success rate across several visuomotor tasks under distribution shifts such as visual distractors or camera occlusions, where existing diffusion policies fail catastrophically. FDP thus offers a safer and more robust alternative to standard diffusion policies for real-world deployment. Videos are available at https://fdp-policy.github.io/fdp-policy/.
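One plausible reading of factorized conditioning is a guidance-style composition of modality-specific denoising directions, sketched below; the composition rule and architecture are assumptions, not the authors' formulation.

```python
import torch, torch.nn as nn

# Sketch of modality-prioritized conditioning for a diffusion policy: run the
# noise predictor with different observation subsets and compose the outputs
# guidance-style, so one modality can dominate the denoising direction. This
# composition rule is an illustrative assumption about the factorization.
torch.manual_seed(0)

class Denoiser(nn.Module):
    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim + 1, 128),
                                 nn.SiLU(), nn.Linear(128, act_dim))
    def forward(self, a_t, t, obs):          # obs may have modalities zeroed
        return self.net(torch.cat([a_t, t, obs], dim=-1))

eps_net = Denoiser()
a_t, t = torch.randn(1, 7), torch.full((1, 1), 0.5)
obs_full = torch.randn(1, 32)
obs_vision = obs_full.clone()
obs_vision[:, 16:] = 0                       # drop the tactile half

eps_v = eps_net(a_t, t, obs_vision)          # vision-conditioned direction
eps_vt = eps_net(a_t, t, obs_full)           # vision+tactile direction
w = 0.5                                      # tactile influence weight
eps = eps_v + w * (eps_vt - eps_v)           # vision > tactile composition
print(eps.shape)
```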
|
| |
| ThBT3 Regular Session, Lehar 1-4 |
Add to My Program |
| Mechanisms, Design and Control |
|
| |
| Co-Chair: Borras Sol, Julia | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
| |
| 16:45-16:55, Paper ThBT3.1 | Add to My Program |
| End-To-End Low-Level Neural Control of an Industrial-Grade 6D Magnetic Levitation System |
|
| Hartmann, Philipp | Bielefeld University |
| Stranghöner, Jannick | Universität Bielefeld |
| Neumann, Klaus | Bielefeld University / Fraunhofer IOSB-INA |
Keywords: Neural and Fuzzy Control, Robust/Adaptive Control, Machine Learning for Robot Control
Abstract: Magnetic levitation is poised to revolutionize industrial automation by integrating flexible in-machine product transport and seamless manipulation. It is expected to become the standard drive technology for automated manufacturing. However, controlling such systems is inherently challenging due to their complex, unstable dynamics. Traditional control approaches, which rely on hand-crafted control engineering, typically yield robust but conservative solutions, with their performance closely tied to the expertise of the engineering team. In contrast, learning-based neural control presents a promising alternative. This paper presents the first neural controller for 6D magnetic levitation. Trained end-to-end on interaction data from a proprietary controller, it directly maps raw sensor data and 6D reference poses to coil current commands. The neural controller can effectively generalize to previously unseen situations while maintaining accurate and robust control. These results underscore the practical feasibility of learning-based neural control in complex physical systems and suggest a future where such a paradigm could enhance or even substitute traditional engineering approaches in demanding real-world applications. The trained neural controller, source code, and demonstration videos are publicly available at https://sites.google.com/view/neural-maglev.
|
| |
| 16:55-17:05, Paper ThBT3.2 | Add to My Program |
| ByteWrist: A Parallel Robotic Wrist Enabling Flexible and Anthropomorphic Motion for Confined Spaces |
|
| Tian, Jiawen | LiberAI |
| Qiao, Jingchao | ByteDance Seed Robotics |
| Huang, Liqun | Seed-Robotics, ByteDance |
| Cui, Zhongren | ByteDance |
| Ma, Xiao | Dyson |
| Xu, Jiafeng | ByteDance |
| Ren, Zeyu | ByteDance |
Keywords: Mechanism Design, Motion Control, Dual Arm Manipulation
Abstract: This paper introduces ByteWrist, a novel highly-flexible and anthropomorphic parallel wrist for robotic manipulation. ByteWrist addresses the critical limitations of existing serial and parallel wrists in narrow-space operations through a compact three-stage parallel drive mechanism integrated with arc-shaped end linkages. The design achieves precise RPY (Roll-Pitch-Yaw) motion while maintaining exceptional compactness, making it particularly suitable for complex unstructured environments such as home services, medical assistance, and precision assembly. The key innovations include: (1) nested three-stage motor-driven linkages that minimize volume while enabling independent multi-DOF control, (2) arc-shaped end linkages that optimize force transmission and expand motion range, and (3) a central supporting ball functioning as a spherical joint that enhances structural stiffness without compromising flexibility. We also present comprehensive kinematic modeling, including forward/inverse kinematics and a numerical Jacobian solution for precise control. Empirically, we observe that ByteWrist demonstrates strong performance in narrow-space maneuverability and dual-arm cooperative manipulation tasks, outperforming Kinova-based systems. Results indicate significant improvements in compactness, efficiency, and stiffness compared to traditional designs, establishing ByteWrist as a promising solution for next-generation robotic manipulation in constrained environments.
|
| |
| 17:05-17:15, Paper ThBT3.3 | Add to My Program |
| Design and Implementation of Elastic Structure Preserving Vibration Suppression Control for Flexible Link Robots Using IMU Measurements |
|
| Kitzinger, Alexander | Johannes Kepler University Linz |
| Gattringer, Hubert | Johannes Kepler University Linz |
| Müller, Andreas | Johannes Kepler University Linz |
Keywords: Motion Control, Compliant Joints and Mechanisms, Industrial Robots
Abstract: Elastic lightweight manipulators offer multiple benefits but suffer from increased structural flexibility, making them susceptible to vibrations and thus requiring dedicated control concepts for vibration suppression. Based on a lumped element model formulation, a method called elastic structure preserving (ESP) control is used for additional damping injection, while using standard PD motor position control. The control method is applied for the first time to a flexible link robot by combining it with a link-side IMU-based observer. It is demonstrated in an industrial context using a standard controller setup, enabling straightforward implementation on existing industrial robots. The novel ESP method is further compared to a flatness-based control approach and to standard PD motor control. Particular aspects of controller tuning are discussed. Both theoretical analysis and experimental evaluations are conducted to address trajectory tracking behavior, disturbance rejection, and robustness to model parameter uncertainties. Results based on end effector accelerations show that ESP achieves superior vibration damping, demonstrating its effectiveness for industrial lightweight robots.
|
| |
| 17:15-17:25, Paper ThBT3.4 | Add to My Program |
| When Rolling Gets Weird: A Curved-Link Tensegrity Robot for Non-Intuitive Behavior |
|
| Ervin, Lauren | University of Alabama |
| Bezawada, Harish | The University of Alabama |
| Vikas, Vishesh | University of Alabama |
Keywords: Mechanism Design, Kinematics, Field Robots
Abstract: Conventional mobile tensegrity robots constructed with straight links offer mobility at the cost of locomotion speed. While spherical robots provide highly effective rolling behavior, they often lack the stability required for navigating unstructured terrain common in many space exploration environments. This research presents a solution with a semi-circular, curved-link tensegrity robot that strikes a balance between efficient rolling locomotion and controlled stability, enabled by discontinuities present at the arc endpoints. Building upon an existing geometric static modeling framework [1], this work presents the system design of an improved Tensegrity eXploratory Robot 2 (TeXploR2). Internal shifting masses instantaneously roll along each curved-link, dynamically altering the two points of contact with the ground plane. Simulations of quasistatic, piecewise continuous locomotion sequences reveal new insights into the positional displacement between inertial and body frames. Non-intuitive rolling behaviors are identified and experimentally validated using a tetherless prototype, demonstrating successful dynamic locomotion. A preliminary impact test highlights the tensegrity structure’s inherent shock absorption capabilities and conformability. Future work will focus on finalizing a dynamic model that is experimentally validated with extended testing in real-world environments as well as further refinement of the prototype to incorporate additional curved-links and subsequent ground contact points for increased controllability.
|
| |
| 17:25-17:35, Paper ThBT3.5 | Add to My Program |
| 3D Printable Soft Liquid Metal Sensors for Delicate Manipulation Tasks |
|
| Liow, Lois | CSIRO |
| Milford, Jonty | Flinders University |
| Uygun, Emre | CSIRO |
| Farinha, Andre | CSIRO |
| Viswanathan, VinothKumar | CSIRO |
| Pinskier, Joshua | CSIRO |
| Howard, David | CSIRO |
Keywords: Soft Robot Applications, Soft Sensors and Actuators, Grasping
Abstract: Robotics and automation are key enablers to increase throughput in ongoing conservation efforts across various threatened ecosystems. Cataloguing, digitisation, husbandry, and similar activities require the ability to interact with delicate, fragile samples without damaging them. Additionally, learning-based solutions to these tasks require the ability to safely acquire data to train manipulation policies, e.g., reinforcement learning. To address these twin needs, we introduce a novel method to print free-form, highly sensorised soft ‘physical twins’. We present an automated design workflow to create complex and customisable 3D soft sensing structures on demand from 3D scans or models. Compared to the state of the art, our soft liquid metal sensors faithfully recreate complex natural geometries and display excellent sensing properties suitable for validating performance in delicate manipulation tasks. We demonstrate the application of our physical twins as 'sensing corals': high-fidelity, 3D printed replicas of scanned corals that eliminate the need for live coral experimentation, whilst increasing data quality, offering an ethical and scalable pathway for advancing autonomous coral handling and soft manipulation broadly. Through extensive bench-top manipulation and underwater grasping experiments, we show that our sensing coral is able to detect grasps under 0.5 N, effectively capturing the delicate interactions and light contact forces required for coral handling. Finally, we showcase the value of our physical twins across two demonstrations: (i) automated coral labelling for lab identification and (ii) robotic coral aquaculture. Sensing physical twins such as ours can provide richer grasping feedback than conventional sensors, enabling experimental validation prior to deployment in handling fragile and delicate items.
|
| |
| 17:35-17:45, Paper ThBT3.6 | Add to My Program |
| Whole-Body Safe Control of Robotic Systems with Koopman Neural Dynamics |
|
| Jung, Sebin | Carnegie Mellon University |
| Abuduweili, Abulikemu | Apple |
| Li, Jiaxing | Carnegie Mellon University |
| Liu, Changliu | Carnegie Mellon University |
Keywords: Robot Safety, Model Learning for Control, Deep Learning Methods
Abstract: Controlling robots with strongly nonlinear, high-dimensional dynamics remains challenging, as direct nonlinear optimization with safety constraints is often intractable in real time. The Koopman operator offers a way to represent nonlinear systems linearly in a lifted space, enabling the use of efficient linear control. We propose a data-driven framework that learns a Koopman embedding and operator from data, and integrates the resulting linear model with the Safe Set Algorithm (SSA). This allows the tracking and safety constraints to be solved in a single quadratic program (QP), ensuring feasibility and optimality without a separate safety filter. We validate the method on a Kinova Gen3 manipulator and a Go2 quadruped, showing accurate tracking and obstacle avoidance.
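The single-QP structure can be shown with a toy lifted-linear model. In the sketch below, random matrices stand in for the learned Koopman embedding and operator, and the one-constraint QP is solved by closed-form halfspace projection.

```python
import numpy as np

# Sketch of the single-QP structure: given a learned lifted-linear model
# z' = A z + B u, pick the control closest to a tracking reference subject to
# an affine safety constraint on the next lifted state (Safe Set Algorithm
# style, phi(z') <= 0 with affine phi). Matrices are random stand-ins for the
# learned Koopman embedding and operator.
rng = np.random.default_rng(0)
nz, nu = 8, 3
A, B = rng.normal(0, 0.1, (nz, nz)), rng.normal(0, 0.3, (nz, nu))
z = rng.normal(size=nz)
u_ref = np.array([0.5, -0.2, 0.1])           # tracking control

c = rng.normal(size=nz)                      # safety index gradient
a = B.T @ c                                  # constraint: a . u <= b
b = -c @ (A @ z) - 0.1                       # margin on phi(z')

# QP: min ||u - u_ref||^2 s.t. a.u <= b  ->  halfspace projection in closed
# form, since the objective is an identity-weighted quadratic.
u = u_ref.copy()
if a @ u > b:
    u = u - ((a @ u - b) / (a @ a)) * a

print("u =", u, "constraint satisfied:", a @ u <= b + 1e-9)
```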
|
| |
| 17:45-17:55, Paper ThBT3.7 | Add to My Program |
| Grip As Needed, Glide on Demand: Ultrasonic Lubrication for Robotic Locomotion |
|
| Atalla, Mostafa A. | Delft University of Technology |
| Cumming, Jack | University of Twente |
| van Bemmel, Daan | Delft University of Technology |
| Breedveld, Paul | TU Delft |
| Wiertlewski, Michael | TU Delft |
| Sakes, Aimee | TU Delft |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Biologically-Inspired Robots
Abstract: Friction is the essential mediator of terrestrial locomotion, yet in robotic systems it is almost always treated as a passive property fixed by surface materials and conditions. Here, we introduce ultrasonic lubrication as a method to actively control friction in robotic locomotion. By exciting resonant structures at ultrasonic frequencies, contact interfaces can dynamically switch between "grip" and "glide" states, enabling locomotion. We developed two friction control modules: a cylindrical design for lumen-like environments and a flat-plate design for external surfaces, and integrated them into bio-inspired systems modeled after inchworm and wasp ovipositor locomotion. Both systems achieved bidirectional locomotion with nearly perfect locomotion efficiencies that exceeded 90%. Friction characterization experiments further demonstrated substantial friction reduction across various surfaces, including rigid, soft, granular, and biological tissue interfaces, under dry and wet conditions, and on surfaces with different levels of roughness, confirming the versatility of ultrasonic lubrication for locomotion applications. These findings establish ultrasonic lubrication as a viable active friction control mechanism for robotic locomotion, with the potential to reduce design complexity and improve the efficiency of robotic locomotion systems.
|
| |
| 17:55-18:05, Paper ThBT3.8 | Add to My Program |
| Robots That Redesign Themselves through Kinematic Self-Destruction |
|
| Yu, Chen | Northwestern University |
| Kriegman, Sam | Northwestern University |
Keywords: Evolutionary Robotics, Bioinspired Robot Learning, Cellular and Modular Robots
Abstract: Every robot built to date was predesigned by an external process, prior to deployment. Here we show a robot that actively participates in its own design during its lifetime. Starting from a randomly assembled body, and using only proprioceptive feedback, the robot dynamically "sculpts" itself into a new design through kinematic self-destruction: identifying redundant links within its body that inhibit its locomotion, and then thrashing those links against the surface until they break at the joint and fall off the body. It does so using a single autoregressive sequence model, a universal controller that learns in simulation when and how to simplify a robot's body through self-destruction and then adaptively controls the reduced morphology. The optimized policy successfully transfers to reality and generalizes to previously unseen kinematic trees, generating forward locomotion that is more effective than otherwise equivalent policies that randomly remove links or cannot remove any. This suggests that self-designing robots may be more successful than predesigned robots in some cases, and that kinematic self-destruction, though reductive and irreversible, could provide a general adaptive strategy for a wide range of robots.
|
| |
| 18:05-18:15, Paper ThBT3.9 | Add to My Program |
| A Gripper for Flap Separation and Opening of Sealed Bags |
|
| Foix, Sergi | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
| Oriol Lladó, Jaume | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
| Torras, Carme | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
| Borras Sol, Julia | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
Keywords: Grippers and Other End-Effectors, Physically Assistive Devices, Contact Modeling
Abstract: Separating thin, flexible layers that must be individually grasped is a common but challenging manipulation primitive for most off-the-shelf grippers. A prominent example arises in clinical settings: the opening of sterile flat pouches for the preparation of the operating room, where the first step is to separate and grasp the flaps. We present a novel gripper design and opening strategy that enables reliable flap separation and robust seal opening. This capability addresses a high-volume repetitive hospital procedure in which nurses manually open up to 240 bags per shift, a physically demanding task linked to musculoskeletal injuries. Our design combines an active dented-roller fingertip with compliant fingers that exploit environmental constraints to robustly grasp thin flexible flaps. Experiments demonstrate that the proposed gripper reliably grasps and separates sealed bag flaps and other thin-layered materials from the hospital, with the applied normal force being the most sensitive variable affecting performance. When two copies of the gripper grasp both flaps, the system withstands the forces needed to open the seals robustly. To our knowledge, this is one of the first demonstrations of robotic assistance to automate this repetitive, low-value, but critical hospital task.
|
| |
| ThBT4 Regular Session, Strauss 1-2 |
Add to My Program |
| Applications |
|
| |
| |
| 16:45-16:55, Paper ThBT4.1 | Add to My Program |
| Localized Graph-Based Neural Dynamics Models for Terrain Manipulation |
|
| Liu, Chaoqi | University of Illinois at Urbana-Champaign |
| Li, Yunzhu | Columbia University |
| Hauser, Kris | University of Illinois at Urbana-Champaign |
Keywords: Robotics and Automation in Construction, Representation Learning, Simulation and Animation
Abstract: Predictive models can be particularly helpful for robots to effectively manipulate terrains in construction sites and extraterrestrial surfaces. However, terrain state representations become extremely high-dimensional, especially when capturing fine-resolution details and when depth is unknown or unbounded. This paper introduces L-GBND, a learning-based approach for terrain dynamics modeling and manipulation, leveraging the Graph-based Neural Dynamics (GBND) framework to represent terrain deformation as motion of a graph of particles. Based on the principle that the moving portion of a terrain is usually localized, our approach builds a large terrain graph (potentially millions of particles) but identifies only a very small active subgraph (hundreds of particles) for predicting the outcomes of robot-terrain interaction. To minimize the size of the active subgraph, we introduce a learning-based approach that identifies a small region of interest (RoI) based on the robot's control inputs and the current scene. We also introduce a novel domain boundary feature encoding that allows GBND to perform accurate dynamics prediction in the RoI interior while avoiding particle penetration through RoI boundaries. Our proposed method is orders of magnitude faster than naive GBND and achieves better overall prediction accuracy. We further evaluated our framework on excavation and shaping tasks on terrain with different granularity.
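For intuition, active-subgraph selection can be reduced to a few lines. The sketch below uses a fixed ball around the tool plus a thin boundary shell, whereas the paper learns the RoI from control inputs and the scene.

```python
import numpy as np

# Sketch of active-subgraph selection for localized terrain dynamics: keep
# only particles inside a region of interest around the tool, plus a thin
# boundary shell. The fixed ball here is an illustrative stand-in for the
# learned RoI predictor.
rng = np.random.default_rng(0)
P = 100_000
pts = rng.uniform(0, 10, (P, 3))            # full terrain particle positions
tool = np.array([5.0, 5.0, 5.0])
r_core = 0.5

d = np.linalg.norm(pts - tool, axis=1)
core = d < r_core                           # active core particles
shell = (d < r_core + 0.1) & ~core          # boundary shell for the encoding

active = core | shell                       # subgraph fed to the dynamics GNN
print(f"{active.sum()} of {P} particles active "
      f"({core.sum()} core, {shell.sum()} boundary)")
```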
|
| |
| 16:55-17:05, Paper ThBT4.2 | Add to My Program |
| Acoustic Feedback for Closed-Loop Force Control in Robotic Grinding |
|
| Zhang, Zongyuan | Queensland University of Technology |
| Lehnert, Christopher | Queensland University of Technology |
| Browne, Will | Queensland University of Technology |
| Roberts, Jonathan | Queensland University of Technology |
Keywords: Industrial Robots, Force Control, Machine Learning for Robot Control
Abstract: Acoustic feedback is a critical indicator for assessing the contact condition between the tool and the workpiece when humans perform grinding tasks with rotary tools. In contrast, robotic grinding systems typically rely on force sensing, with acoustic information largely ignored. This reliance on force sensors is costly and difficult to adapt to different grinding tools, whereas audio sensors (microphones) are low-cost and can be mounted on any medium that conducts grinding sound. This paper introduces a low-cost Acoustic Feedback Robotic Grinding System (AFRG) that captures audio signals with a contact microphone, estimates grinding force from the audio in real time, and enables closed-loop force control of the grinding process. Compared with conventional force-sensing approaches, AFRG achieves a 4-fold improvement in consistency across different grinding disc conditions. AFRG relies solely on a low-cost microphone, which is approximately 200-fold cheaper than conventional force sensors, as the sensing modality, providing an easily deployable, cost-effective robotic grinding solution.
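The sensing-and-control loop can be sketched end to end. Below, synthetic audio, a linear band-energy regressor, and PI gains are all stand-ins for the learned force estimator and tuned controller.

```python
import numpy as np

# Sketch of the loop: estimate grinding force from short-time audio band
# energies with a calibrated regressor, then close a PI loop on the estimate.
# The synthetic audio, linear regressor, and gains are illustrative stand-ins.
rng = np.random.default_rng(0)
fs, frame = 16_000, 1024
w_cal = rng.normal(0, 0.1, 16)              # pretend calibration weights

def estimate_force(audio_frame):
    spec = np.abs(np.fft.rfft(audio_frame * np.hanning(frame)))
    bands = spec[: 16 * (len(spec) // 16)].reshape(16, -1).sum(axis=1)
    return float(w_cal @ np.log1p(bands))   # audio features -> force (N)

f_ref, integ, kp, ki, feed = 5.0, 0.0, 0.02, 0.005, 1.0
for _ in range(50):                         # toy closed loop
    audio = rng.normal(0, 0.1 + 0.02 * feed, frame)  # louder when pressing
    err = f_ref - estimate_force(audio)
    integ += err
    feed = np.clip(feed + kp * err + ki * integ, 0.0, 5.0)
print(f"final feed command: {feed:.3f}")
```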
|
| |
| 17:05-17:15, Paper ThBT4.3 | Add to My Program |
| Context-Triggered Contingency Games for Strategic Multi-Agent Interaction |
|
| Schweppe, Kilian | Max Planck Institute for Software Systems |
| Schmuck, Anne-Kathrin | Max Planck Institute for Software Systems |
Keywords: Multi-Robot Systems, Integrated Planning and Control, Optimization and Optimal Control
Abstract: We address the challenge of reliable and efficient interaction in autonomous multi-agent systems, where agents must balance long-term strategic objectives with short-term dynamic adaptation. We propose context-triggered contingency games, a novel integration of strategic games derived from temporal logic specifications with dynamic contingency games solved in real time. Our two-layered architecture leverages strategy templates to guarantee satisfaction of high-level objectives, while a new factor-graph–based solver enables scalable, real-time model predictive control of dynamic interactions. The resulting framework ensures both safety and progress in uncertain, interactive environments. We validate our approach through simulations and hardware experiments in autonomous driving and robotic navigation, demonstrating efficient, reliable, and adaptive multi-agent interaction.
|
| |
| 17:15-17:25, Paper ThBT4.4 | Add to My Program |
| Many-To-Many Multi-Agent Pickup and Delivery |
|
| Schneider, Ethan | Georgia Institute of Technology |
| Chen, Jingkai | Symbotic |
| Gu, Tianyi | Symbotic |
| Lian, Kunlei | Symbotic |
| Hutchinson, Seth | Northeastern University |
| Chernova, Sonia | Georgia Institute of Technology |
Keywords: Multi-Robot Systems
Abstract: Multi-robot systems in automated warehouses must manage continuous streams of pickup-and-delivery tasks while ensuring efficiency and safety. Prior work on Multi-Agent Pickup-and-Delivery (MAPD) has largely focused on the one-to-one variant, where each task has a fixed pickup and delivery location. In contrast, real warehouses often present many-to-many MAPD scenarios, where items, tracked by stock keeping unit (SKU) identifiers, can be retrieved from or stored at multiple locations, resulting in an NP-hard four-dimensional assignment problem. To solve the many-to-many MAPD problem, we contribute our algorithm: Many-to-Many Multi-Agent Pickup and Delivery (M2M). We experiment with two variants of our algorithm: one that minimizes estimated task durations (M2M), and one which incorporates SKU distribution into the objective function (M2M-wSKU). Simulation results over 8-hour warehouse operations show that our method consistently matches or outperforms prior state of the art, with M2M completing up to 22,000 more tasks on average across different environments and warehouse inventory densities.
|
| |
| 17:25-17:35, Paper ThBT4.5 | Add to My Program |
| A Passive Soft Wearable Suit with a Single Elastic Belt and Multiple Pulley System |
|
| Kim, Sangman | Pusan National University |
| Zou, Hanbo | Pusan National University |
| Jin, Sangeun | Pusan National University |
| Kwon, Junghan | Pusan National University |
Keywords: Soft Robot Applications, Wearable Robotics, Physically Assistive Devices
Abstract: Exoskeleton robots promise to enhance safety by supporting workers' back strength during heavy lifting tasks, thereby improving work efficiency and productivity. However, the components of these robots, such as exoskeletal structures, actuators, and batteries, often increase their size and weight, which can reduce wearability and mobility. To tackle this issue, we propose a lightweight, passive wearable suit designed to assist back muscles during lifting tasks. The proposed system features a single elastic belt connected to multiple pulleys attached to the back and lower limb sleeves. The relative distances between these pulleys change depending on body movements such as lifting or walking, thereby producing an effect similar to that of a moving pulley system. This innovative design allows the suit to deliver substantial support while efficiently distributing anchoring pressure across the wearer's skin during squatting and stooping positions. Additionally, the movement of belts through the pulleys minimizes the restrictions on gait motion compared to traditional designs. By adjusting the length of the belt, the assist mode can be easily turned on and off and flexibly adapted to various body sizes. The supporting force is characterized by modeling and experimental tests. We evaluated the immediate effect of the prototype passively supporting back muscles during lifting tasks and reducing gait restriction during walking tasks.
|
| |
| 17:35-17:45, Paper ThBT4.6 | Add to My Program |
| Tracailer: An Efficient Trajectory Planner for Tractor-Trailer Robots in Unstructured Environments (I) |
|
| Xu, Long | Zhejiang University |
| Chai, Kaixin | Xi'an Jiaotong University |
| An, Boyuan | Zhejiang University |
| Ji, Shuhang | Zhejiang University |
| Hou, Zhenyu | Huzhou Institute of Zhejiang University |
| Gan, JiaXiang | North China Electric Power University |
| Wang, Qianhao | Zhejiang University |
| Zhou, Yuan | Zhejiang University |
| Li, Xiaoying | Huzhou Institute of Zhejiang University |
| Lin, Junxiao | Zhejiang University |
| Han, Zhichao | Zhejiang University |
| Xu, Chao | Zhejiang University |
| Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
| Gao, Fei | Zhejiang University |
Keywords: Motion and Path Planning, Collision Avoidance, Nonholonomic Motion Planning
Abstract: We address trajectory planning for tractor-trailer robots, where additional trailers increase transport capacity but introduce complex nonholonomic kinematics, high-dimensional states, and deformable structures. We propose a lightweight, compact, high-order smooth trajectory representation and an efficiently solvable spatiotemporal optimization formulation. To handle deformability and collision avoidance, we directly deform trajectories in continuous space by exploiting collision-free regions, eliminating the need to build convex safe sets from seed points before each optimization. This avoids loss of feasible space and reduces sensitivity to initial guesses. A multi-terminal fast path search further provides high-quality initialization. Extensive simulations show severalfold efficiency gains over existing methods, with lower curvature and shorter durations. Real-world indoor and outdoor experiments on transport, loading, and unloading validate effectiveness. Code: https://github.com/Tracailer/Tracailer.
|
| |
| 17:45-17:55, Paper ThBT4.7 | Add to My Program |
| Robot Cell Modeling Via Exploratory Robot Motions |
|
| Meli, Gaetano | KUKA Deutschland GmbH |
| Dehio, Niels | KUKA Deutschland GmbH |
Keywords: Collision Avoidance, Physical Human-Robot Interaction, Software Tools for Robot Programming
Abstract: Generating a collision-free robot motion is crucial for safe applications in real-world settings. This requires an accurate model of all obstacle shapes within the constrained robot cell, which is particularly challenging and time-consuming. The difficulty is heightened in flexible production lines, where the environment model must be updated each time the robot cell is modified. Furthermore, sensor-based methods often necessitate costly hardware and calibration procedures, and can be influenced by environmental factors (e.g., light conditions or reflections). To address these challenges, we present a novel data-driven approach to modeling a cluttered workspace, leveraging solely the robot’s internal joint encoders to capture exploratory motions. By computing the corresponding swept volume, we generate a (conservative) mesh of the environment that is subsequently used for collision checking within established path planning and control methods. Our method significantly reduces the complexity and cost of classical environment modeling by removing the need for CAD files and external sensors. We validate the approach with the KUKA LBR iisy collaborative robot in a pick-and-place scenario. In less than three minutes of exploratory robot motions and less than four additional minutes of computation time, we obtain an accurate model that enables collision-free motions. Our approach is intuitive and easy to use, making it accessible to users without specialized technical knowledge.
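The core computation is easy to illustrate on a planar two-link arm: replay the joint log through forward kinematics and voxelize the swept volume, treating untouched voxels as potential obstacles. Link lengths, resolution, and the random joint log below are illustrative.

```python
import numpy as np

# Sketch of the core idea: voxelize the volume swept by the robot during
# exploratory motions (here a planar 2-link arm, forward kinematics from the
# joint log). Voxels never touched by the robot are conservatively treated as
# potential obstacles for later collision checking.
rng = np.random.default_rng(0)
L1, L2, res = 0.5, 0.4, 0.02
grid = np.zeros((100, 100), dtype=bool)       # 2 m x 2 m workspace

def link_points(q1, q2, n=25):                # sample points along both links
    s = np.linspace(0, 1, n)[:, None]
    elbow = np.array([L1 * np.cos(q1), L1 * np.sin(q1)])
    tip = elbow + [L2 * np.cos(q1 + q2), L2 * np.sin(q1 + q2)]
    return np.vstack([s * elbow, elbow + s * (tip - elbow)])

for q1, q2 in rng.uniform(-np.pi, np.pi, (2000, 2)):   # recorded joint log
    ij = ((link_points(q1, q2) + 1.0) / res).astype(int)
    ok = (ij >= 0).all(1) & (ij < 100).all(1)
    grid[ij[ok, 0], ij[ok, 1]] = True         # mark swept voxels as known-free

print(f"swept (known-free) fraction: {grid.mean():.2%}")
```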
|
| |
| 17:55-18:05, Paper ThBT4.8 | Add to My Program |
| Visual Scene Understanding-Based Task Planning for an Efficient Multipurpose Agricultural Robot System |
|
| Park, Yonghyun | Chonnam National University |
| Son, Hyoung Il | Chonnam National University |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Robotics and Automation in Life Sciences
Abstract: This study introduces a visual scene understanding (VSU) pipeline that fuses scene graph generation (SGG) with task planning for agricultural robots. Mask R-CNN detects fruits, leaves, and stems; object features then feed prediction heads for predicates and attributes such as rigidity and ripeness. The resulting graph triggers a rule-based planner that chooses among harvesting, pruning, or thinning and decides on single- or dual-arm execution. Evaluated on a re-annotated custom dataset, the full pipeline reaches 38.9% relationship R@50, 70.1% attribute R@50, 72.3% task-decision accuracy, and 53.7% cooperative-control accuracy. Results show that dual-arm selection is twice as sensitive to perception errors as task-type assignment. The work provides an agriculture-specific task planner that distinguishes flexible from rigid obstacles, demonstrating that relational and attribute cues improve perception in agricultural scenes.
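The rule-based trigger on top of the scene graph might look like the sketch below; the predicates, attribute names, and rules are illustrative guesses at the planner logic.

```python
# Sketch of a rule-based task trigger on a scene graph: given (subject,
# predicate, object) triplets with attributes, choose a task and an arm
# configuration. All names and rules here are illustrative assumptions.
scene = [
    {"subj": "fruit_1", "pred": "occluded_by", "obj": "leaf_2",
     "attrs": {"ripeness": "ripe", "obstacle_rigidity": "flexible"}},
    {"subj": "fruit_2", "pred": "attached_to", "obj": "stem_1",
     "attrs": {"ripeness": "unripe", "obstacle_rigidity": "rigid"}},
]

def plan(triplet):
    a = triplet["attrs"]
    if a["ripeness"] == "ripe" and triplet["pred"] == "occluded_by":
        # flexible occluder: one arm moves the leaf while the other harvests
        arms = "dual" if a["obstacle_rigidity"] == "flexible" else "single"
        return ("harvest", arms)
    if a["ripeness"] == "unripe":
        return ("thin", "single")
    return ("prune", "single")

for t in scene:
    print(t["subj"], "->", plan(t))
```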
|
| |
| 18:05-18:15, Paper ThBT4.9 | Add to My Program |
| MineInsight: A Multi-Sensor Dataset for Humanitarian Demining Robotics in Off-Road Environments |
|
| Malizia, Mario | Royal Military Academy |
| Hamesse, Charles | Royal Military Academy |
| Hasselmann, Ken | Royal Military Academy |
| De Cubber, Geert | Royal Military Academy |
| Tsiogkas, Nikolaos | KU Leuven |
| Demeester, Eric | KU Leuven |
| Haelterman, Rob | Royal Military Academy |
Keywords: Data Sets for Robotic Vision, Field Robots, Robotics in Hazardous Fields
Abstract: The use of robotics in humanitarian demining increasingly involves computer vision techniques to improve landmine detection capabilities. However, in the absence of diverse and realistic datasets, the reliable validation of algorithms remains a challenge for the research community. In this paper, we introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. The dataset features 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight is, to the best of our knowledge, the first dataset to integrate dual-view sensor scans from both an Unmanned Ground Vehicle and its robotic arm, offering multiple viewpoints to mitigate occlusions and improve spatial awareness. It features two LiDARs, as well as images captured at diverse spectral ranges, including visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR). Additionally, the dataset provides bounding boxes generated by an automated pipeline and refined with human supervision. We recorded approximately one hour of data in both daylight and nighttime conditions, resulting in around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms. Our dataset is available at https://github.com/mariomlz99/mineinsight.
|
| |