| |
Last updated on September 13, 2025. This conference program is tentative and subject to change.
Technical Program for Thursday October 23, 2025
|
ThAT1 |
401 |
Intention Recognition 1 |
Regular Session |
|
10:30-10:35, Paper ThAT1.1 | |
Gaze-Guided 3D Hand Motion Prediction for Detecting Intent in Egocentric Grasping Tasks |
|
He, Yufei | Delft University of Technology |
Zhang, Xucong | Delft University of Technology |
Stienen, Arno H.A. | Delft University of Technology |
Keywords: Intention Recognition, Deep Learning Methods, Neurorobotics
Abstract: Human intention detection with hand motion prediction is critical to driving upper-extremity assistive robots in neurorehabilitation applications. However, traditional methods relying on physiological signal measurement are restrictive and often lack environmental context. We propose a novel approach that predicts future sequences of both hand poses and joint positions. This method integrates gaze information, historical hand motion sequences, and environmental object data, adapting dynamically to the assistive needs of the patient without prior knowledge of the intended object for grasping. Specifically, we use a vector-quantized variational autoencoder for robust hand pose encoding together with an autoregressive generative transformer for effective hand motion sequence prediction. We demonstrate the usability of these techniques in a pilot study with healthy subjects. To train and evaluate the proposed method, we collect a dataset consisting of various types of grasp actions on different objects from multiple subjects. Through extensive experiments, we demonstrate that the proposed method can successfully predict sequential hand movement. In particular, the gaze information yields significant enhancements in prediction capability, especially with fewer input frames, highlighting the potential of the proposed method for real-world applications.
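To make the pipeline concrete, the following is a minimal illustrative sketch (not the authors' code) of the vector-quantization step at the core of a VQ-VAE hand-pose encoder; the codebook size, latent dimension, and frame count are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))          # 512 learned code vectors of dimension 64
z_e = rng.normal(size=(10, 64))                # encoder outputs for 10 hand-pose frames

# Nearest-neighbour lookup: each latent is replaced by its closest code vector.
dists = np.linalg.norm(z_e[:, None, :] - codebook[None, :, :], axis=-1)
indices = dists.argmin(axis=1)                 # discrete tokens the transformer predicts autoregressively
z_q = codebook[indices]                        # quantized latents passed to the decoder
print(indices[:5], z_q.shape)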
|
|
10:35-10:40, Paper ThAT1.2 | |
Online-HMM with Two-Layer Bayesian Method for Operator's Expected Speed Estimation in Teleoperated Gluing Tasks |
|
Zhou, Wenke | Huazhong University of Science and Technology |
Gao, Zhitao | Huazhong University of Science and Technology |
Chen, Chen | Wuhan University of Science and Technology |
Peng, Fangyu | Huazhong University of Science and Technology |
Zhang, Yukui | Huazhong University of Science and Technology |
Yan, Rong | Huazhong University of Science and Technology |
Tang, Xiaowei | Huazhong University of Science and Technology |
Wang, Yu | Huazhong University of Science and Technology |
Keywords: Telerobotics and Teleoperation, Human-Robot Collaboration, Intention Recognition
Abstract: In direct teleoperation tasks, the follower robot accomplishes tasks by strictly executing the inputs from the operator. However, the operator's physiological tremor seriously reduces the smoothness of the trajectory, especially in tasks that rely on the operator's experience such as gluing, and the random tremor is hard to describe and suppress online. To address this challenge, this paper proposes an Online Hidden Markov Model with two-layer Bayesian (TLB-OHMM) method that suppresses the tremor by estimating the operator's expected speed, recognizes new intentions online, and adapts to complex trajectories. First, the actual moving speed of the human hand is modeled as an HMM. Then an Online-HMM method based on the online expectation-maximization (EM) algorithm is introduced to shorten the training time and realize online updating of the HMM parameters. Finally, a two-layer Bayesian method is proposed to estimate the expected speed of the human hand. Experimental results in simulation and a real teleoperated gluing task show that the proposed method greatly reduces computation time and improves the quality of trajectories, especially for complex curved trajectories, compared with the traditional HMM-based method.
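As a rough illustration of the estimation idea (assumed two-state model and made-up numbers, not the TLB-OHMM implementation), the sketch below runs a recursive forward-filter update for a discrete HMM over expected-speed states from noisy hand-speed measurements.

import numpy as np

A = np.array([[0.9, 0.1],                      # transition matrix between "slow"/"fast" states
              [0.2, 0.8]])
means, std = np.array([0.02, 0.08]), 0.01      # Gaussian emission model for observed speed [m/s]

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

belief = np.array([0.5, 0.5])
for v_obs in [0.021, 0.025, 0.060, 0.081]:     # noisy hand-speed measurements
    belief = A.T @ belief                      # predict step
    belief *= gauss(v_obs, means, std)         # measurement update
    belief /= belief.sum()
    v_hat = belief @ means                     # expected (tremor-suppressed) speed
    print(f"estimated expected speed: {v_hat:.3f} m/s")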
|
|
10:40-10:45, Paper ThAT1.3 | |
GDTS: Goal-Guided Diffusion Model with Tree Sampling for Multi-Modal Pedestrian Trajectory Prediction |
|
Sun, Ge | The Hong Kong University of Science and Technology |
Wang, Sheng | Hong Kong University of Science and Technology |
Zhu, Lei | The Hong Kong University of Science and Technology (Guangzhou) |
Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Intention Recognition, Intelligent Transportation Systems, Human Detection and Tracking
Abstract: Accurate prediction of pedestrian trajectories is crucial for improving the safety of autonomous driving. However, this task is generally nontrivial due to the inherent stochasticity of human motion, which naturally requires the predictor to generate multi-modal predictions. Previous works leverage various generative methods, such as GANs and VAEs, for pedestrian trajectory prediction. Nevertheless, these methods may suffer from mode collapse and relatively low-quality results. The denoising diffusion probabilistic model (DDPM) has recently been applied to trajectory prediction due to its simple training process and powerful reconstruction ability. However, current diffusion-based methods do not fully utilize input information and usually require many denoising iterations, leading to long inference times or an additional network for initialization. To address these challenges and facilitate the use of diffusion models in multi-modal trajectory prediction, we propose GDTS, a novel Goal-Guided Diffusion Model with Tree Sampling for multi-modal trajectory prediction. Considering the "goal-driven" characteristics of human motion, GDTS leverages goal estimation to guide the generation of the diffusion network. A two-stage tree sampling algorithm is presented, which leverages common features to reduce the inference time and improve accuracy for multi-modal prediction. Experimental results demonstrate that our proposed framework achieves performance comparable to the state of the art with real-time inference speed on public datasets.
|
|
10:45-10:50, Paper ThAT1.4 | |
Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder |
|
Erdoğan, Enes | Istanbul Technical University |
Sariel, Sanem | Istanbul Technical University |
Aksoy, Eren Erdal | Halmstad University |
Keywords: Intention Recognition, Multi-Modal Perception for HRI, Deep Learning Methods
Abstract: Recognition of human manipulation actions in real-time is essential for safe and effective human-robot interaction and collaboration. The challenge lies in developing a model that is both lightweight enough for real-time execution and capable of generalization. While some existing methods in the literature can run in real-time, they struggle with temporal scalability, i.e., they fail to adapt to long-duration manipulations effectively. To address this, leveraging generalizable scene graph representations, we propose a new Factorized Graph Sequence Encoder network that not only runs in real-time but also scales effectively in the temporal dimension, thanks to its factorized encoder architecture. Additionally, we introduce a Hand Pooling operation, a simple pooling operation for more focused extraction of the graph-level embeddings. Our model outperforms the previous state-of-the-art real-time approach, achieving a 14.3% and 5.6% improvement in F1-macro score on the KIT Bimanual Action (Bimacs) Dataset and Collaborative Action (CoAx) Dataset, respectively. Moreover, we conduct an extensive ablation study to validate our network design choices. Finally, we compare our model with its architecturally similar RGB-based model on the Bimacs dataset and show the limitations of this model in contrast to ours on such an object-centric manipulation dataset. Our code and trained models are available at https://github.com/eneserdo/FGSE.
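The sketch below is an illustrative, assumption-laden reading of the Hand Pooling idea (node layout and feature dimension are invented): average the embeddings of the hand nodes in addition to all graph nodes.

import numpy as np

node_feats = np.random.rand(12, 32)            # 12 scene-graph node embeddings of dimension 32
hand_mask = np.zeros(12, dtype=bool)
hand_mask[[0, 1]] = True                       # suppose nodes 0 and 1 are the two hand nodes

global_pool = node_feats.mean(axis=0)                # standard mean pooling over all nodes
hand_pool = node_feats[hand_mask].mean(axis=0)       # hand-focused pooling
graph_embedding = np.concatenate([global_pool, hand_pool])
print(graph_embedding.shape)                   # graph-level embedding for the temporal encoder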
|
|
10:50-10:55, Paper ThAT1.5 | |
Cross-Activity sEMG-Driven Joint Angle Estimation Via Hybrid Attention Fusion: Bridging Traditional Features and Deep Spatial Representations |
|
Tang, Zhimin | South China University of Technology |
Deng, Xiaoyan | South China University of Technology |
Wen, Yinke | South China University of Technology |
Han, Xi | South China University of Technology |
Wu, Jiatong | South China University of Technology |
Yu, Zhu Liang | South China University of Technology |
Keywords: Intention Recognition, Prosthetics and Exoskeletons, Rehabilitation Robotics
Abstract: The growing prevalence of stroke necessitates advanced lower-limb exoskeleton control. This paper proposes HybridFusionAtt, a novel model for continuous joint angle estimation using surface electromyography (sEMG). Unlike conventional approaches, our framework uniquely integrates traditional time-domain features with CNN-extracted high-dimensional spatial features through an attention mechanism, where traditional features dynamically guide feature fusion as attention queries. The model was validated using data collected from eight participants performing four activities of daily living (walking, stair climbing, stair descending, and obstacle crossing). The proposed model achieves average R² values for knee and hip joint angle prediction of 0.8682 (walking), 0.8482 (obstacle crossing), 0.9294 (stair climbing), and 0.8676 (stair descending). Experimental results show that the proposed model significantly outperforms traditional LSTM and CNN-LSTM models in terms of accuracy and robustness, particularly in handling non-periodic actions such as obstacle crossing. The model achieves high performance by effectively fusing features and adaptively focusing on key features, enabling it to maintain robustness even under noisy conditions and significant individual differences. This demonstrates the model's broad application potential, especially in rehabilitation and prosthetic control systems.
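A minimal sketch of the described fusion idea follows (shapes and values are assumptions, not the HybridFusionAtt architecture): hand-crafted time-domain sEMG features act as the attention query over deep spatial feature vectors.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

q = np.random.rand(1, 16)                      # query built from traditional features (e.g. RMS, MAV)
K = np.random.rand(8, 16)                      # keys from 8 CNN-extracted spatial feature vectors
V = np.random.rand(8, 16)                      # values (same deep features)

attn = softmax(q @ K.T / np.sqrt(16))          # weights deciding which deep features to emphasize
fused = attn @ V                               # fused representation fed to the joint-angle regressor
print(attn.round(2), fused.shape)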
|
|
10:55-11:00, Paper ThAT1.6 | |
A Gait Phase Detection and Gait Spatio-Temporal Features Extraction Method Based on the Inertial Measurement Unit |
|
Fan, Shuai | University of Electronic Science and Technology of China |
Luo, Huiyong | Chengdu University of Technology |
Xiao, Yao | Chengdu University of Technology |
Liang, Ye | Chengdu University of Technology |
Su, Zelin | Chengdu University of Technology |
Song, Guangkui | University of Electronic Science and Technology of China |
Chen, Peng | University of Electronic Science and Technology of China |
Keywords: Intention Recognition, Rehabilitation Robotics
Abstract: The quantitative evaluation of improvements in physical function is crucial for patients with impaired motor function, such as stroke survivors, when conducting related rehabilitation training activities. In particular, a practical and easy-to-operate gait feature detection and extraction system for home use is urgently needed. In this study, a home gait feature extraction method based on the inertial measurement unit is proposed. The subjects' walking distance and speed are calculated using double integration, the number of strides is calculated using a local maximum peak approach, and the stance phase and swing phase are calculated using a local trough approach. Comparison results show that the average walking distance accuracy is about 91.32% and the average stride accuracy is about 96.55%. The proportion of the stance period (59.01%) and swing period (40.99%) estimated by the inertial measurement unit is close to the ratio of the two at normal speed. The experimental results demonstrate that the gait spatio-temporal features are retrieved with high accuracy. The proposed method facilitates gait evaluation in clinics and at home, including the extraction of gait features and real-time evaluation.
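A simplified sketch of this pipeline on synthetic data is shown below (the sampling rate and signal are assumptions): double-integrate forward acceleration for distance and count strides from local maxima.

import numpy as np
from scipy.signal import find_peaks

fs = 100.0                                     # sampling rate in Hz (assumption)
t = np.arange(0, 10, 1 / fs)
acc = 0.5 * np.sin(2 * np.pi * 1.0 * t)        # synthetic forward acceleration [m/s^2]

vel = np.cumsum(acc) / fs                      # first integral  -> velocity
dist = np.cumsum(vel) / fs                     # second integral -> walking distance
peaks, _ = find_peaks(acc, distance=int(0.5 * fs))   # local maxima approximate stride events

print(f"walking distance: {dist[-1]:.2f} m, strides: {len(peaks)}")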
|
|
11:00-11:05, Paper ThAT1.7 | |
Demonstration Based Explainable AI for Learning from Demonstration Methods |
|
Gu, Morris Sung | Monash University |
Croft, Elizabeth | University of Victoria |
Kulic, Dana | Monash University |
Keywords: Intention Recognition, Human-Robot Collaboration, Learning from Demonstration
Abstract: Learning from Demonstration (LfD) is a powerful type of machine learning that can allow novices to teach robots to complete various tasks. However, the learning process for these systems may still be difficult for novices to interpret and understand, making effective teaching challenging. Explainable artificial intelligence (XAI) aims to address this challenge by explaining a system to the user. In this work, we investigate XAI within LfD by implementing an adaptive explanatory feedback system on an inverse reinforcement learning (IRL) algorithm. The feedback is implemented by demonstrating selected learnt trajectories to users. The system adapts to user teaching by categorizing and then selectively sampling trajectories shown to a user, to show a representative sample of both successful and unsuccessful trajectories. The system was evaluated through a user study with 26 participants teaching a robot a multi-goal navigation task. The results of the user study demonstrated that the proposed explanatory feedback system can improve robot performance, teaching efficiency and the user's ability to predict the robot's goals and actions.
|
|
11:05-11:10, Paper ThAT1.8 | |
A Motion Logic Network for Pedestrian Motion Prediction (I) |
|
Guo, Jia | Nanyang Technological University |
Lv, Pengfei | Jiangsu University of Science and Technology |
Guo, Pengyu | Beihang University |
Li, Dongyu | Beihang University |
Keywords: Intention Recognition, Deep Learning Methods, Cognitive Modeling
Abstract: Accurate and fast motion prediction such as pedestrian motion prediction (PMP) is crucial for safe autonomous driving. Much research effort has been devoted to studying the reactive behaviors of pedestrians, such as the interaction between pedestrians or the interaction between pedestrians and the environment. However, compared with behavioral logic, the current motion state of pedestrians has a greater influence on the future trajectory. In this work, we propose a motion logic network (MLN) to improve both the accuracy and efficiency of pedestrian motion prediction. Compared with traditional data-driven neural networks, the concept of motion logic is directly introduced into the network so that training of the network is not required. In particular, instead of fitting the network based on inputs and target outputs, the proposed method directly adopts motion logic to predict the future trajectory. To illustrate the performance of the proposed MLN, several experiments have been performed. For acceleration and deceleration motion in practical experiments, the average displacement error (ADE) of MLN improves on that of the CVM by as much as 6.25%, while the final displacement error (FDE) improves by 11.1%. In terms of efficiency, MLN is a hundred times faster than LSTM. This indicates that motion logic plays an important role in the development of prediction algorithms for pedestrian motion. Note to Practitioners: This article is motivated by the effects of pedestrians' different motion states on the future trajectory. For example, the pedestrians' state may change drastically in some situations, such as dashing across the road or stopping in anticipation of a bicycle crossing the path. This work explores the use of pedestrians' physical motion states to develop a network that emphasizes the logical features of pedestrian motion so as to directly estimate the future motion trajectory and trend based on the motion logic. For the first time, the concept of motion logic and the resultant training-free motion logic network (MLN) are proposed, which take into account the accuracy and efficiency of the algorithm. Compared with state-of-the-art pedestrian prediction methods, the proposed method can better predict the trajectory of pedestrians in more complex motion states. Moreover, a series of simulations and practical experiments with pseudo-constant velocity motion and acceleration/deceleration motion were conducted to verify the performance of the proposed MLN.
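As a toy illustration of the training-free idea (not the MLN itself; the time step and history are invented), the sketch below estimates the pedestrian's current velocity and acceleration from recent positions and rolls the motion model forward without any learned parameters.

import numpy as np

dt = 0.4                                                            # seconds per frame (assumption)
hist = np.array([[0.0, 0.0], [0.5, 0.1], [1.1, 0.2], [1.8, 0.3]])   # past (x, y) positions

v = (hist[-1] - hist[-2]) / dt                                      # current velocity estimate
a = (hist[-1] - 2 * hist[-2] + hist[-3]) / dt ** 2                  # current acceleration estimate

future = [hist[-1] + v * (k * dt) + 0.5 * a * (k * dt) ** 2 for k in range(1, 5)]
print(np.round(future, 2))                                          # predicted positions for 4 future frames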
|
|
ThAT2 |
402 |
Industrial Robots and Actuators 1 |
Regular Session |
|
10:30-10:35, Paper ThAT2.1 | |
Time-Optimal Trajectory Generation with Multi-Level Continuous Kinodynamics Constraints |
|
Liu, Ruixuan | Carnegie Mellon University |
Liu, Changliu | Carnegie Mellon University |
Leu, Jessica En Shiuan | University of California, Berkeley |
Keywords: Industrial Robots, Manipulation Planning, Optimization and Optimal Control
Abstract: Time-optimal trajectory generation (TOTG) is critical in robotics applications to minimize travel time and increase robot task efficiency. To ensure the trajectory is feasible and executable by the robot, it is important to constrain the trajectory kinodynamics subject to the robot actuator limits. A typical actuator has multiple limits: 1) a peak limit, and 2) multi-level continuous limits with different operation time windows. The peak limit bounds the instantaneous kinodynamics (IKD), whereas the continuous limits bound the system's continuous kinodynamics (CKD). Existing works only constrain the IKD, usually by the actuator peak limit, to achieve time optimality. However, a joint capable of operating at its peak limit momentarily will overheat and shorten robot life if the motion continues. Alternatively, users can constrain the IKD with a reduced peak limit to avoid violating continuous limits. However, the reduced peak limit would inevitably sacrifice task efficiency. To address this challenge, this paper studies TOTG with both IKD and CKD constraints, and proposes TOTG-C. It formulates the TOTG problem as a nonlinear program (NLP). In particular, it proposes a novel formulation to encode the multi-level CKD constraints efficiently. To the best of our knowledge, TOTG-C is the first work that explicitly considers multi-level CKD constraints. We demonstrate the effectiveness and robustness of the proposed TOTG-C both in simulation and in real robot experiments.
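The sketch below illustrates the constraint idea with invented limits and a random torque profile (it is not the TOTG-C formulation): besides the instantaneous peak bound, the mean torque over each operating time window must stay below its continuous limit.

import numpy as np

dt = 0.01
tau = np.abs(np.random.randn(1000)) * 2.0            # joint torque profile [Nm]
peak_limit = 8.0                                     # instantaneous (IKD) bound
continuous_limits = {0.5: 4.0, 5.0: 2.5}             # {time window [s]: continuous (CKD) bound [Nm]}

ok = tau.max() <= peak_limit
for window, limit in continuous_limits.items():
    n = int(window / dt)
    running_mean = np.convolve(tau, np.ones(n) / n, mode="valid")
    ok &= running_mean.max() <= limit
print("trajectory respects all kinodynamic limits:", ok)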
|
|
10:35-10:40, Paper ThAT2.2 | |
SLAM-Based Performance Evaluation of an Industrial Robotic Arm |
|
Liao, Chieh-Yu | National Taiwan University |
Zhao, Yu-Lin | National Taiwan University |
Huang, Han-Pang | National Taiwan University |
Keywords: Industrial Robots, SLAM, Performance Evaluation and Benchmarking
Abstract: Performance evaluation is critical for ensuring the accuracy, efficiency, and reliability of industrial robotic arms. Traditional measurement methods offer high precision but are often costly, complex to install, and constrained by environmental factors. To address these limitations, this study proposes a SLAM-based performance evaluation method that leverages LiDAR to track robotic motion without requiring external calibration references. This approach provides a cost-effective and flexible alternative to conventional metrology techniques. However, integrating SLAM into the ISO 9283 framework presents challenges related to accuracy, stability, and measurement consistency. To assess its feasibility, this study evaluates the SLAM-based system by analyzing key performance parameters, ensuring its alignment with industrial requirements. The results demonstrate that the LiDAR-based SLAM system achieves an RMSE of 0.0353 mm in trajectory estimation, confirming its precision and stability. These findings validate the system’s capability as a reliable benchmarking tool for robotic arm performance assessment.
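A minimal sketch of the reported evaluation metric on synthetic data (assuming the estimated and reference trajectories are already aligned): the root-mean-square error between SLAM-estimated and reference positions.

import numpy as np

ref = np.random.rand(500, 3) * 100.0                 # reference tool positions [mm]
est = ref + np.random.normal(0, 0.03, ref.shape)     # SLAM estimate with small noise

rmse = np.sqrt(np.mean(np.sum((est - ref) ** 2, axis=1)))
print(f"trajectory RMSE: {rmse:.4f} mm")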
|
|
10:40-10:45, Paper ThAT2.3 | |
Removing Feasibility Conditions on Force Control for a Compliant Grinding Device with Asymmetric Full-State Constraints and Hysteresis Nonlinearity (I) |
|
Liu, Jidong | Nankai University |
Zhou, Lu | Nankai University |
Niu, Ben | Shandong Normal University |
Sun, Lei | Nankai University |
Keywords: Hydraulic/Pneumatic Actuators, Force Control, Robust/Adaptive Control
Abstract: A “planning and control” scheme with hysteresis compensation is developed in this article for a compliant grinding device (CGD) driven by a pneumatic actuator with asymmetric full-state constraints. First, in the planning part, we propose an improved hysteresis model and construct its inverse model through the modified inverse multiplicative structure method to deal with the hysteresis nonlinearity between air pressure and contact force in CGD. In the control part, a neural state observer (SO) is developed to observe the rate of change of air pressure by combining a neural network (NN) and an SO. By fusing an NN and a disturbance observer (DO), a neural DO is presented to observe external disturbances. Asymmetric full-state constraints are realized without the feasibility conditions (FCs) of the virtual controller by the fusion of the backstepping design process and nonlinear state-dependent transformations. It is proven that the air pressure closely tracks the planned desired air pressure, as the tracking error of the air pressure is uniformly ultimately bounded. Finally, hardware experiments show that the force control accuracy of the proposed method is within 2 N, which achieves a satisfactory force control effect and verifies the effectiveness and robustness of the presented method.
|
|
10:45-10:50, Paper ThAT2.4 | |
Adaptive Bayesian Optimization for High-Precision Motion Systems (I) |
|
König, Christopher | Inspire AG |
Krishnadas, Raamadaas | Inspire AG |
Balta, Efe | Inspire AG |
Rupenyan, Alisa | Zurich University of Applied Sciences |
Keywords: Probabilistic Inference, Industrial Robots, Intelligent and Flexible Manufacturing
Abstract: Controller tuning and parameter optimization are crucial in system design to improve closed-loop system performance. Bayesian optimization has been established as an efficient model-free controller tuning and adaptation method. However, Bayesian optimization methods are computationally expensive and therefore difficult to use in real-time critical scenarios. In this work, we propose a real-time, purely data-driven, model-free approach for adaptive control, by online tuning of low-level controller parameters. We base our algorithm on GoOSE, an algorithm for safe and sample-efficient Bayesian optimization, for handling performance and stability criteria. We introduce multiple computational and algorithmic modifications for computational efficiency and parallelization of optimization steps. We further evaluate the algorithm’s performance on a real precision-motion system utilized in semiconductor industry applications by modifying the payload and reference step size and comparing it to an interpolated constrained optimization-based baseline approach. Note to Practitioners—This work is motivated by developing a comprehensive control and optimization framework for high-precision motion systems. Precision motion is an integral application of advanced mechatronics and a cornerstone technology for high-value industrial processes such as semiconductor manufacturing. The proposed framework relies on data-driven optimization methods that can be designed by prescribing the desired system performance. By using a method based on Bayesian optimization and safe exploration, our method optimizes desired parameters based on the prescribed system performance. A key benefit is the incorporation of input and output constraints, which are satisfied throughout the optimization procedure. Therefore, the method is suitable for use in practical systems where safety or operational constraints are of concern. Our method includes a variable to incorporate contextual information, which we name the task parameter. Using this variable, users can input external changes, such as changing step sizes for a motion system or changing the weight on top of the motion system. We provide two parallel implementation variants of our framework to make it suitable for run-time operation under context changes and applicable to continuous operation in industrial systems. We demonstrate the optimization framework on simulated examples and in experiments on an industrial motion system to showcase its applicability in practice.
|
|
10:50-10:55, Paper ThAT2.5 | |
A Novel Parallel Kinematic Mechanism with Single Actuator for Multi-DoF Forming Machine |
|
Zheng, Fangyan | Wuhan University of Technology |
Xin, Shuai | Wuhan University of Technology |
Han, Xinghui | Wuhan University of Technology |
Hua, Lin | Wuhan University of Technology |
Keywords: Industrial Robots, Parallel Robots, Mechanism Design
Abstract: The parallel kinematic mechanism (PKM) is typically equipped with multiple actuators to realize precise and arbitrary spatial motion, but this results in complex mechanical and control systems and high cost. In many specific fields, however, such as heavy-load metal forming processes, the PKM only needs to perform a specific motion and the motion precision requirement is not extremely high. This paper proposes a new approximate mechanism synthesis method for a PKM with a single actuator (PKMSA), which not only simplifies the complexity of the PKM and reduces cost, but also realizes the specific motion within a permissible error. First, a design criterion of consistency between the motion pattern of the PKMSA and the required motion DoF is determined to avoid kinematic redundancy error. Then a general kinematic model for the PKMSA is derived based on screw theory, and the general constraints are obtained for the PKMSA to realize the specific motion. On this basis, a 3-RSS/S PKMSA configured with a single-input, triple-output actuator layout realized by a gear set is proposed. Finally, a heavy-load multi-DoF forming machine (load of 200 kN) with the PKMSA is developed and, with it, a multi-DoF forming experiment on a typical metal component is conducted with a forming load of about 180 kN. The geometric deviation of the formed component is in the range of -40 to 55 µm (which completely meets the forming accuracy requirement), validating the feasibility of the proposed approximate mechanism synthesis method for the PKMSA.
|
|
10:55-11:00, Paper ThAT2.6 | |
Soft Transistor Valve for Versatile and Fast Soft Vacuum Gripper |
|
Jang, Geunyeong | Pohang University of Science and Technology (POSTECH) |
Shin, Hyung Gon | POSTECH |
Chung, Wan Kyun | POSTECH |
Keywords: Industrial Robots, Grippers and Other End-Effectors
Abstract: Vacuum grippers are popular for their simplicity, reliability, and strength. However, their rigidity limits them to gripping only relatively flat objects. To design a soft vacuum gripper, leakage must be prevented through any suction cup that is not in contact with the object. Various soft valves have been proposed that mimic the working principle of conventional rigid valves. Still, they have limitations, such as the inability to grip objects with various surface characteristics and slow gripping speed. In this paper, we propose a new soft valve for a versatile and fast soft vacuum gripper. The proposed valve is inspired by the working principle of the air fuse, and the feasibility of the mechanism has been experimentally confirmed. Two performance indices are proposed for the optimization of the proposed valve. Analytical models are proposed to predict the pressure conservation and operating time according to the geometric parameters of the proposed valve, and the validity of the models is experimentally confirmed. Finally, a soft vacuum gripper with the proposed valves was fabricated to verify that it can grip objects with various surface characteristics rapidly.
|
|
11:00-11:05, Paper ThAT2.7 | |
Safe and Fluent Industrial Human Robot Collaboration Via Combination of PFL, SSM and Escape Trajectories |
|
Manzardo, Matteo | Free University of Bozen-Bolzano |
Vidoni, Renato | University of Udine |
Keywords: Industrial Robots, Human-Robot Collaboration, Physical Human-Robot Interaction
Abstract: In industrial collaborative robotics, safety, fluency as well as productivity are among the most important requirements to implement effective applications. In this work, a generalized formulation that combines the Power and Force Limiting and the Speed and Separation Monitoring criteria is firstly developed; then, a novel method that combines it with the possibility of executing an escape trajectory with the aim to enhance the levels of fluency and performance while guaranteeing human safety is presented. In particular, starting from a reference trajectory assigned to the manipulator to execute a specific task, a suitable joint acceleration is searched by formulating the problem as a Quadratic Programming problem under linearized kinematic and safety constraints. The method is then tested and experimentally validated. Results show, through fluency metrics, the effectiveness of the developed method with respect to the Power and Force Limiting criterion, the Speed and Separation Monitoring criterion as well as their combination.
|
|
11:05-11:10, Paper ThAT2.8 | |
A Novel GA-CP Method for Fixed-Type Multi-Robot Collaborative Scheduling in Flexible Job Shop (I) |
|
Huang, Jin | Huazhong University of Science and Technology |
Li, Xinyu | Huazhong University of Science and Technology |
Gao, Liang | Huazhong Univ. of Sci. & Tech |
Keywords: Industrial Robots, Multi-Robot Systems, Planning, Scheduling and Coordination
Abstract: With the rapid development of intelligent manufacturing, multi-robot collaborative systems are increasingly integrated into various production processes. In the flexible job shop environment of automotive stamping, achieving smooth operation and efficient manufacturing of production lines hinges on solving the critical issues of multi-robot task allocation and scheduling. However, for such fixed-type multi-robot collaboration problems, robots are constrained by specific areas or predetermined trajectories, and processing times can only be adjusted by varying the number of available robots. Therefore, the scheduling problem in multi-robot collaborative flexible job shop problems (MCFJSP) is divided into two sub-problems: FJSP with controllable processing times and multi-robot collaborative task balancing. To address these, we propose three distinct methods: mixed integer linear programming (MILP), constraint programming (CP), and a hybrid genetic algorithm-constraint programming (GA-CP). Finally, a set of 48 benchmark cases and two real-world cases are developed to test these methods. Comparative experiments demonstrate that the MILP model is superior in small-scale cases, while the GA-CP model exhibits the best overall performance in medium to large-scale cases. Furthermore, through comparisons with two advanced algorithms, the effectiveness and superiority of the GA-CP method in addressing real-world cases are confirmed. Note to Practitioners—In modern manufacturing environments, particularly in industries like automotive manufacturing, multiple robots working together on complex tasks are increasingly common. This paper addresses the practical challenge of effectively scheduling these robots to maximize efficiency while reducing the number of robots assigned to each task. This study introduces and compares different methods, including MILP, CP, and GA-CP methods, that can help practitioners determine the best way to allocate tasks among robots and schedule them efficiently. For example, in small-scale tasks, the MILP model can quickly provide the best solution. However, as the complexity and scale of the task increase, the GA-CP method becomes more practical, offering high-quality solutions within a reasonable timeframe. The study provides actionable insights that can be applied directly to real-world production scenarios, helping practitioners in industries like automotive stamping to maximize job shop productivity while reducing energy consumption losses in robot processing.
|
|
ThAT3 |
403 |
Physical Human-Robot Interaction 1 |
Regular Session |
Co-Chair: Yu, Pengkang | Amazon.com Inc |
|
10:30-10:35, Paper ThAT3.1 | |
Virtually Constrained Admittance Control Using Feedback Linearization for Physical Human-Robot Interaction with Rehabilitation Exoskeletons (I) |
|
Sun, Jianwei | University of California Los Angeles |
Foroutani, Yasamin | University of California, Los Angeles |
Rosen, Jacob | University of California, Los Angeles |
Keywords: Physical Human-Robot Interaction, Rehabilitation Robotics, Compliance and Impedance Control
Abstract: Robot-assisted rehabilitation focuses in part on path-based assist-as-needed reaching rehabilitation, which dynamically adapts the level of robot assistance during physical therapy to ensure patient progress along a predefined trajectory without over-reliance on the system. Additionally, bimanual exoskeletons have enabled asymmetric rehabilitation schemes, which leverage the patient's healthy side to guide the rehabilitation through interactions with objects in virtual reality that replicate activities of daily living. Within the context of physical human-robot interaction, these tasks can be formulated as constraints on the space of allowable motions. This study introduces a novel feedback linearization-inspired time-invariant admittance control scheme that enforces these motion constraints by isolating and stabilizing the component of the virtual dynamics transversal to the constraint. The methodology is applied to two rehabilitation tasks: (1) a path-guided reaching task with restoring force field, and (2) a bimanual interaction with a virtual object. Each task is then evaluated on one of two drastically different exoskeleton systems: (1) the V-Rex, a non-anthropomorphic full-body haptic device, and (2) the EXO-UL8, an anthropomorphic bimanual upper-limb exoskeleton. The two systems exist on opposite ends of the task/joint space control, non-redundant/redundant, off-the-shelf (industrial)/custom, non-anthropomorphic/anthropomorphic spectra. Experimental results validate and support the methodology as a generalizable approach to enabling constrained admittance control for rehabilitation robots.
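For readers unfamiliar with admittance control, the one-degree-of-freedom sketch below shows the kind of virtual dynamics such controllers render (the gains are arbitrary assumptions, and the paper's constraint-transversal decomposition is not reproduced): a virtual mass-damper-spring maps the measured interaction force to a commanded motion.

import numpy as np

M, D, K = 2.0, 15.0, 50.0          # virtual inertia, damping, stiffness (assumed values)
x, v = 0.0, 0.0                    # virtual position and velocity
dt = 0.001

for t in np.arange(0.0, 2.0, dt):
    f_ext = 5.0 if t > 0.5 else 0.0            # measured human interaction force [N]
    a = (f_ext - D * v - K * x) / M            # virtual admittance dynamics
    v += a * dt
    x += v * dt                                # commanded displacement sent to the robot
print(f"steady-state displacement ~ {x:.3f} m (expected {5.0 / K:.3f} m)")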
|
|
10:40-10:45, Paper ThAT3.3 | |
Multimodal Human Activity Recognition with a Large Language Model for Enhanced Human-Robot Interaction |
|
Khodabandelou, Ghazaleh | University of Paris-Est Créteil |
Chibani, Abdelghani | Lissi Lab Paris EST University |
Amirat, Yacine | University of Paris Est Créteil (UPEC) |
Keywords: Deep Learning Methods, Physical Human-Robot Interaction, Machine Learning for Robot Control
Abstract: This paper presents a novel framework for Human Activity Recognition (HAR) by unifying all sensor streams, visual, audio, and inertial, into a single textual domain, enabling the direct application of GPT-3 for multimodal data classification. Unlike traditional pipelines that use dedicated encoders for each modality, we show that converting sensor outputs into text tokens offers both simplicity and a powerful proof-of-concept for large language models (LLMs). To further boost performance, we introduce a composite loss function combining cross-entropy, Kullback-Leibler divergence, total variation, and multimodal consistency terms, ensuring both temporal smoothness and cross-modal alignment. We conduct extensive experiments on the CMU-MMAC dataset, achieving up to 98% accuracy and significantly outperforming baseline methods. We also demonstrate robustness under missing sensor streams via partial tokenization, maintaining strong performance despite sensor failures. These results highlight the potential of LLM-driven HAR for enhanced human-robot interaction in real-world scenarios, and pave the way for broader multimodal applications of next-generation language models.
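A conceptual sketch of the tokenization step follows (the token format is invented for illustration): inertial readings are summarized as a short textual description so a language model can consume all modalities as one token stream.

import numpy as np

window = np.random.normal(0, 1, (50, 3))             # 50 accelerometer samples (x, y, z)
mean, peak = window.mean(axis=0), np.abs(window).max(axis=0)

text_tokens = (
    f"IMU mean_x={mean[0]:+.2f} mean_y={mean[1]:+.2f} mean_z={mean[2]:+.2f} "
    f"peak_x={peak[0]:.2f} peak_y={peak[1]:.2f} peak_z={peak[2]:.2f}"
)
print(text_tokens)        # concatenated with visual/audio descriptions before being sent to the LLM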
|
|
10:45-10:50, Paper ThAT3.4 | |
Robot Behavior Adaptation in Physical Human-Robot Interactions Based on Learned Safety Preferences |
|
Majd, Keyvan | Toyota Research Institute NA |
Soltani Zarrin, Rana | Honda Research Institute - USA |
Keywords: Physical Human-Robot Interaction, Model Learning for Control, Safety in HRI
Abstract: Robots that can physically interact with humans in a safe and comfortable manner have the potential to revolutionize application domains like home assistance and nursing care. However, to become long-term companions, such robots must learn user-specific preferences and adapt their behaviors in real time. We propose a Constrained Partially Observable Markov Decision Process framework for modeling human safety preferences over representative variables like force, velocity, and proximity. These variables are modeled as adaptive linear constraints, with a belief over their upper bounds that is updated online based on noisy human feedback. By modeling the belief as phase dependent, the model captures varying preferences across different task phases. The robot then solves a hierarchical optimization to select actions that respect both the learned constraints and robot motion limits. Our method does not require offline training data and can be applied directly to diverse physical interaction tasks and operation modes (tele-operated or autonomous). A pilot study shows that our approach effectively learns user preferences and improves perceived safety while reducing user effort compared to baselines.
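A simplified sketch of the belief update over one preference variable is given below (the likelihood model and numbers are assumptions, not the paper's constrained POMDP formulation): a discrete belief over the user's preferred force upper bound is updated from binary feedback.

import numpy as np

grid = np.linspace(5, 30, 26)                  # candidate force upper bounds [N]
belief = np.ones_like(grid) / len(grid)        # uniform prior

def update(belief, applied_force, complained, softness=2.0):
    # If the user complained, bounds below the applied force become more likely.
    p_complain = 1.0 / (1.0 + np.exp((grid - applied_force) / softness))
    likelihood = p_complain if complained else 1.0 - p_complain
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = update(belief, applied_force=18.0, complained=True)
belief = update(belief, applied_force=12.0, complained=False)
bound = grid[np.argmax(belief)]                # constraint passed to the robot's action optimizer
print(f"estimated force upper bound: {bound:.1f} N")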
|
|
10:50-10:55, Paper ThAT3.5 | |
ORBiT: Optimizing Robot-Assisted Bite Transfer Leveraging a Real2Sim2Real Framework |
|
Chan, Sherwin Stephen | Nanyang Technological University |
Yow, J-Anne | Nanyang Technological University |
San, Yi Heng | Nanyang Technological University Singapore |
Ravichandram, Vasanthamaran | NTU Singapore |
Wang, Yifan | Nanyang Technological University |
Lim, Lek Syn | Nanyang Technological University |
Ang, Wei Tech | Nanyang Technological University |
Keywords: Physical Human-Robot Interaction, Physically Assistive Devices, Simulation and Animation
Abstract: Robot-assisted feeding has the potential to enhance the independence of individuals requiring assistance, yet the bite transfer process remains particularly challenging, especially for those with complex conditions. In this paper, we present ORBiT, a novel Real2Sim2Real framework designed to optimize bite transfer in robot-assisted feeding. By integrating motion capture-driven, high-fidelity soft-body simulation with systematic parameter tuning, ORBiT effectively replicates realistic head, neck and jaw dynamics during feeding interactions to provide a safe simulation-driven approach to optimize bite transfer strategies. In our approach, motion capture data drives a personalized dynamic head model that, together with a comprehensive parameter search over variables such as entry angle, exit angle, exit depth, height offset, and distance to mouth, identifies the bite transfer parameters that minimize contact forces on the user. The optimal parameters are then transferred to a real-world robotic system and validated through a pilot user study involving five subjects. Results from real user evaluations mirror the trends in simulation, indicating that bite transfer parameters, especially those related to entry and exit angles, substantially affect user comfort and overall satisfaction. Our findings validate that simulation-derived optimizations can effectively guide improvements in bite transfer strategies, laying the groundwork for a safe, personalized approach to robot-assisted feeding.
|
|
10:55-11:00, Paper ThAT3.6 | |
Stable Variable Impedance Control Via CLF-MPC for Physical Human-Robot Interaction |
|
Choi, SeungMin | Hanyang University |
Hwang, Soonwoong | Hanyang University |
Kim, Wansoo | Hanyang University ERICA |
Keywords: Physical Human-Robot Interaction, Compliance and Impedance Control, Optimization and Optimal Control
Abstract: Variable impedance control is a control strategy widely used in physical human-robot collaboration (pHRC) and physical human-robot interaction (pHRI). Variable stiffness and damping parameters improve adaptability to changing environments and enhance safety in human-robot interaction. However, these adaptive parameters can compromise the stability of the system without proper management, particularly in dynamic environments. To address this, we propose a real-time parameter prediction method for variable impedance control using model predictive control (MPC) with a Control Lyapunov Function (CLF). Unlike methods that set the terminal constraint as the equilibrium position, the proposed method guarantees system stability even when parameters change or external disturbances occur, ensuring safe and adaptive robot behavior. Moreover, the infeasibility issue is resolved by applying the CLF instead of relying on the equilibrium position. Furthermore, by considering stability throughout the prediction horizon, the stability of the system is strictly guaranteed. The proposed method was validated through comparative experiments against the method that sets the terminal constraint as the equilibrium position, in both simulations and real-world environments using the Franka Emika Panda robot. Through these experiments, the proposed controller demonstrated a significant reduction in parameter computation time, achieving approximately 97.13% and 96.20% faster computation in simulation tests compared to the conventional method, while consistently ensuring stability under various disturbances including human interaction, tool vibration, and contact loss scenarios.
|
|
11:00-11:05, Paper ThAT3.7 | |
A VisuoMotor Human-Robot Interaction Framework for Attention-Motion-Integrated Training |
|
Chen, Chen | University of Electronic Science and Technology of China |
Yuan, Shuhe | University of Electronic Science and Technology of China |
Zhang, Jingting | University of Electronic Science and Technology of China |
Mu, Fengjun | University of Electronic Science and Technology of China |
Zou, Chaobin | University of Electronic Science and Technology of China |
Cheng, Hong | University of Electronic Science and Technology |
Keywords: Physical Human-Robot Interaction, Rehabilitation Robotics, Modeling and Simulating Humans
Abstract: Focus of attention is one of the most influential factors facilitating motor training performance. Most robotic training methods have not well addressed the negative effect of divided attention on motor execution performance, resulting in limited rehabilitation efficiency for motor-cognitive dysfunction. In this study, we propose a novel visuomotor human-robot interaction framework that integrates a gaze-visual game and a force-movement robot to realize more efficient training of both attentional and motor function. An important novelty of this framework is a dynamical pattern recognition scheme for the hierarchically coupled behavior of attentional and motor execution, which facilitates efficient human-robot interaction from both cognitive and motor perspectives. Specifically, an attentional-motor dynamical system modeling method is first developed using the gaze, force, and movement data collected from the human under different attentional-motor behaviors. Then, an online dynamical pattern recognition scheme is designed with these models to recognize the human's attentional and motor behavior states online. The training robot system can dynamically adjust its parameters according to the recognition results to guide the collaboration of both attentional and motor training. Experimental studies are conducted to demonstrate the desired accuracy and efficiency of our designed approaches in attentional-motor behavior recognition and training.
|
|
11:05-11:10, Paper ThAT3.8 | |
Collision Detection for Low-Cost Robot Manipulators Using Probabilistic Residual Torque Modeling |
|
Shao, Yifei | University of Pennsylvania |
Singh, Baljeet | Amazon Inc |
Morozovsky, Nicholas | Accel Robotics Corporation |
Yu, Pengkang | Amazon.com Inc |
Keywords: Physical Human-Robot Interaction, Model Learning for Control, Collision Avoidance
Abstract: With the advancement of robot manipulator technologies, developing custom-made low-cost manipulator arms has gained increasing popularity in the field of robotics. However, these low-cost systems often lack precise sensing and control capabilities, making reliable collision detection particularly critical to ensure safe operation. Popular collision detection methods that estimate external disturbances, such as the generalized Momentum Observer (MO), critically rely on an accurate dynamics model, and precise system identification requires joint torque sensors that are expensive for low-cost manipulators. This paper presents a novel approach to improve collision detection for low-cost robot manipulators where modeling error is present and torque sensing is not available. We propose a probabilistic framework using Gaussian Mixture Models (GMM) to capture friction and other unmodeled dynamics of a robot manipulator. Instead of explicitly identifying model parameters, a GMM is created on the total residual torque of the MO while running an excitation trajectory. The GMM is then deployed in the main control loop to identify external disturbances from the residual torque of the same momentum observer. The approach is validated on a custom-made 4-Degrees-of-Freedom (DoF) robot arm with modeling error, unmodeled dynamics, and only joint position measurement. Our method achieves reliable detection of both hard and soft collisions, demonstrating a reduction in the false positive rate by more than 50% compared to conventional MO-based methods with the same true positive rate on the same hardware.
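The sketch below illustrates the detection scheme on synthetic residuals (the threshold, component count, and data are assumptions): fit a GMM to momentum-observer residuals from a collision-free run, then flag runtime samples whose likelihood falls below a threshold.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
residuals_free = rng.normal(0, 0.05, size=(5000, 4))      # 4-DoF residual torques, collision-free run

gmm = GaussianMixture(n_components=3, random_state=0).fit(residuals_free)
threshold = np.percentile(gmm.score_samples(residuals_free), 1)     # 1st-percentile log-likelihood

sample = np.array([[0.4, -0.3, 0.5, 0.2]])                # residual observed during a contact event
is_collision = gmm.score_samples(sample)[0] < threshold
print("collision detected:", is_collision)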
|
|
ThAT4 |
404 |
AI-Enabled Robotics 1 |
Regular Session |
Co-Chair: Zhao, Na | Dalian Maritime University |
|
10:30-10:35, Paper ThAT4.1 | |
Online Iterative Learning with Forward Simulation for Sub-Minimum End-Effector Displacement Positioning |
|
Qu, Weiming | Peking Universitiy |
Liu, Tianlin | Peking University |
Du, Jiawei | Peking University |
Wu, Xihong | Peking University |
Luo, Dingsheng | Peking University |
Keywords: AI-Enabled Robotics, Control Architectures and Programming, Reactive and Sensor-Based Planning
Abstract: Precision is a crucial performance indicator for robot arms. When interacting with humans, high precision enables a robot arm to be used effectively and safely, while low precision may lead to safety issues. Traditional methods for improving robot arm precision rely on error compensation. However, these methods are often not robust and lack adaptability. Learning-based methods offer greater flexibility and adaptability, but current research shows that they often fall short of high precision and struggle to handle many scenarios requiring it. In this paper, we propose a novel high-precision robot arm manipulation framework based on online iterative learning and forward simulation, which can achieve a positioning error (precision) smaller than the end-effector's physical minimum displacement. In other words, our proposed method can compensate for the precision limitation of the hardware structure of the robot arm. Furthermore, we consider the joint angular resolution of the real robot arm, which is usually neglected in related works. A series of experiments on both simulation and real UR3 robot arm platforms demonstrates that our proposed method is effective and promising. The related code will be available soon.
|
|
10:35-10:40, Paper ThAT4.2 | |
Triple-S: A Collaborative Multi-LLM Framework for Solving Long-Horizon Implicative Tasks in Robotics |
|
Jia, Zixi | Northeastern University |
Gao, Hongbin | Northeastern University |
Li, Fashe | SIASUN Robot Automation CO. Ltd |
Liu, Jiqiang | Northeastern University |
Li, Hexiao | Northeastern University |
Liu, Qinghua | Northeastern University |
Keywords: AI-Enabled Robotics, Task Planning, Manipulation Planning
Abstract: Leveraging Large Language Models (LLMs) to write policy code for controlling robots has gained significant attention. However, in long-horizon implicative tasks, this approach often results in API parameter, comment, and sequencing errors, leading to task failure. To address this problem, we propose a collaborative Triple-S framework that involves multiple LLMs. Through In-Context Learning, different LLMs assume specific roles in a closed-loop Simplification-Solution-Summary process, effectively improving success rates and robustness in long-horizon implicative tasks. Additionally, a novel demonstration library update mechanism that learns from successes allows it to generalize to previously failed tasks. We validate the framework on the Long-horizon Desktop Implicative Placement (LDIP) dataset across various baseline models, where Triple-S successfully executes 89% of tasks in both observable and partially observable scenarios. Experiments in both simulation and real-world robot settings further validate the effectiveness of Triple-S. Our code and dataset are available at: https://github.com/Ghbbbbb/Triple-S.
|
|
10:40-10:45, Paper ThAT4.3 | |
AnyBipe: An Automated End-To-End Framework for Training and Deploying Bipedal Robots Powered by Large Language Models |
|
Yao, Yifei | Shanghai Jiao Tong University |
He, Wentao | Shanghai Jiao Tong University |
Gu, Chenyu | SJTU |
Du, Jiaheng | Shanghai Jiao Tong University |
Tan, Fuwei | Shanghai Jiao Tong University |
Zhu, Zhen | Shanghai Jiao Tong University |
Lu, Junguo | Shanghai Jiaotong University |
Keywords: AI-Enabled Robotics, Humanoid and Bipedal Locomotion, Machine Learning for Robot Control
Abstract: Training and deploying reinforcement learning (RL) policies for robots is a complex task, requiring careful design of reward functions, sim-to-real transfer, and performance evaluation across various robot configurations. These tasks traditionally demand significant human expertise and effort. To address these challenges, this paper introduces AnyBipe, a novel, fully automated, end-to-end framework for training and deploying bipedal robots, leveraging large language models (LLMs) for reward function generation, while supervising model training, evaluation, and deployment. The framework integrates comprehensive quantitative metrics to assess policy performance, deployment effectiveness, and safety. Additionally, it allows users to incorporate prior knowledge and preferences, improving the accuracy and alignment of generated policies with expectations. We demonstrate how AnyBipe reduces human labor while maintaining high levels of accuracy and safety, examined on three different bipedal robots, showcasing its potential for autonomous RL training and deployment.
|
|
10:45-10:50, Paper ThAT4.4 | |
Data-Driven MPC for Attitude Control of Autonomous Underwater Robot |
|
Gao, Tianzhu | Dalian Maritime University |
Luo, Yudong | Dalian Maritime University |
Zhao, Na | Dalian Maritime University |
Wang, Jianda | Dalian Maritime University |
Yanyuanchu, Yanyuanchu | Dalian Maritime University, Dalian, Liaoning, 116026 China |
Fu, Xianping | Dalian Maritime University |
Luo, Xi | Yichang Testing Tech. Research Institution |
Shen, Yantao | University of Nevada, Reno |
Keywords: AI-Enabled Robotics, Model Learning for Control
Abstract: High maneuverability is essential to the autonomous operation of underwater robots. To achieve real-time maneuvering motion, the control strategy must take into account nonlinear hydrodynamic effects, which are extremely difficult to capture accurately during motion; a balance must therefore be struck between accuracy and real-time computational efficiency. To this end, this paper proposes a data-driven approach to model the dynamics of the underwater robot using Sparse Identification of Nonlinear Dynamics (SINDy). Compared with existing works, our method does not require any physical prior knowledge and only uses a short period of onboard sensor data. Subsequently, the learned dynamic model is incorporated into a model predictive controller (MPC) to enable precise attitude control. Finally, the proposed method is implemented on our fully vectored propulsion underwater robot, and a series of attitude tracking experiments is conducted in an indoor water tank. Experimental results reveal that our approach significantly improves the model accuracy and reduces the attitude tracking errors by over 79% at a control frequency of 20 Hz, which proves the effectiveness and real-time performance of the method.
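A minimal SINDy-style sketch on a toy one-dimensional system is shown below (it is not the robot's hydrodynamic model): regress the state derivative onto a library of candidate terms and threshold small coefficients to obtain a sparse dynamics model.

import numpy as np

dt = 0.01
t = np.arange(0, 10, dt)
x = np.zeros(len(t))
x[0] = 0.1
for k in range(len(t) - 1):
    x[k + 1] = x[k] + dt * (x[k] - x[k] ** 3)          # true dynamics: xdot = x - x^3
xdot = np.gradient(x, dt)

library = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])    # candidate terms
names = ["1", "x", "x^2", "x^3"]
xi = np.linalg.lstsq(library, xdot, rcond=None)[0]

for _ in range(10):                                     # sequentially thresholded least squares
    small = np.abs(xi) < 0.05
    xi[small] = 0.0
    keep = ~small
    xi[keep] = np.linalg.lstsq(library[:, keep], xdot, rcond=None)[0]

print(dict(zip(names, np.round(xi, 3))))                # expected: x coefficient ~ 1, x^3 coefficient ~ -1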
|
|
10:50-10:55, Paper ThAT4.5 | |
A Novel Terrain Classification System with Planar ECT Sensor |
|
Yang, Wenju | Beijing University of Posts and Telecommunications |
Shi, Duanpeng | Beijing University of Post and Communication |
Yang, Wuqiang | The University of Manchester |
Sun, Tengchen | Beijing Tashan Technology Co., Ltd |
Liu, Huaping | Tsinghua University |
Guo, Di | Beijing University of Posts and Telecommunications |
Keywords: AI-Enabled Robotics, Engineering for Robotic Systems
Abstract: Terrain classification is crucial for robotic navigation, especially in unknown environments. Existing terrain classification methods usually have high requirements on environmental conditions and robot motions, making them challenging to apply in real-world scenarios. In this paper, we develop a novel terrain classification system with a planar electrical capacitance tomography (ECT) sensor, which provides a non-contact, real-time, and cost-effective way to classify terrain. Specifically, we design a planar ECT sensor and integrate it at the bottom of a mobile robot. The proposed system leverages the collected capacitance measurements to reflect the inherent differences in dielectric permittivity across various terrain types. A multilayer perceptron network is used to fuse the collected capacitance and IMU measurements for classification. Additionally, a large-scale ECT dataset including 10 different types of terrain is collected with the proposed system. Extensive experiments are conducted, demonstrating the effectiveness and robustness of the proposed system.
|
|
10:55-11:00, Paper ThAT4.6 | |
LightPlanner: Unleashing the Reasoning Capabilities of Lightweight Large Language Models in Task Planning |
|
Zhou, Weijie | Beijing Jiaotong University |
Tao, Manli | Institute of Automation, Chinese Academy of Sciences |
Zhao, Chaoyang | Institute of Automation, Chinese Academy of Sciences |
Dong, Honghui | Beijing Jiaotong University |
Tang, Ming | Institute of Automation, Chinese Academy of Sciences |
Wang, Jinqiao | Institute of Automation, Chinese Academy of Sciences |
Keywords: AI-Enabled Robotics, Task Planning, Embedded Systems for Robotic and Automation
Abstract: In recent years, lightweight large language models (LLMs) have garnered significant attention in the robotics field due to their low computational resource requirements and suitability for edge deployment. However, in task planning—particularly for complex tasks that involve dynamic semantic logic reasoning—lightweight LLMs have underperformed. To address this limitation, we propose a novel task planner, LightPlanner, which enhances the performance of lightweight LLMs in complex task planning by fully leveraging their reasoning capabilities. Unlike conventional planners that use fixed skill templates, LightPlanner controls robot actions via parameterized function calls, dynamically generating parameter values. This approach allows for fine-grained skill control and improves task planning success rates in complex scenarios. Furthermore, we introduce hierarchical deep reasoning. Before generating each action decision step, LightPlanner thoroughly considers three levels: action execution (feedback verification), semantic parsing (goal consistency verification), and parameter generation (parameter validity verification). This ensures the correctness of subsequent action controls. Additionally, we incorporate a memory module to store historical actions, thereby reducing context length and enhancing planning efficiency for long-term tasks. We train the LightPlanner-1.5B model on our LightPlan-3k dataset, which comprises action chains consisting of tasks with 2 to 8 steps. Experiments demonstrate that our model achieves the highest task success rate despite having the smallest number of parameters. In tasks involving spatial semantic reasoning, the success rate exceeds that of ReAct by 14.9%. Moreover, we demonstrate LightPlanner's potential to operate on edge devices.
|
|
11:00-11:05, Paper ThAT4.7 | |
Robotic Task Ambiguity Resolution Via Natural Language Interaction |
|
Chisari, Eugenio | University of Freiburg |
von Hartz, Jan Ole | University of Freiburg |
Despinoy, Fabien | Toyota Motor Europe |
Valada, Abhinav | University of Freiburg |
Keywords: AI-Enabled Robotics, Natural Dialog for HRI, Deep Learning for Visual Perception
Abstract: Language-conditioned policies have recently gained substantial adoption in robotics as they allow users to specify tasks using natural language, making them highly versatile. While much research has focused on improving the action prediction of language-conditioned policies, reasoning about task descriptions has been largely overlooked. Ambiguous task descriptions often lead to downstream policy failures due to misinterpretation by the robotic agent. To address this challenge, we introduce AmbResVLM, a novel method that grounds language goals in the observed scene and explicitly reasons about task ambiguity. We extensively evaluate its effectiveness in both simulated and real-world domains, demonstrating superior task ambiguity detection and resolution compared to recent state-of-the-art baselines. Finally, real robot experiments show that our model improves the performance of downstream robot policies, increasing the average success rate from 69.6% to 97.1%. We make the data, code, and trained models publicly available at https://ambres.cs.uni-freiburg.de.
|
|
11:05-11:10, Paper ThAT4.8 | |
FlowPlan: Zero-Shot Task Planning with LLM Flow Engineering for Robotic Instruction Following |
|
Lin, Zijun | Southern University of Science and Technology |
Tang, Chao | Southern University of Science and Technology |
Ye, Hanjing | Southern University of Science and Technology |
Zhang, Hong | Southern University of Science and Technology |
Keywords: AI-Enabled Robotics, Autonomous Agents, Vision-Based Navigation
Abstract: Robotic instruction following tasks require seamless integration of visual perception, task planning, target localization, and motion execution. However, existing task planning methods for instruction following are either data-driven or underperform in zero-shot scenarios due to difficulties in grounding lengthy instructions into actionable plans under operational constraints. To address this, we propose FlowPlan, a structured multi-stage LLM workflow that elevates the zero-shot pipeline and bridges the performance gap between zero-shot and data-driven in-context learning methods. By decomposing the planning process into modular stages--task information retrieval, language-level reasoning, symbolic-level planning, and logical evaluation--FlowPlan generates logically coherent action sequences while adhering to operational constraints and further extracts contextual guidance for precise instance-level target localization. Benchmarked on the ALFRED benchmark and validated in real-world applications, our method achieves competitive performance relative to data-driven in-context learning methods and demonstrates adaptability across diverse environments. This work advances zero-shot task planning in robotic systems without reliance on labeled data. Project website: https://instruction-following-project.github.io/.
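As an illustration of how such a staged workflow can be chained, the sketch below wires the four stages named in the abstract into a single pipeline; the llm() stub and all function names are assumptions for illustration, not FlowPlan's actual prompts or interfaces.

# Minimal sketch of a staged, zero-shot planning workflow. The llm() stub
# stands in for any language model call; all names here are hypothetical.
def llm(prompt: str) -> str:
    return "stub response to: " + prompt          # replace with a real LLM call

def retrieve_task_info(instruction: str) -> str:
    return llm(f"Extract objects, locations and constraints from: {instruction}")

def language_level_reasoning(info: str) -> str:
    return llm(f"Describe the sub-goals, in plain language, given: {info}")

def symbolic_level_planning(reasoning: str) -> list:
    # in practice the response would be parsed into symbolic actions
    return [line.strip() for line in llm(f"List one action per line for: {reasoning}").splitlines()]

def logical_evaluation(plan: list) -> bool:
    verdict = llm(f"Answer yes/no: is this plan executable under the constraints? {plan}")
    return "yes" in verdict.lower()

def flowplan(instruction: str) -> list:
    info = retrieve_task_info(instruction)
    reasoning = language_level_reasoning(info)
    plan = symbolic_level_planning(reasoning)
    return plan if logical_evaluation(plan) else []

if __name__ == "__main__":
    print(flowplan("Put the chilled apple on the kitchen table."))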
|
|
ThAT5 |
407 |
Formal Method in Robotics and Automation 1 |
Regular Session |
Chair: Sun, Zhiyong | Peking University (PKU) |
|
10:30-10:35, Paper ThAT5.1 | |
Q-Learning-Based Optimal Force-Tracking Control of Grinding Robots in Uncertain Environments |
|
Yang, Rui | Beihang University |
Wu, Han | Beihang University |
Zheng, Jianying | Beihang University |
Wang, Xinyu | Beihang University |
Hu, Qinglei | Beihang University |
Keywords: Force Control, Machine Learning for Robot Control, Industrial Robots
Abstract: This paper proposes a novel Q-learning-based dual-loop force tracking control framework for robot grinding tasks in uncertain environments. A complete system state-space model is established, incorporating interaction dynamics and the desired force. By augmenting the system state, a discounted cost function is defined to quantify the tracking errors of the force and reference trajectory. The modified Q-learning method is systematically designed to iteratively compute the optimal control gain in a model-free manner. To mitigate force overshoot during the transition from free space to contact space, a force reference model and a transition mechanism for the control gain are designed. Simulations and experiments validate the method's effectiveness in precise force tracking with minimal overshoot and robustness to environmental variations.
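For readers unfamiliar with model-free gain computation, the sketch below illustrates the general flavor of Q-learning for a discounted LQ problem: a quadratic Q-function is fit from input-state data by least squares, and the feedback gain is extracted from it. The toy plant, cost weights, and dimensions are assumptions for illustration and do not reproduce the paper's dual-loop force-tracking formulation.

# Sketch of model-free policy iteration for a discounted LQ problem.
import numpy as np

np.random.seed(0)
A = np.array([[1.0, 0.1], [0.0, 0.9]])   # toy plant, used only to simulate data
B = np.array([[0.0], [0.1]])
Qc, Rc, gamma = np.eye(2), np.eye(1), 0.95
n, m = 2, 1
K = np.zeros((m, n))                      # initial feedback gain

def quad_features(z):
    # upper-triangular terms of z z^T; off-diagonal terms doubled so that
    # theta . features equals z^T H z for a symmetric H
    zz = np.outer(z, z)
    i, j = np.triu_indices(len(z))
    return np.where(i == j, 1.0, 2.0) * zz[i, j]

for _ in range(10):                       # policy iteration
    Phi, y = [], []
    x = np.random.randn(n)
    for _ in range(200):                  # collect data under the current gain
        u = -K @ x + 0.1 * np.random.randn(m)     # exploration noise
        cost = x @ Qc @ x + u @ Rc @ u
        x_next = A @ x + B @ u
        z, z_next = np.concatenate([x, u]), np.concatenate([x_next, -K @ x_next])
        Phi.append(quad_features(z) - gamma * quad_features(z_next))
        y.append(cost)
        x = x_next
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    H = np.zeros((n + m, n + m))
    H[np.triu_indices(n + m)] = theta
    H = H + H.T - np.diag(np.diag(H))     # rebuild the symmetric Q-function matrix
    K = np.linalg.solve(H[n:, n:], H[n:, :n])   # greedy gain extracted from the Q-function

print("learned feedback gain:", K)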
|
|
10:35-10:40, Paper ThAT5.2 | |
Force Control Using Internal Spring in Electrostatic Linear Motors and Switching between Position and Force Control (I) |
|
Osada, Masahiko | The University of Tokyo / Honda R&D Co. Ltd |
Zhang, Guangwei | The University of Tokyo |
Yoshimoto, Shunsuke | The University of Osaka |
Yamamoto, Akio | The University of Tokyo |
Keywords: Force Control, Actuation and Joint Mechanisms, Soft Sensors and Actuators
Abstract: This article proposes a method for controlling force and position in a synchronous direct-drive electrostatic linear motor. Using a spring-like behavior of synchronous motors, the proposed controller regulates the contact force in a manner similar to that of series elastic actuators. Discussions regarding similarities and differences between series elastic actuators and the proposed method imply that their dynamic behaviors are different. The proposed controller consists of a force control part and a position control part, one of which is automatically selected based on the operating conditions. The position and force controllers each independently command the velocity based on the position or force error. The lower of the two velocities is selected, allowing a smooth automatic transition between the two control modes. Based on the selected velocity, the driving signal is calculated under the assumption of synchronous operation and fed to the actuator. The proposed control method is simulated and experiments are performed using prototype electrostatic linear motors. The experimental results confirm the smooth transition between the position and force control modes, as well as the good controllability of the contact force.
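The velocity-selection rule described in the abstract (each controller proposes a velocity and the lower one is executed) is simple enough to sketch directly; the gains and sign conventions below are illustrative assumptions only.

# Minimal sketch of the velocity-selection idea: a position controller and a
# force controller each propose a velocity, and the smaller one is forwarded
# to the drive. Gains and signs are illustrative assumptions.
def position_velocity(pos_ref: float, pos: float, kp: float = 5.0) -> float:
    return kp * (pos_ref - pos)

def force_velocity(force_ref: float, force: float, kf: float = 0.02) -> float:
    # with a spring-like motor, advancing increases contact force, so the
    # commanded velocity shrinks as the force error closes
    return kf * (force_ref - force)

def commanded_velocity(pos_ref, pos, force_ref, force) -> float:
    v_pos = position_velocity(pos_ref, pos)
    v_force = force_velocity(force_ref, force)
    # selecting the lower velocity yields an automatic, smooth hand-over
    # between position control (free motion) and force control (contact)
    return min(v_pos, v_force)

if __name__ == "__main__":
    print(commanded_velocity(pos_ref=0.10, pos=0.02, force_ref=1.0, force=0.0))   # force limits
    print(commanded_velocity(pos_ref=0.10, pos=0.099, force_ref=1.0, force=0.0))  # position limits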
|
|
10:40-10:45, Paper ThAT5.3 | |
Constrained Reinforcement Learning Using Distributional Representation for Trustworthy Quadrotor UAV Tracking Control (I) |
|
Wang, Yanran | Imperial College London |
Boyle, David | Imperial College London |
Keywords: Formal Methods in Robotics and Automation, Autonomous Vehicle Navigation
Abstract: Simultaneously accurate and reliable tracking control for quadrotors in complex dynamic environments is challenging. The chaotic nature of aerodynamics, derived from drag forces and moment variations, makes precise identification difficult. Consequently, many existing quadrotor tracking systems treat these aerodynamic effects as simple 'disturbances' in conventional control approaches. We propose a novel and interpretable trajectory tracker integrating a distributional Reinforcement Learning (RL) disturbance estimator for unknown aerodynamic effects with a Stochastic Model Predictive Controller (SMPC). Specifically, the proposed estimator, the Constrained Distributional REinforced-Disturbance-estimator (ConsDRED), effectively identifies uncertainties between the true and estimated values of aerodynamic effects. Control parameterization employs simplified affine disturbance feedback to ensure convexity, which is seamlessly integrated with the SMPC. We theoretically guarantee that ConsDRED achieves an optimal global convergence rate, and sublinear rates when constraints are violated, with the approximation error decreasing as the neural network dimensions increase. To demonstrate practicality, we show convergent training in simulation and real-world experiments, and empirically verify that ConsDRED is less sensitive to hyperparameter settings compared with canonical constrained RL. Our system reduces accumulative tracking errors by at least 70% compared with the recent state of the art. Importantly, the proposed ConsDRED-SMPC framework balances the trade-off between pursuing high performance and obeying conservative constraints for practical implementations.
|
|
10:45-10:50, Paper ThAT5.4 | |
Towards Safe Reinforcement Learning with Reduced Conservativeness: A Case Study on Drone Flight Control |
|
Hadjiloizou, Loizos | KTH Royal Institute of Technology |
Welle, Michael C. | KTH Royal Institute of Technology |
Yin, Hang | University of Copenhagen |
Kragic, Danica | KTH |
Keywords: Formal Methods in Robotics and Automation, Robot Safety, Reinforcement Learning
Abstract: Incorporating formal methods into reinforcement learning (RL) has the potential to offer the best of both worlds, combining the robustness of formal guarantees with the adaptability and learning capabilities of RL, though careful design is needed to balance safety and exploration. In this work, we propose a framework that mitigates this loss of exploration while still ensuring the safety of the system. Specifically, we introduce a less restrictive method that reduces the conservativeness of formal methods by refining a disturbance model with online collected data, and that evaluates the safety of a learning-based controller using computationally efficient zonotopic reachability analysis to facilitate a real-time implementation. We validate the framework in a real-world drone flight through a canyon, where the drone is subjected to unknown external disturbances and the framework is tasked with learning those disturbances online and adjusting the safety guarantees accordingly. The results show that the framework enables less restrictive online training of learning-based controllers without compromising the safety of the system.
|
|
10:50-10:55, Paper ThAT5.5 | |
Risk-Aware Autonomous Driving with Linear Temporal Logic Specifications |
|
Qi, Shuhao | Eindhoven University of Technology |
Zhang, Zengjie | Eindhoven University of Technology |
Sun, Zhiyong | Peking University (PKU) |
Haesaert, Sofie | Eindhoven University of Technology |
Keywords: Formal Methods in Robotics and Automation, Autonomous Vehicle Navigation
Abstract: Human drivers naturally balance the risks of different concerns while driving, including traffic rule violations, minor accidents, and fatalities. However, achieving the same behavior in autonomous driving systems remains an open problem. This paper extends a risk metric that has been verified in human-like driving studies to encompass more complex driving scenarios specified by linear temporal logic (LTL) that go beyond just collision risks. This extension incorporates the timing and severity of events into LTL specifications, thereby reflecting a human-like risk awareness. Without sacrificing expressivity for traffic rules, we adopt LTL specifications composed of safety and co-safety formulas, allowing the control synthesis problem to be reformulated as a reachability problem. By leveraging occupation measures, we further formulate a linear programming (LP) problem for this LTL-based risk metric. Consequently, the synthesized policy balances different types of driving risks, including both collision risks and traffic rule violations. The effectiveness of the proposed approach is validated in three typical traffic scenarios in the CARLA simulator.
|
|
10:55-11:00, Paper ThAT5.6 | |
Online Synthesis of Control Barrier Functions with Local Occupancy Grid Maps for Safe Navigation in Unknown Environments |
|
Zhang, Yuepeng | Shanghai Jiao Tong University |
Chen, Yu | Shanghai Jiao Tong University |
Li, Yuda | Shanghai Jiao Tong University |
Li, Shaoyuan | Shanghai Jiao Tong University |
Yin, Xiang | Shanghai Jiao Tong Univ |
Keywords: Formal Methods in Robotics and Automation, Reactive and Sensor-Based Planning, Robot Safety
Abstract: Control Barrier Functions (CBFs) have emerged as an effective and non-invasive safety filter for ensuring the safety of autonomous systems in dynamic environments with formal guarantees. However, most existing works on CBF synthesis focus on fully known settings. Synthesizing CBFs online based on perception data in unknown environments poses particular challenges. Specifically, this requires the construction of CBFs from high-dimensional data efficiently in real time. This paper proposes a new approach for online synthesis of CBFs directly from local Occupancy Grid Maps (OGMs). Inspired by steady-state thermal fields, we show that the smoothness requirement of CBFs corresponds to the solution of the steady-state heat conduction equation with suitably chosen boundary conditions. By leveraging the sparsity of the coefficient matrix in Laplace’s equation, our approach allows for efficient computation of safety values for each grid cell in the map. Simulation and real-world experiments demonstrate the effectiveness of our approach. Specifically, the results show that our CBFs can be synthesized within milliseconds on average on a 200×200 grid map, highlighting the method's real-time applicability.
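The heat-conduction analogy can be illustrated with a small example: occupied cells and the map border are treated as Dirichlet boundaries and the interior is solved with a sparse 5-point Laplacian, yielding a smooth per-cell value. The grid size and boundary values below are assumptions for illustration, not the paper's exact construction.

# Sketch: smooth per-cell safety values from an occupancy grid by solving
# Laplace's equation with Dirichlet boundaries (obstacles low, border high).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

N = 40
occ = np.zeros((N, N), dtype=bool)
occ[15:25, 15:25] = True                     # a square obstacle in the middle

value = np.full((N, N), np.nan)              # NaN marks free interior cells
value[occ] = -1.0                            # assumed boundary value at obstacles
value[0, :] = value[-1, :] = value[:, 0] = value[:, -1] = 1.0   # assumed border value

idx = np.arange(N * N).reshape(N, N)
rows, cols, data = [], [], []
b = np.zeros(N * N)
for i in range(N):
    for j in range(N):
        k = idx[i, j]
        if not np.isnan(value[i, j]):        # Dirichlet cell: pin its value
            rows.append(k); cols.append(k); data.append(1.0)
            b[k] = value[i, j]
            continue
        rows.append(k); cols.append(k); data.append(4.0)      # 5-point Laplacian stencil
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rows.append(k); cols.append(idx[i + di, j + dj]); data.append(-1.0)

A = sp.csr_matrix((data, (rows, cols)), shape=(N * N, N * N))
h = spla.spsolve(A, b).reshape(N, N)         # sparse solve is fast at this size

print("value next to the obstacle:", round(h[14, 20], 3))
print("value near the free border:", round(h[2, 20], 3))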
|
|
11:00-11:05, Paper ThAT5.7 | |
Real-Time Guaranteed Monitoring for a Drone Using Interval Analysis and Signal Temporal Logic |
|
Besset, Antoine | ENSTA Paris |
Alexandre dit Sandretto, Julien | ENSTA Paris |
Tillet, Joris | ENSTA Paris |
Keywords: Formal Methods in Robotics and Automation, Robot Safety, Dynamics
Abstract: This paper presents a guaranteed model-based approach for monitoring drone trajectories, providing real-time guarantees with a simplified dynamic model. ROS components are introduced for real-time implementation, enabling monitoring and adjustments in both simulations and actual systems. We extend the application of set-based simulation by formalizing timing conditions with Signal Temporal Logic (STL) and incorporating Boolean interval arithmetic to handle undetermined behaviors. The method compares model-based fault prediction using a stochastic approach with a set-based method, which manages bounded uncertainties and offers guarantees. Experimental validation, including comparisons against Monte Carlo methods, demonstrates the approach's ability to ensure safety in worst-case scenarios while remaining suitable for real-time processing.
|
|
ThAT6 |
301 |
Deep Learning in Grasping and Manipulation 5 |
Regular Session |
|
10:30-10:35, Paper ThAT6.1 | |
Disambiguate Gripper State in Grasp-Based Tasks: Pseudo-Tactile As Feedback Enables Pure Simulation Learning |
|
Yang, Yifei | Zhejiang University |
Chen, Lu | Zhejiang University |
Song, Zherui | Zhejiang University |
Chen, Yenan | Southern University of Science and Technology |
Sun, WenTao | Beijing Institute of Technology |
Zhou, Zhongxiang | Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Grasp-based manipulation tasks are fundamental to robots interacting with their environments, yet gripper state ambiguity significantly reduces the robustness of imitation learning policies for these tasks. Data-driven solutions face the challenge of high real-world data costs, while simulation data, despite its low costs, is limited by the sim-to-real gap. We identify the root cause of gripper state ambiguity as the lack of tactile feedback. To address this, we propose a novel approach employing pseudo-tactile as feedback, inspired by the idea of using a force-controlled gripper as a tactile sensor. This method enhances policy robustness without additional data collection and hardware involvement, while providing a noise-free binary gripper state observation for the policy and thus facilitating pure simulation learning to unleash the power of simulation. Experimental results across three real-world grasp-based tasks demonstrate the necessity, effectiveness, and efficiency of our approach.
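A minimal sketch of the pseudo-tactile idea follows, assuming a force-controlled gripper whose measured width stalls above its fully closed width when an object is held; the thresholds and field names are hypothetical, not the paper's interface.

# Hedged sketch of a pseudo-tactile signal from a force-controlled gripper:
# if the fingers stall at a width clearly larger than the fully closed width
# while force-controlled closing is active, an object is assumed to be held.
from dataclasses import dataclass

@dataclass
class GripperState:
    width: float                  # measured finger opening [m]
    closing: bool                 # force-controlled closing command active
    width_closed: float = 0.002   # width reached when nothing is grasped
    width_tol: float = 0.003      # stall margin treated as "object present"

def pseudo_tactile(state: GripperState) -> int:
    """Return a binary observation for the policy: 1 = object held, 0 = empty."""
    if state.closing and state.width > state.width_closed + state.width_tol:
        return 1
    return 0

if __name__ == "__main__":
    print(pseudo_tactile(GripperState(width=0.031, closing=True)))   # 1: holding an object
    print(pseudo_tactile(GripperState(width=0.001, closing=True)))   # 0: closed on air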
|
|
10:35-10:40, Paper ThAT6.2 | |
Zero-Shot Peg Insertion: Identifying Mating Holes and Estimating SE(2) Poses with Vision-Language Models |
|
Yajima, Masaru | Institute of Science Tokyo |
Ota, Kei | Mitsubishi Electric |
Kanezaki, Asako | Tokyo Institute of Technology |
Kawakami, Rei | Tokyo Institute of Technology |
Keywords: Deep Learning in Grasping and Manipulation, Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Achieving zero-shot peg insertion, in which an arbitrary peg is inserted into an unseen hole without task-specific training, remains a fundamental challenge in robotics. This task demands a highly generalizable perception system capable of detecting potential holes, selecting the correct mating hole from multiple candidates, estimating its precise pose, and executing insertion despite uncertainties. While learning-based methods have been applied to peg insertion, they often fail to generalize beyond the specific peg-hole pairs encountered during training. Recent advancements in Vision-Language Models (VLMs) offer a promising alternative, leveraging large-scale datasets to enable robust generalization across diverse tasks. Inspired by their success, we introduce a novel zero-shot peg insertion framework that utilizes a VLM to identify mating holes and estimate their poses without prior knowledge of their geometry. Extensive experiments demonstrate that our method achieves 90.2% accuracy, significantly outperforming baselines in identifying the correct mating hole across a wide range of previously unseen peg-hole pairs, including 3D-printed objects, toy puzzles, and industrial connectors. Furthermore, we validate the effectiveness of our approach in a real-world connector insertion task on the back panel of a PC, where our system successfully detects holes, identifies the correct mating hole, estimates its pose, and completes the insertion with a success rate of 88.3%. These results highlight the potential of VLM-driven zero-shot reasoning for enabling robust and generalizable robotic assembly.
|
|
10:40-10:45, Paper ThAT6.3 | |
PhyGrasp: Generalizing Robotic Grasping with Physics-Informed Large Multimodal Models |
|
Guo, Dingkun | Carnegie Mellon University |
Xiang, Yuqi | Carnegie Mellon University |
Zhao, Shuqi | University of California, Berkeley |
Zhu, Xinghao | University of California, Berkeley |
Tomizuka, Masayoshi | University of California |
Ding, Mingyu | University of North Carolina at Chapel Hill |
Zhan, Wei | University of California, Berkeley |
Keywords: Deep Learning in Grasping and Manipulation, Grasping, Data Sets for Robot Learning
Abstract: Robotic grasping, crucial for robot interaction with objects, still struggles with counter-intuitive or long-tailed scenarios like uncommon materials and shapes. Humans, however, intuitively adjust grasps with their physics-informed interpretations of the object, using visual and linguistic cues. This work introduces PhyGrasp, a large multimodal model and dataset that enhance robotic manipulation by combining natural language and 3D point clouds using a bridge module to integrate these inputs. The language modality exhibits robust reasoning capabilities concerning the impacts of diverse physical properties on grasping, while the 3D modality comprehends object shapes and parts. With these two capabilities, PhyGrasp is able to accurately assess the physical properties of object parts and determine optimal grasping poses. Additionally, the model’s language comprehension enables human instruction interpretation, generating grasping poses that align with human preferences. To train PhyGrasp, we construct a dataset, PhyPartNet, with 195K object instances with varying physical properties and human preferences, alongside their corresponding language descriptions. Extensive experiments conducted in simulation and on real robots demonstrate that PhyGrasp achieves state-of-the-art performance, particularly in long-tailed cases, e.g., about 10% improvement in success rate over GraspNet. More demos and information are available at https://sites.google.com/view/phygrasp
|
|
10:45-10:50, Paper ThAT6.4 | |
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation |
|
Xie, Senwei | Institute of Computing Technology, Chinese Academy of Sciences |
Wang, Hongyu | Institute of Computing Technology, Chinese Academy of Sciences |
Xiao, Zhanqi | Institute of Computing Technology, Chinese Academy of Sciences |
Wang, Ruiping | Institute of Computing Technology, Chinese Academy of Sciences |
Chen, Xilin | Institute of Computing Technology, Chinese Academy |
Keywords: AI-Based Methods, Deep Learning in Grasping and Manipulation, Task and Motion Planning
Abstract: Zero-shot generalization across various robots, tasks and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions and low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation through policy code in a zero-shot manner. To address the low efficiency and high cost of collecting runtime code data for robotic tasks, we devise Video2Code to synthesize executable code from extensive in-the-wild videos with an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses Code-as-Policies equipped with the state-of-the-art model GPT-4o by 11.6%. Furthermore, RoboPro is robust to variations in API formats and skill sets. Our website can be found at https://video2code.github.io/RoboPro-website/.
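As a generic illustration of policy code over an atomic skill library (in the spirit of the policy-code generation discussed above), the snippet below shows what generated code for a simple instruction might look like; the skill API and stubbed behaviors are assumptions, not RoboPro's actual library.

# Hypothetical example of generated policy code over an atomic skill library.
class SkillLibrary:
    def detect(self, name):                       # returns an object pose (stubbed)
        print(f"detect({name})"); return (0.4, 0.1, 0.02)

    def grasp(self, pose):
        print(f"grasp at {pose}"); return True

    def move_to(self, pose):
        print(f"move_to {pose}"); return True

    def place(self, pose):
        print(f"place at {pose}"); return True

def policy(robot: SkillLibrary):
    """Hypothetically generated from: 'put the mug next to the plate'."""
    mug = robot.detect("mug")
    plate = robot.detect("plate")
    target = (plate[0], plate[1] + 0.12, plate[2])   # offset beside the plate
    if robot.grasp(mug) and robot.move_to(target):
        robot.place(target)

if __name__ == "__main__":
    policy(SkillLibrary())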
|
|
10:50-10:55, Paper ThAT6.5 | |
VISO-Grasp: Vision-Language Informed Spatial Object-Centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility |
|
Shi, Yitian | Karlsruhe Institute of Technology |
Wen, Di | Karlsruhe Institute of Technology |
Chen, Guanqi | Karlsruhe Institute of Technology |
Welte, Edgar | Karlsruhe Institute of Technology (KIT) |
Liu, Sheng | Karlsruhe Institute of Technology |
Peng, Kunyu | Karlsruhe Institute of Technology |
Stiefelhagen, Rainer | Karlsruhe Institute of Technology |
Rayyes, Rania | Karlsruhe Institute of Technology (KIT) |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Grasping
Abstract: We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of 87.5% in target-oriented grasping with the fewest grasp attempts, outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and entire invisibility constraints. Code is available at: https://github.com/YitianShi/vMF-Contact
|
|
10:55-11:00, Paper ThAT6.6 | |
GraspMAS: Zero-Shot Language-Driven Grasp Detection with Multi-Agent System |
|
Nguyen, Quang | FPT Software AI Center, Hanoi University of Science and Technology |
Le, Tri | FPT Software AI Center |
Hoang Nguyen, Huy | Austrian Institute of Technology |
Vo, Thieu | National University of Singapore |
Ta, Tung D. | The University of Tokyo |
Huang, Baoru | Imperial College London |
Vu, Minh Nhat | TU Wien, Austria |
Nguyen, Anh | University of Liverpool |
Keywords: Deep Learning in Grasping and Manipulation, AI-Enabled Robotics
Abstract: Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a training or fine-tuning step to adapt to new domains, limiting their generalization in real-world applications. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios. Our framework consists of three specialized agents: Planner, responsible for strategizing complex queries; Coder, which generates and executes source code; and Observer, which evaluates the outcomes. Intensive experiments on two large-scale public datasets demonstrate that our GraspMAS significantly outperforms existing baselines. Additionally, robot experiments conducted in both simulation and real-world settings further validate the effectiveness of our approach.
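A minimal sketch of the Planner/Coder/Observer loop follows, with each agent stubbed by a placeholder instead of an LLM or code executor; the control flow is illustrative and the acceptance rule is a toy assumption.

# Sketch of a three-agent loop: plan, generate/execute code, then evaluate.
def planner(query: str, feedback: str = "") -> str:
    return f"plan for '{query}'" + (f" revised given: {feedback}" if feedback else "")

def coder(plan: str) -> str:
    return f"results of executing code for [{plan}]"   # would generate and run code

def observer(outcome: str) -> tuple:
    ok = "revised" in outcome                           # toy acceptance rule
    return ok, "" if ok else "grasp target still ambiguous"

def graspmas(query: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        outcome = coder(planner(query, feedback))
        ok, feedback = observer(outcome)
        if ok:
            return outcome
    return outcome

if __name__ == "__main__":
    print(graspmas("grasp the cup behind the cereal box"))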
|
|
11:00-11:05, Paper ThAT6.7 | |
Seeing through Uncertainty: Robot Pose Estimation Based on Imperfect Prior Kinematic Knowledge |
|
Klüpfel, Leonard | German Aerospace Center (DLR) |
Burkhard, Lukas | German Aerospace Center (DLR) |
Reichert, Anne Elisabeth | German Aerospace Center |
Durner, Maximilian | German Aerospace Center DLR |
Triebel, Rudolph | German Aerospace Center (DLR) |
Keywords: Deep Learning in Robotics and Automation, Computer Vision for Other Robotic Applications, Sensor Fusion, Visual Tracking
Abstract: We present PK-ROKED, a learning-based pipeline for probabilistic robot pose estimation relative to a camera, addressing inaccuracies in forward kinematics, particularly in systems with elastic and lightweight modules. Our approach integrates a probabilistic 2D keypoint detection mechanism that leverages prior knowledge derived from the robot’s imprecise kinematics. We further improve the detection accuracy and geometric understanding by incorporating segmentation of the robot arm. The method computes reliable uncertainty estimates, enabling a robust 2D-6D fusion for precise robot arm pose estimation from a single detected keypoint. PK-ROKED requires only synthetic training data, effectively exploits imperfect kinematics as valuable prior knowledge, and introduces a novel fusion framework for enhanced robot pose estimation. We validate our method on the Panda-Orb dataset, demonstrating competitive performance against state-of-the-art approaches. Additionally, we evaluate on two other robotic systems in real-world scenarios and show its practicality by using the predictions to initialize a tracking algorithm. Code and pre-trained models will be made available.
|
|
ThAT7 |
307 |
Human-Aware Motion Planning 1 |
Regular Session |
Co-Chair: Huang, Jian | Huazhong University of Science and Technology |
|
10:30-10:35, Paper ThAT7.1 | |
A Wearable Centaur Robot with Wheel-Legged Transformation for Enhanced Load-Carrying Assistance |
|
Li, Songhao | Huazhong University of Science and Technology |
Cao, Yu | University of Leeds |
Di, Zhiyuan | Huazhong University of Science and Technology |
Guo, Yifei | Huazhong University of Science & Technology |
Huang, Jian | Huazhong University of Science and Technology |
Keywords: Human Performance Augmentation, Mechanism Design, Wearable Robotics
Abstract: The execution of long-distance load-carrying tasks across multiple terrains remains a frequent requirement. These tasks often involve heavy loads, resulting in fatigue, decreased efficiency, and potential safety risks. To address this issue, this paper proposes a wearable centaur robot with wheel-legged transformation for human load-carrying assistance. The key feature of this robotic mechanism is the independent wheel-legged transformable structure, enabling transitions between the wheeled and legged modes. The wheeled mode ensures high load-carrying efficiency, while in the legged mode, the wheels are laid flat, transforming the ankle joint into a locked support surface that provides stable gait support. This design enables efficient and stable load carriage over complex terrains, all while preserving the natural gait of the user. Next, we develop a unified control framework for human-robot collaborative locomotion across different terrains, which includes velocity control based on an admittance model for the wheeled mode, gait control using a Bézier trajectory for the legged mode, and the transition between the two modes. The preliminary experiments include wheeled-mode, legged-mode, mode transition and obstacle crossing under human-robot collaborative locomotion, validating the proposed robot's adaptability to different terrains while assisting with human load carriage.
|
|
10:35-10:40, Paper ThAT7.2 | |
Underwater Exosuit Actuator Design for Unrestricted Bidirectional Hip Assistance During Flutter Kicking |
|
Wang, Xiangyang | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Du, Sida | Shenzhen Institute of Advanced Technology, CAS |
Ma, Yue | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
Sun, Jianquan | Shenzhen Institutes of Advanced Technology |
Hong, Yongxuan | Southern University of Science and Technology |
Zhang, Jiale | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Chen, Chunjie | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
Wu, Xinyu | CAS |
Keywords: Human Performance Augmentation, Prosthetics and Exoskeletons, Marine Robotics
Abstract: Underwater assistance is crucial for individuals who depend on diving for their livelihood. In this paper, we propose a novel underwater exosuit actuator designed to assist with flutter kicking during diving, thereby decreasing the effort the diver has to exert. The actuator can provide bidirectional assistance to the up and downbeats when the diver kicks underwater, and has no restriction on leg movements when it is deactivated. Both the benchtop experiment and human subject tests were conducted to verify its performance. The benchtop experiment verified its kinematic features, while tests with five participants validated its assistive performance. The results indicate that the actuator delivers a peak torque of 0.0947 Nm/kg and a peak force of 100 N in both directions, while allowing free leg movement during walking or kicking when not powered, thus ensuring safety during diving.
|
|
10:40-10:45, Paper ThAT7.3 | |
SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-Based Reinforcement Learning |
|
Ni, Hexian | University of Chinese Academy of Sciences |
Lu, Tao | Institute of Automation, Chinese Academy of Sciences |
Hu, Haoyuan | Institute of Automation, Chinese Academy of Sciences |
Cai, Yinghao | Institute of Automation, Chinese Academy of Sciences |
Wang, Shuo | Chinese Academy of Sciences |
Keywords: Reinforcement Learning, Human Factors and Human-in-the-Loop, Machine Learning for Robot Control
Abstract: Preference-based Reinforcement Learning (PbRL) methods avoid reward engineering by learning reward models based on human preferences. However, poor feedback and sample efficiency remain problems that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful and easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with the designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We design a Motion-Distinction-based Selection scheme (MDS), which selects segment pairs with apparent motion and different directions through kernel density estimation of states; such pairs are more task-related and easier for human preference labeling. (2) We propose a novel preference-guided exploration method (PGE), which encourages exploration towards states with high preference and low visitation and continuously guides the agent towards valuable samples. The synergy between the two mechanisms significantly accelerates the progress of reward and policy learning. Our experiments show that SENIOR outperforms five other existing methods in both human feedback efficiency and policy convergence speed on six complex robot manipulation tasks in simulation and four real-world tasks. Videos can be found on our project website: https://2025senior.github.io/
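To give a flavor of density-based query selection, the sketch below scores candidate segment pairs by a combination of low state density (novelty) and differing motion directions; the scoring rule, toy data, and dimensions are illustrative assumptions, not the exact MDS criterion of the paper.

# Hedged sketch of density-based query selection for preference labeling.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
visited = rng.normal(size=(500, 2))            # replay-buffer states (2-D toy example)
kde = gaussian_kde(visited.T)                  # density estimate over visited states

def segment_score(seg_a: np.ndarray, seg_b: np.ndarray) -> float:
    novelty = -np.mean(kde(np.vstack([seg_a, seg_b]).T))          # prefer low-density states
    dir_a = seg_a[-1] - seg_a[0]
    dir_b = seg_b[-1] - seg_b[0]
    cos = dir_a @ dir_b / (np.linalg.norm(dir_a) * np.linalg.norm(dir_b) + 1e-8)
    distinction = 1.0 - cos                                        # prefer differing directions
    return novelty + distinction

# pick the most informative pair among a few random candidate segments
segments = [rng.normal(loc=rng.uniform(-3, 3, 2), scale=0.2, size=(10, 2)) for _ in range(6)]
pairs = [(i, j) for i in range(len(segments)) for j in range(i + 1, len(segments))]
best = max(pairs, key=lambda p: segment_score(segments[p[0]], segments[p[1]]))
print("query the human about segment pair:", best)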
|
|
10:45-10:50, Paper ThAT7.4 | |
Haptic Feedback of Front Car Motion May Improve Driving Control |
|
Cheng, Xiaoxiao | The University of Manchester |
Geng, Xianzhe | Imperial College London |
Huang, Yanpei | University of Sussex |
Burdet, Etienne | Imperial College London |
Keywords: Human Performance Augmentation, Haptics and Haptic Interfaces, Human-Robot Collaboration
Abstract: This study investigates the role of haptic feedback in a car-following scenario, where information about the motion of the front vehicle is provided through a virtual elastic connection with it. Using a robotic interface in a simulated driving environment, we examined the impact of varying levels of such haptic feedback on the driver's ability to follow the road while avoiding obstacles. The results of an experiment with 15 subjects indicate that haptic feedback from the front car's motion can significantly improve driving control (i.e., reduce motion jerk and deviation from the road) and reduce mental load (evaluated via questionnaire). This suggests that haptic communication, as observed between physically interacting humans, can be leveraged to improve safety and efficiency in automated driving systems, warranting further testing in real driving scenarios.
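The virtual elastic connection can be sketched as a saturated spring-damper coupling between the ego and front vehicles' lateral positions, rendered as a steering torque; the gains and limits below are illustrative assumptions.

# Sketch of the virtual elastic connection: the haptic device renders a torque
# proportional to the lateral offset (and relative velocity) to the front car.
def haptic_torque(y_ego: float, y_front: float,
                  v_ego: float = 0.0, v_front: float = 0.0,
                  k: float = 2.0, b: float = 0.3, tau_max: float = 1.5) -> float:
    """Spring-damper coupling, saturated to the device's torque limit [Nm]."""
    tau = k * (y_front - y_ego) + b * (v_front - v_ego)
    return max(-tau_max, min(tau_max, tau))

if __name__ == "__main__":
    # the front car has drifted 0.3 m to the left: the wheel is gently pulled left
    print(haptic_torque(y_ego=0.0, y_front=0.3))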
|
|
10:50-10:55, Paper ThAT7.5 | |
Comparison of Solo and Collaborative Trimanual Operation of a Supernumerary Limb in Tasks with Varying Physical Coupling |
|
Eden, Jonathan | University of Melbourne |
Khoramshahi, Mahdi | Sorbonne University |
Huang, Yanpei | University of Sussex |
Poignant, Alexis | Sorbonne Université, ISIR UMR 7222 CNRS |
Burdet, Etienne | Imperial College London |
Jarrassé, Nathanael | Sorbonne Université, ISIR UMR 7222 CNRS |
Keywords: Human Performance Augmentation, Physical Human-Robot Interaction, Human-Robot Collaboration
Abstract: Through the use of robotic supernumerary limbs, it has been proposed that a single user could perform tasks like surgery or industrial assembly that currently require a team. Although validation studies, often conducted in virtual reality, have demonstrated that individuals can learn to command supernumerary limbs, comparisons typically suggest that a team initially outperforms a supernumerary limb operating individual. In this study, we examined (i) the impact of using a commercially available physical robot setup instead of a virtual reality system and (ii) the effect of differences between limb couplings on user performance during a series of trimanual operations. Contrary to previous findings, our results indicate no clear difference in user performance when working as a trimanual user, in the pick and place of three objects, compared to when working as a team. Additionally, for this task we observe that while users prefer working with a partner when they control the majority of the limbs, we find no clear difference in their preference between solo trimanual operation and when they work with a partner and control the third limb. These findings indicate that factors typically not present in virtual reality such as visual occlusion and haptic feedback may be vital to consider for the effective operation of supernumerary limbs, and provide initial evidence to support the viability of supernumerary limbs for a range of physical tasks.
|
|
10:55-11:00, Paper ThAT7.6 | |
A Wearable Scissored-Pair Control Moment Gyroscopes Utilized for Reactionless Support in Human Locomotion (I) |
|
Lin, Weiqi | Harbin Institute of Technology |
Dong, Wei | Harbin Institute of Technology |
Gao, Yongzhuo | Harbin Institute of Technology |
Chi, Yutian | Harbin Institute of Technology |
Shi, Yongjun | Harbin Institute of Technology |
Wu, Dongmei | Harbin Institute of Technology |
Dong, Hui | Harbin Institute of Technology |
Keywords: Human Performance Augmentation, Prosthetics and Exoskeletons, Mechanism Design
Abstract: Wearable assistive devices (WADs) show potential for future applications in industrial and medical fields. However, wearing discomfort, caused by factors such as additional weight penalties, inappropriate straps, and joint misalignment, may disrupt the natural balance of human motion, leading to increased metabolic expenditure, muscle fatigue, and even physical injury. These challenges severely restrict the practical application of WADs. To address these challenges, we have developed a reactionless, lightweight WAD that utilizes a single device attached to the shoe, eliminating the need for additional rigid links or straps on the shank or thigh. This design provides assistive torque directly to the target limbs during human locomotion. The proposed device weighs 0.67 kg and can provide an assistive torque of 12 Nm and mechanical power of 10 W. Furthermore, we introduced a human-in-the-loop multi-objective optimization approach considering the assistance effectiveness, power consumption, system size, service safety, and actuator constraints to achieve relative optimality in performance of the proposed device. Cardiopulmonary exercise test and surface electromyography test results from seven participants indicated that the device can reduce the average gross metabolic rate by 12.62%, and the muscle activation of the rectus femoris and soleus by 35.12% and 4.10%, respectively.
|
|
11:00-11:05, Paper ThAT7.7 | |
Disentangling Uncertainty for Safe Social Navigation Using Deep Reinforcement Learning |
|
Flögel, Daniel | FZI Research Center for Information Technology, Karlsruhe Institute of Technology |
Villafane, Marcos Gómez | University of Buenos Aires |
Ransiek, Joshua | FZI Research Center for Information Technology |
Hohmann, Sören | Institute of Control Systems, Karlsruhe Institute of Technology |
Keywords: Human-Aware Motion Planning, Safety in HRI, Reinforcement Learning
Abstract: Autonomous mobile robots are increasingly used in pedestrian-rich environments where safe navigation and appropriate human interaction are crucial. While Deep Reinforcement Learning (DRL) enables socially integrated robot behavior, challenges persist in indicating when and why the policy is uncertain in novel or perturbed scenarios. Unknown uncertainty in decision-making can lead to collisions or human discomfort and is one reason why safe and risk-aware navigation is still an open problem. This work introduces a novel approach that integrates aleatoric, epistemic, and predictive uncertainty estimation into a DRL navigation framework for policy distribution uncertainty estimates. We therefore incorporate Observation-Dependent Variance (ODV) and dropout into the Proximal Policy Optimization (PPO) algorithm. For different types of perturbations, we compare the ability of deep ensembles and Monte-Carlo Dropout (MC-Dropout) to estimate the uncertainties of the policy. In uncertain decision-making situations, we propose to change the robot's social behavior to conservative collision avoidance. The results show improved training performance with ODV and dropout in PPO and reveal that the training scenario has an impact on generalization. In addition, MC-Dropout is more sensitive to perturbations and better correlates the uncertainty type with the perturbation. With the safe action selection, the robot can navigate in perturbed environments with fewer collisions.
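A compact sketch of the two ingredients named above, assuming PyTorch: a Gaussian policy head whose variance depends on the observation (aleatoric) and repeated stochastic forward passes with dropout kept active (epistemic). Network sizes and the uncertainty decomposition shown are illustrative assumptions, not the paper's architecture.

# Sketch: observation-dependent variance head plus MC-Dropout uncertainty.
import torch
import torch.nn as nn

class ODVPolicy(nn.Module):
    def __init__(self, obs_dim=10, act_dim=2, hidden=64, p_drop=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Dropout(p_drop),                       # kept active for MC-Dropout
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)     # observation-dependent variance

    def forward(self, obs):
        h = self.body(obs)
        return self.mu(h), self.log_std(h).clamp(-5, 2).exp()

@torch.no_grad()
def mc_dropout_uncertainty(policy, obs, n_samples=20):
    policy.train()                                    # keep dropout stochastic at inference
    mus, stds = zip(*(policy(obs) for _ in range(n_samples)))
    mus, stds = torch.stack(mus), torch.stack(stds)
    epistemic = mus.var(dim=0)                        # spread of means across passes
    aleatoric = (stds ** 2).mean(dim=0)               # average predicted variance
    return epistemic, aleatoric

if __name__ == "__main__":
    policy = ODVPolicy()
    obs = torch.randn(1, 10)
    epi, ale = mc_dropout_uncertainty(policy, obs)
    print("epistemic:", epi, "aleatoric:", ale)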
|
|
ThAT8 |
308 |
Human-Robot Collaboration and Teaming 1 |
Regular Session |
Chair: Chen, Fei | T-Stone Robotics Institute, the Chinese University of Hong Kong |
Co-Chair: Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
|
10:30-10:35, Paper ThAT8.1 | |
LangGrasp: Leveraging Fine-Tuned LLMs for Language Interactive Robot Grasping with Ambiguous Instruction |
|
Lin, Yunhan | Wuhan University of Science and Technology |
Wu, Wenqi | Wuhan University of Science and Technology |
Zhang, Zhijie | Wuhan University of Science and Technology |
Min, Huasong | Robotics Institute of Beihang University of China |
Keywords: Human-Robot Collaboration, Grasping, Human-Centered Robotics
Abstract: The existing language-driven grasping methods struggle to fully handle ambiguous instructions containing implicit intents. To tackle this challenge, we propose LangGrasp, a novel language-interactive robotic grasping framework. The framework integrates fine-tuned large language models (LLMs) to leverage their robust commonsense understanding and environmental perception capabilities, thereby deducing implicit intents from linguistic instructions and clarifying task requirements along with target manipulation objects. Furthermore, our designed point cloud localization module, guided by 2D part segmentation, enables partial point cloud localization in scenes, thereby extending grasping operations from coarse-grained object-level to fine-grained part-level manipulation. Experimental results show that the LangGrasp framework accurately resolves implicit intents in ambiguous instructions, identifying critical operations and target information that are unstated yet essential for task completion. Additionally, it dynamically selects optimal grasping poses by integrating environmental information. This enables high-precision grasping from object-level to part-level manipulation, significantly enhancing the adaptability and task execution efficiency of robots in unstructured environments. More information and code are available here: https://github.com/wu467/LangGrasp.
|
|
10:35-10:40, Paper ThAT8.2 | |
Integrating Ergonomics and Manipulability for Upper Limb Postural Optimization in Bimanual Human-Robot Collaboration |
|
Li, Chenzui | The Chinese University of Hong Kong |
Chen, Yiming | The Chinese University of Hong Kong |
Wu, Xi | The Chinese University of Hong Kong |
Barresi, Giacinto | University of the West of England |
Chen, Fei | T-Stone Robotics Institute, the Chinese University of Hong Kong |
Keywords: Human-Robot Collaboration, Human Factors and Human-in-the-Loop, Optimization and Optimal Control
Abstract: This paper introduces an upper limb postural optimization method for enhancing physical ergonomics and force manipulability during bimanual human-robot co-carrying tasks. Existing research typically emphasizes human safety or manipulative efficiency, whereas our proposed method uniquely integrates both aspects to strengthen collaboration across diverse conditions (e.g., different grasping postures of humans, and different shapes of objects). Specifically, the joint angles of a simplified human skeleton model are optimized by minimizing the cost function to prioritize safety and manipulative capability. To guide humans towards the optimized posture, the reference end-effector poses of the robot are generated through a transformation module. A bimanual model predictive impedance controller (MPIC) is proposed for our human-like robot, CURI, to recalibrate the end effector poses through planned trajectories. The proposed method has been validated through various subjects and objects during human-human collaboration (HHC) and human-robot collaboration (HRC). The experimental results demonstrate significant improvement in muscle conditions by comparing the activation of target muscles before and after optimization.
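The cost-minimization idea can be illustrated on a planar two-link arm: an ergonomics term penalizes deviation from a neutral posture while a manipulability term rewards configurations with a large Yoshikawa measure. The arm model, weights, and cost terms are assumptions for illustration, not the paper's formulation.

# Illustrative posture optimization: ergonomics plus manipulability on a 2-link arm.
import numpy as np
from scipy.optimize import minimize

L1, L2 = 0.30, 0.25                      # upper-arm / forearm lengths [m] (assumed)
q_neutral = np.array([0.4, 1.2])         # assumed comfortable joint angles [rad]
w_ergo, w_manip = 1.0, 0.5               # assumed weights

def jacobian(q):
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def cost(q):
    ergonomics = np.sum((q - q_neutral) ** 2)                 # stay near the neutral posture
    J = jacobian(q)
    manipulability = np.sqrt(max(np.linalg.det(J @ J.T), 1e-9))   # Yoshikawa measure
    return w_ergo * ergonomics - w_manip * manipulability     # maximize manipulability

res = minimize(cost, x0=np.array([0.1, 0.5]), bounds=[(-0.5, 2.0), (0.0, 2.5)])
print("optimized joint angles [rad]:", res.x)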
|
|
10:40-10:45, Paper ThAT8.3 | |
Human-Robot Cooperative Heavy Payload Manipulation Based on Whole-Body Model Predictive Control |
|
Wang, Ning | Nanjing University of Aeronautics and Astronautics |
Liu, Shuo | Shanghai Jiao Tong University |
Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
Zhang, Tianwei | The University of Tokyo |
Keywords: Human-Robot Collaboration, Whole-Body Motion Planning and Control, Mobile Manipulation
Abstract: Human-robot collaborative manipulation with multiple mobile manipulators is crucial for expanding robotic applications, requiring precise handling of coupled force-position constraints between partners. Current systems, however, exhibit end-effector oscillations and instability during dynamic interactions. To overcome these limitations, this work develops a collaborative framework integrating a collaborative controller and a whole-body controller. The collaborative controller employs the object’s center-of-mass dynamics model with real-time contact forces and motion states to predict trajectories while coordinating with an attitude stabilization controller to adjust the desired end-effector poses. The whole-body controller utilizes model predictive control to generate coordinated motions that strictly follow pose commands from the collaborative controller, ensuring stable transportation. Simulation and physical experiments validate the proposed framework’s effectiveness in real-world scenarios.
|
|
10:45-10:50, Paper ThAT8.4 | |
DVRP-MHSI: Dynamic Visualization Research Platform for Multimodal Human-Swarm Interaction |
|
Zhu, Pengming | National University of Defense Technology |
Zeng, Zhiwen | National University of Defense Technology |
Yao, Weijia | Hunan University |
Dai, Wei | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Zhou, Zongtan | National University of Defense Technology |
Keywords: Human-Robot Collaboration, Multi-Modal Perception for HRI, Human-Centered Automation
Abstract: In recent years, there has been a significant amount of research on algorithms and control methods for distributed collaborative robots. However, the emergence of collective behavior in a swarm is still difficult to predict and control. Nevertheless, human interaction with the swarm helps render the swarm more predictable and controllable, as human operators can utilize intuition or knowledge that is not always available to the swarm. Therefore, this paper designs the Dynamic Visualization Research Platform for Multimodal Human-Swarm Interaction (DVRP-MHSI), which is an innovative open system that can perform real-time dynamic visualization and is specifically designed to accommodate a multitude of interaction modalities (such as brain-computer, eye-tracking, electromyographic, and touch-based interfaces), thereby expediting progress in human-swarm interaction research. Specifically, the platform consists of custom-made low-cost omnidirectional wheeled mobile robots, multitouch screens and two workstations. In particular, the multitouch screens can recognize human gestures and the shapes of objects placed on them, and they can also dynamically render diverse scenes. One of the workstations processes communication information among the robots and the other implements human-robot interaction methods. The development of DVRP-MHSI frees researchers from hardware or software details and allows them to focus on versatile swarm algorithms and human-swarm interaction methods without being limited to predefined and static scenarios, tasks, and interfaces. The effectiveness and potential of the platform for human-swarm interaction studies are validated by several demonstrative experiments.
|
|
10:50-10:55, Paper ThAT8.5 | |
Task-Oriented Adaptive Position/Force Control for Robotic Systems under Hybrid Constraints (I) |
|
Ding, Shuai | Zhengzhou University |
Peng, Jinzhu | Zhengzhou University |
Xin, Jianbin | Zhengzhou University |
Zhang, Hui | Hunan University |
Wang, Yaonan | Hunan University |
Keywords: Human-Robot Collaboration, Physical Human-Robot Interaction, Robot Safety
Abstract: By mapping the performances of the task requirement and the inherent physical characteristics of the robotic systems to the hybrid constraints, this paper proposes a task-oriented adaptive position/force control (TOAPFC) scheme for the robotic systems to ensure the execution of the predefined tasks and the safety of robotic manipulators and humans in the task workspace. In the proposed scheme, a reference trajectory generation strategy and admittance model are regarded as the outer-loop of TOAPFC to obtain and shape the robotic system's task trajectory that guarantees the safety of the interaction system. An admittance-based adaptive position/force control scheme unifying the position and force into a control law is used as the inner-loop of TOAPFC to track the shaped task trajectory, where a barrier Lyapunov function is utilized to constrain the tracking errors within permitted ranges. Moreover, the system uncertainties and lumped disturbances are compensated by the radial basis function neural network and robust compensator, respectively. Meanwhile, the stability of the proposed admittance-based adaptive position/force control scheme is analyzed by using the Lyapunov stability theory. Finally, the effectiveness of the proposed scheme is verified by experiments on a real robotic system.
|
|
10:55-11:00, Paper ThAT8.6 | |
Human-Aware Reactive Task Planning of Sequential Robotic Manipulation Tasks (I) |
|
Ma, Wanyu | The Hong Kong Polytechnic University |
Duan, Anqing | Mohamed Bin Zayed University of Artificial Intelligence |
Lee, Hoi-Yin | The Hong Kong Polytechnic University |
Zheng, Pai | The Hong Kong Polytechnic University |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Human-Robot Collaboration, Reactive and Sensor-Based Planning, Planning, Scheduling and Coordination
Abstract: The recent emergence of Industry 5.0 underscores the need for increased autonomy in human–robot interaction (HRI), presenting both motivation and challenges in achieving resilient and energy-efficient production systems. To address this, in this article, we introduce a strategy for seamless collaboration between humans and robots in manufacturing and maintenance tasks. Our method enables smooth switching between temporary HRI (human-aware mode) and long-horizon automated manufacturing (fully automatic mode), effectively solving the human–robot coexistence problem. We develop a task progress monitor that decomposes complex tasks into robot-centric action sequences, further divided into three-phase subtasks. A trigger signal orchestrates mode switches based on detected human actions and their contribution to the task. In addition, we introduce a human agent coefficient matrix, computed using selected environmental features, to determine cut-points for reactive execution by each robot. To validate our approach, we conducted extensive experiments involving robotic manipulators performing representative manufacturing tasks in collaboration with humans. The results show promise for advancing HRI, offering pathways to enhancing sustainability within Industry 5.0. Our work lays the foundation for intelligent manufacturing processes in future societies, marking a pivotal step toward realizing the full potential of human–robot collaboration.
|
|
11:00-11:05, Paper ThAT8.7 | |
Robotic Grinding Skills Learning Based on Geodesic Length Dynamic Motion Primitives (I) |
|
Ke, Shuai | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Li, Xiangfei | Huazhong University of Science and Technology |
Wei, Zhiao | School of Mechanical Science and Engineering, Huazhong University of Science and Technology |
Yin, Yecan | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Human-Robot Collaboration, Imitation Learning, Learning from Demonstration
Abstract: Learning grinding skills from human craftsmen by imitation learning has emerged as a prominent research topic in the field of robotic machining. Given their robust trajectory generalization ability and resilience to various external disturbances and environmental changes, Dynamical Movement Primitives (DMPs) provide a promising skill-learning solution for robotic grinding. However, challenges arise when directly applying DMPs to grinding tasks, including low orientation accuracy, inaccurate synchronization of position, orientation, and force, and the inability to generalize surface trajectories. To address these issues, this paper proposes a robotic grinding skills learning method based on geodesic length DMPs (Geo-DMPs). First, a normalized two-dimensional weighted Gaussian kernel function and an intrinsic mean clustering algorithm are proposed to extract surface geometric features from multiple demonstration trajectories. Then, an orientation manifold distance metric is introduced to exclude the time factor from classical orientation DMPs, thereby constructing Geo-DMPs for orientation learning and improving the accuracy of generated orientation trajectories. On this basis, a synchronization encoding framework for position, orientation, and force skills is established, using a phase function related to geodesic length. This framework enables the generation of robotic grinding actions between any two points on the surface. Finally, experiments on robotic chamfer grinding and free-form surface grinding demonstrate that the proposed method exhibits high geometric accuracy and good generalization capabilities in encoding and generating grinding skills. This method holds significant implications for learning and promoting robotic grinding skills. To the best of our knowledge, this may be the first attempt to use DMPs to generate grinding skills for position, orientation, and force on model-free surfaces, thereby presenting a novel approach to robotic grinding skills learning.
|
|
11:05-11:10, Paper ThAT8.8 | |
Adapting Robot's Explanation for Failures Based on Observed Human Behavior in Human-Robot Collaboration |
|
Naoum, Andreas | KTH Royal Institute of Technology |
Khanna, Parag | KTH Royal Institute of Technology |
Yadollahi, Elmira | Lancaster University |
Björkman, Mårten | KTH |
Smith, Claes Christian | KTH Royal Institute of Technology |
Keywords: Human-Robot Collaboration, Human Factors and Human-in-the-Loop, Gesture, Posture and Facial Expressions
Abstract: This work aims to interpret human behavior to anticipate potential user confusion when a robot provides explanations for failure, allowing the robot to adapt its explanations for more natural and efficient collaboration. Using a dataset [1] that included facial emotion detection, eye gaze estimation, and gestures from 55 participants in a user study [2], we analyzed how human behavior changed in response to different types of failures and varying explanation levels. Our goal is to assess whether human collaborators are ready to accept less detailed explanations without inducing confusion. We formulate a data-driven predictor to predict human confusion during robot failure explanations. We also propose and evaluate a mechanism, based on the predictor, to adapt the explanation level according to observed human behavior. The promising results from this evaluation indicate the potential of this research in adapting a robot’s explanations for failures to enhance the collaborative experience.
|
|
ThAT9 |
309 |
Object Detection, Segmentation and Categorization 5 |
Regular Session |
Co-Chair: Guo, Shuxiang | Southern University of Science and Technology |
|
10:30-10:35, Paper ThAT9.1 | |
HFSENet: Hierarchical Fusion Semantic Enhancement Network for RGB-T Semantic Segmentation in Annealing Furnace Operation Area |
|
Yuan, Haoyu | Beijing Institute of Technology |
Zhang, Lin | Beijing Institute of Technology |
Bao, Runjiao | Beijing Institute of Technology |
Si, Jinge | Beijing Institute of Technology |
Wang, Shoukun | Beijing Institute of Technology |
Niu, Tianwei | Beijing Institute of Technology |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Industrial Robots
Abstract: Regular temperature measurement of critical parts of an annealing furnace has always been a difficult task. Due to the harsh environment of high temperature, high noise, and darkness in the annealing furnace operation area, unmanned vehicles equipped with an RGB-T semantic segmentation model are usually adopted in most factories for inspection. However, existing RGB-T semantic segmentation models usually rely on good lighting or thermal conditions, which are generally difficult to fulfill in annealing furnace operation areas. In this paper, we propose a new hierarchical fusion-based semantic enhancement network, HFSENet. We first adopt a two-stream structure and a Siamese structure to extract the low-level and high-level features of each modality, respectively. Then, considering the differences between features at different hierarchical levels, we introduce a novel low-level feature spatial fusion module and a high-level feature channel fusion module to perform multi-modal hierarchical feature fusion. On this basis, we also propose a semantic feature complementary enhancement module, which utilizes the appearance information set and object information set extracted from the RGB and thermal infrared (TIR) branches to enhance the fused features and give them more semantic information. Finally, segmentation results with refined edges are obtained by an edge refinement decoder that includes a local search extraction module. The unmanned inspection vehicle we built with the proposed HFSENet has successfully passed testing, and its recognition performance on the four targets exceeds that of the current state-of-the-art (SOTA) method on our homemade annealing furnace operation area dataset.
|
|
10:35-10:40, Paper ThAT9.2 | |
HD-OOD3D: Supervised and Unsupervised Out-Of-Distribution Object Detection in LiDAR Data |
|
Soum-Fontez, Louis | Mines Paris - PSL |
Deschaud, Jean-Emmanuel | ARMINES |
Goulette, François | MINES ParisTech |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Autonomous Vehicle Navigation
Abstract: Autonomous systems rely on accurate 3D object detection from LiDAR data, yet most detectors are limited to a predefined set of known classes, making them vulnerable to unexpected out-of-distribution (OOD) objects. In this work, we present HD-OOD3D, a novel two-stage method for detecting unknown objects. We demonstrate the superiority of two-stage approaches over single-stage methods, achieving more robust detection of unknown objects. Furthermore, we conduct an in-depth analysis of the standard evaluation protocol for OOD detection, revealing the critical impact of hyperparameter choices. To address the challenge of scaling the learning of unknown objects, we explore unsupervised training strategies to generate pseudo-labels for unknowns. Among the different approaches evaluated, our experiments show that top-K auto-labelling offers more promising performance compared to simple resizing techniques.
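For readers unfamiliar with top-K auto-labelling, the toy Python sketch below shows the general idea of keeping the K most "unknown-looking" proposals per scene as pseudo-labels. The dictionary fields, scoring, and threshold are assumptions for illustration, not the paper's pipeline.

```python
def topk_pseudo_unknowns(proposals, k=5, known_iou_thresh=0.3):
    """Toy top-K auto-labelling: each proposal carries an 'unknown_score'
    (e.g., low maximum known-class confidence) and its maximum IoU with
    boxes already labelled as known classes; keep the top-K candidates
    that do not overlap known objects as pseudo-labels for the unknown class."""
    candidates = [p for p in proposals if p["max_iou_with_known"] < known_iou_thresh]
    candidates.sort(key=lambda p: p["unknown_score"], reverse=True)
    return candidates[:k]

# Hypothetical proposals from one LiDAR scene.
scene = [
    {"box": (1.0, 2.0, 0.5, 1.8, 0.6, 1.7), "unknown_score": 0.9, "max_iou_with_known": 0.0},
    {"box": (4.2, 0.3, 0.4, 2.0, 0.8, 1.5), "unknown_score": 0.4, "max_iou_with_known": 0.6},
]
print(topk_pseudo_unknowns(scene, k=1))
```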
|
|
10:40-10:45, Paper ThAT9.3 | |
Boosting Omnidirectional Stereo Matching with a Pre-Trained Depth Foundation Model |
|
Endres, Jannik | École Polytechnique Fédérale De Lausanne |
Hahn, Oliver | Technical University of Darmstadt |
Corbière, Charles | EPFL |
Schaub-Meyer, Simone | TU Darmstadt |
Roth, Stefan | TU Darmstadt |
Alahi, Alexandre | EPFL |
Keywords: Omnidirectional Vision, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method. More details, code, and model weights are available at https://vita-epfl.github.io/DFI-OmniStereo-website/.
|
|
10:45-10:50, Paper ThAT9.4 | |
Embodied Domain Adaptation for Object Detection |
|
Shi, Xiangyu | The University of Adelaide |
Qiao, Yanyuan | The University of Adelaide |
Liu, Lingqiao | University of Adelaide |
Dayoub, Feras | The University of Adelaide |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Transfer Learning
Abstract: Mobile robots rely on object detectors for perception and object localization in indoor environments. However, standard closed-set methods struggle to handle the diverse objects and dynamic conditions encountered in real homes and labs. Open-vocabulary object detection (OVOD), driven by Vision Language Models (VLMs), extends beyond fixed labels but still struggles with domain shifts in indoor environments. We introduce a Source-Free Domain Adaptation (SFDA) approach that adapts a pre-trained model without accessing source data. We refine pseudo labels via temporal clustering, employ multi-scale threshold fusion, and apply a Mean Teacher framework with contrastive learning. Our Embodied Domain Adaptation for Object Detection (EDAOD) benchmark evaluates adaptation under sequential changes in lighting, layout, and object diversity. Our experiments show significant gains in zero-shot detection performance and flexible adaptation to dynamic indoor conditions.
|
|
10:50-10:55, Paper ThAT9.5 | |
WFDA: Wavelet-Based Frequency Decomposition and Aggregation for Underwater Object Detection |
|
Liu, Xueting | Southern University of Science and Technology |
Chunying, Li | Southern University of Science and Technology |
Guo, Shuxiang | Southern University of Science and Technology |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Automation, Recognition
Abstract: Underwater Object Detection (UOD) techniques are critical for Autonomous Underwater Vehicles (AUVs), which must operate in harsh underwater environments characterized by low visibility while satisfying the lightweight and real-time constraints required for vehicle-mounted systems. Current methods typically rely on underwater image enhancement combined with object detection to adapt to underwater conditions. However, these approaches mainly focus on the spatial domain, often overlooking the frequency-domain characteristics of the underwater environment. This oversight limits the removal of noise factors, such as scattering, blurring, distortion, and uneven illumination, and diminishes the focus on object edges and textures. Additionally, the increased parameter size and higher computational cost render them less suitable for real-time detection. To address these issues, we propose the Wavelet-based Frequency Decomposition and Aggregation Network (WFDA), which leverages the Wavelet Transform (WT) to decompose features into high- and low-frequency components for effective feature modeling and fusion-based downsampling. Specifically, the Wavelet-Based Feature Decomposition Modeling (WDM) module utilizes multi-level wavelet decomposition to hierarchically model features across different frequency bands, while the Wavelet-Based Feature Aggregation Downsampling (WAD) module refines and extracts core features through single-level wavelet decomposition combined with channel aggregation. Evaluations on four public datasets demonstrate that WFDA achieves state-of-the-art (SOTA) performance and efficiency, making it well-suited for real-time, high-accuracy detection on robotic platforms. Code is available at https://github.com/Mariiiiooooo/WFDA.
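As background for the frequency split the abstract refers to, the NumPy sketch below performs a single-level 2D Haar decomposition of a feature map into one low-frequency and three high-frequency bands. It only illustrates the generic wavelet decomposition; the actual WDM/WAD modules are not reproduced here.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar decomposition of an (H, W) array with even sides.
    Returns the low-frequency band LL and the high-frequency bands (LH, HL, HH)."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2.0
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2.0
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2.0
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2.0
    return a, (h, v, d)

feat = np.random.rand(64, 64).astype(np.float32)
ll, (lh, hl, hh) = haar_dwt2(feat)
print(ll.shape, lh.shape)   # (32, 32) each: halved resolution per band
```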
|
|
11:00-11:05, Paper ThAT9.7 | |
Interactive Fine-Grained Few-Shot Detection of Tools |
|
Keller, Philip | FZI Forschungszentrum Informatik |
Strecker, Leon | Karlsruhe Institute of Technology |
Durchdewald, Felix | Karlsruhe Institute of Technology |
Graaf, Friedrich | FZI Research Center for Information Technology |
Schnell, Tristan | FZI Forschungszentrum Informatik |
Dillmann, Rüdiger | FZI - Forschungszentrum Informatik - Karlsruhe |
Keywords: Object Detection, Segmentation and Categorization, Incremental Learning, Human Factors and Human-in-the-Loop
Abstract: Few-shot object detection is especially interesting for applications with mobile robots and becomes even more challenging when task-related classes are very similar. This work focuses on such a scenario: detecting different types of household and industrial tools. Such tools can be rare and specific and are usually not covered by existing large datasets, except for common ones such as screwdrivers. Additionally, the target classes might change frequently depending on the robot’s missions. Therefore, we propose DE-fine-ViT, a fine-grained few-shot object detection model that does not require fine-tuning. We build our architecture on top of the elaborate DE-ViT model, extending it with specialized components to improve the fine-grained detection capabilities. The user can construct class and part prototypes tailored to the task in an interactive preparation phase. During inference, our proposed re-evaluation module leverages the multi-granularity of prototypes for fine-grained class differentiation. We evaluate our model in multiple realistic experiments, including a specifically created fine-grained dataset, demonstrating its efficacy and suitability for scenarios with little data and low inter-class variance.
|
|
11:05-11:10, Paper ThAT9.8 | |
UAV-DETR: Efficient End-To-End Object Detection for Unmanned Aerial Vehicle Imagery |
|
Zhang, Huaxiang | Fudan University |
Zhang, Hao | Fudan University |
Liu, Kai | Fudan University |
Gan, Zhongxue | Fudan University |
Zhu, Guo-Niu | Fudan University |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, AI-Based Methods
Abstract: Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused downsampling module is presented to retain critical spatial details during downsampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1% and AP50 by 4.2% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page is available at https://github.com/ValiantDiligent/UAV-DETR.
|
|
ThAT10 |
310 |
Visual-Inertial SLAM |
Regular Session |
Chair: Zhao, Liang | The University of Edinburgh |
|
10:30-10:35, Paper ThAT10.1 | |
DW-VIO: Deep Weighted Visual-Inertial Odometry |
|
Chen, Guyuan | Zhejiang University |
Guo, Xiyue | Zhejiang University |
Pan, Xiaokun | Zhejiang University |
Shen, Yujun | Ant Group |
Zhang, Guofeng | Zhejiang University |
Bao, Hujun | Zhejiang University |
Cui, Zhaopeng | Zhejiang University |
Keywords: Visual-Inertial SLAM, SLAM, Sensor Fusion
Abstract: Visual-inertial odometry (VIO) has made significant progress in various applications. However, one of the key challenges in VIO is the efficient and robust fusion of visual and inertial measurements, particularly while mitigating the impact of sensor failures. To address this challenge, we propose a new learning-based VIO system, i.e., DW-VIO, which is able to integrate multiple sensors and provide robust state estimations. To this end, we design a novel deep learning-based data-fusion approach that dynamically associates information from multiple sensors to predict sensor weights for optimization. Moreover, to improve efficiency, we present several real-time optimization techniques, including a fast patch graph constructor and an efficient GPU-accelerated multi-factor bundle adjustment layer. Experimental results show that DW-VIO outperforms most state-of-the-art (SOTA) methods on the EuRoC MAV, ETH3D-SLAM, and KITTI-360 benchmarks across various challenging sequences. Additionally, it maintains a minimum of 20 frames per second (FPS) on a single RTX 3060 GPU with high-resolution input, highlighting its efficiency.
|
|
10:35-10:40, Paper ThAT10.2 | |
PC-SRIF: Preconditioned Cholesky-Based Square Root Information Filter for Vision-Aided Inertial Navigation |
|
Ke, Tong | Google LLC |
Agrawal, Parth | University of California Los Angeles, Google |
Zhang, Yun | Google |
Zhen, Weikun | Carnegie Mellon University |
Guo, Chao | Google |
Sharp, Toby | Google |
DuToit, Ryan C. | Google |
Keywords: Visual-Inertial SLAM, SLAM
Abstract: In this paper, we introduce a novel estimator for vision-aided inertial navigation systems (VINS), the Preconditioned Cholesky-based Square Root Information Filter (PC-SRIF). When solving linear systems, employing Cholesky decomposition offers superior efficiency but can compromise numerical stability. Due to this, existing VINS utilizing (Square Root) Information Filters often opt for QR decomposition on platforms where single precision is preferred, avoiding the numerical challenges associated with Cholesky decomposition. While these issues are often attributed to the ill-conditioned information matrix in VINS, our analysis reveals that this is not an inherent property of VINS but rather a consequence of specific parameterizations. We identify several factors that contribute to an ill-conditioned information matrix and propose a preconditioning technique to mitigate these conditioning issues. Building on this analysis, we present PC-SRIF, which exhibits remarkable stability in performing Cholesky decomposition in single precision when solving linear systems in VINS. Consequently, PC-SRIF achieves superior theoretical efficiency compared to alternative estimators. To validate the efficiency advantages and numerical stability of PC-SRIF based VINS, we have conducted well controlled experiments, which provide empirical evidence in support of our theoretical findings. Remarkably, in our VINS implementation, PC-SRIF's runtime is 41% faster than QR-based SRIF.
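The paper's specific preconditioner is not given in this listing; the NumPy sketch below only illustrates the generic idea of Jacobi (diagonal) preconditioning to improve conditioning before a Cholesky solve, which is the family of techniques the abstract builds on.

```python
import numpy as np

def precond_cholesky_solve(A, b):
    """Solve A x = b for a symmetric positive-definite A using Jacobi
    (diagonal) preconditioning: scale the system so the matrix has a unit
    diagonal, run the Cholesky solve on the better-conditioned scaled
    system, then map the solution back to the original variables."""
    d = 1.0 / np.sqrt(np.diag(A))
    A_s = A * np.outer(d, d)          # D^{-1/2} A D^{-1/2}, unit diagonal
    L = np.linalg.cholesky(A_s)
    y = np.linalg.solve(L, d * b)     # forward substitution
    x_s = np.linalg.solve(L.T, y)     # backward substitution
    return d * x_s

A = np.array([[1.0e6, 2.0e3], [2.0e3, 8.01]])   # badly scaled SPD example
b = np.array([1.0, 2.0])
x = precond_cholesky_solve(A, b)
print(x, np.allclose(A @ x, b))       # True: matches the direct solution
```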
|
|
10:40-10:45, Paper ThAT10.3 | |
FastTrack: GPU-Accelerated Tracking for Visual SLAM |
|
Khabiri, Kimia | Simon Fraser University |
Hosseininejad, Parsa | Simon Fraser University |
Gopinath, Shishir | Simon Fraser University |
Dantu, Karthik | University of Buffalo |
Ko, Steve | Simon Fraser University |
Keywords: Visual-Inertial SLAM, Visual Tracking, Performance Evaluation and Benchmarking
Abstract: The tracking module of a visual-inertial SLAM system processes incoming image frames and IMU data to estimate the position of the frame in relation to the map. It is important for the tracking to complete in a timely manner for each frame to avoid poor localization or tracking loss. We therefore present a new approach which leverages GPU computing power to accelerate time-consuming components of tracking in order to improve its performance. These components include stereo feature matching and local map tracking. We implement our design inside the ORB-SLAM3 tracking process using CUDA. Our evaluation demonstrates an overall improvement in tracking performance of up to 2.8x on a desktop and Jetson Xavier NX board in stereo-inertial mode, using the well-known SLAM datasets EuRoC and TUM-VI.
|
|
10:45-10:50, Paper ThAT10.4 | |
CAMSCKF: A Multi-State Constraint Kalman Filter with Adaptive Multivariate Noise Parameters Clustering and Estimation for Visual-Inertial Odometry |
|
Tang, Yiyang | Harbin Engineering University |
Zhang, Hanxuan | Harbin Engineering University |
Yu, Yichen | Harbin Engineering University |
Li, Xiaofeng | Harbin Engineering University |
Huang, Yulong | Harbin Engineering University |
Keywords: Visual-Inertial SLAM, Vision-Based Navigation, Sensor Fusion
Abstract: Visual-Inertial Odometry has been widely deployed on autonomous robots traveling in open outdoor scenarios. However, the visual measurements are heavily influenced by observation distances, perspectives, lighting, and texture conditions, yielding distinct and time-varying noise distributions. Existing methods for handling time-varying noise in Visual-Inertial Odometry regard all measurement noise as identically distributed and are unable to effectively deal with the distinct noise in open outdoor scenarios, which degrades the localization accuracy. In this paper, a Multi-State Constraint Kalman Filter with Adaptive multivariate noise parameters Clustering and estimation for visual-inertial odometry (CAMSCKF) is proposed to address this issue; it separately tracks the measurement noise covariance matrix (MNCM) of different measurement clusters and adjusts the MNCM in real time. Firstly, the joint distribution of the state and the MNCM coefficients for each cluster is modeled as a Gaussian-Multivariate Generalized Inverse Gaussian distribution. Subsequently, an Expectation Maximization algorithm-based stepwise adaptive measurement clustering method is designed, which clusters measurements according to their corresponding innovations. Finally, an analytical update method for the joint posterior distribution without fixed-point iteration is implemented, achieving adaptive adjustment of the MNCM and thereby enabling accurate and robust Visual-Inertial Odometry localization. The superiority of the proposed method is demonstrated by simulations and dataset experiments, especially under aggressive motion. In experiments on the challenging outdoor dataset UZH-FPV, the proposed method improves the average position and attitude estimation accuracy by 35.69% and 32.88%, respectively, compared with the state-of-the-art ANGIG-KF.
|
|
10:50-10:55, Paper ThAT10.5 | |
GeoFlow-SLAM: A Robust Tightly-Coupled RGBD-Inertial and Legged Odometry Fusion SLAM for Dynamic Legged Robotics |
|
Xiao, Tingyang | Horizon Robotics |
Zhou, Xiaolin | Horizon Robotics |
Liu, Liu | Horizon Robotics |
Sui, Wei | Soochow University |
Feng, Wei | Horizon Robotics |
Qiu, Jiaxiong | UESTC |
Wang, Xinjie | Horizon Robotics |
Su, Zhizhong | Horizon Robotics |
Keywords: Visual-Inertial SLAM, Legged Robots, Sensor Fusion
Abstract: This paper presents GeoFlow-SLAM, a robust and effective tightly-coupled RGBD-inertial and legged odometry fusion SLAM for legged robots undergoing aggressive and high-frequency motions. By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges: feature-matching failures during fast locomotion, pose-initialization failures during fast locomotion, and visual feature scarcity in texture-less scenes. Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is first introduced to improve robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) performance on our collected legged-robot datasets and on open-source datasets. To further promote research and development, the datasets and code will be made publicly available at https://github.com/HorizonRobotics/GeoFlowSlam.
|
|
10:55-11:00, Paper ThAT10.6 | |
VINS-MLD2: Monocular Visual-Inertial SLAM with Multi-Level Detector and Descriptor |
|
Nian, Xiaohong | Central South University |
Cai, Qidong | Central South University |
Dai, Xunhua | Central South University |
Chen, Yong | Central South University |
Keywords: Visual-Inertial SLAM, SLAM, Deep Learning for Visual Perception
Abstract: The performance of a visual simultaneous localization and mapping (SLAM) system based on hand-crafted features degrades significantly in harsh environments due to unstable feature tracking. With the breakthrough of convolutional neural networks in deep feature extraction tasks, many researchers have tried to incorporate them into SLAM systems. However, it is challenging to guarantee the real-time performance of the entire SLAM system, and mismatched usage scenarios limit the advantages of deep feature extraction methods. To overcome these problems, we propose a visual-inertial SLAM system with a multi-level detector and descriptor, called VINS-MLD2. In our framework, we first design an efficient deep feature extraction network that matches the performance of R2D2 by concatenating multi-level features while running 3 times faster at the image resolutions commonly used in SLAM. Then, based on the camera baseline, we introduce Matching Fusion, a matching method that fuses deep descriptor matching and optical-flow matching results to improve matching accuracy for both short and wide baselines. In addition, an adaptive matching strategy is proposed to balance running time and accuracy by adaptively adjusting the matching method. Experimental results in unmanned aerial vehicle (UAV) deployments and real-world environments demonstrate that the proposed method tracks features more stably and accurately.
|
|
11:00-11:05, Paper ThAT10.7 | |
Observability Investigation for Rotational Calibration of (Global-Pose Aided) VIO under Straight Line Motion |
|
Song, Junlin | University of Luxembourg |
Richard, Antoine | University of Luxembourg |
Olivares-Mendez, Miguel A. | Interdisciplinary Centre for Security, Reliability and Trust - U |
Keywords: Visual-Inertial SLAM
Abstract: Online extrinsic calibration is crucial for building "power-on-and-go" moving platforms, like robots and AR devices. However, blindly performing online calibration for an unobservable parameter may lead to unpredictable results. In the literature, extensive studies have been conducted on the extrinsic calibration between IMU and camera, from theory to practice. It is well known that the observability of the extrinsic parameters can be guaranteed under sufficient motion excitation. Furthermore, the impacts of degenerate motions have also been investigated. Despite these successful analyses, we identify an issue with the existing observability conclusion. This paper focuses on the observability investigation for straight-line motion, which is a common and fundamental degenerate motion in applications. We analytically prove that pure translational straight-line motion can lead to the unobservability of the rotational extrinsic parameter between IMU and camera (in at least one degree of freedom). By correcting the existing observability conclusion, our novel theoretical finding disseminates a more precise principle to the research community and provides an explainable calibration guideline for practitioners. Our analysis is validated by rigorous theory and experiments.
|
|
11:05-11:10, Paper ThAT10.8 | |
PL-VIWO: A Lightweight and Robust Point-Line Monocular Visual Inertial Wheel Odometry |
|
Zhang, Zhixin | University of Manchester |
Bai, Wenzhi | University of Manchester |
Zhao, Liang | The University of Edinburgh |
Ladosz, Pawel | University of Manchester |
Keywords: Visual-Inertial SLAM, Sensor Fusion, Intelligent Transportation Systems
Abstract: This paper presents a novel tightly coupled Filter-based monocular visual inertial-wheel odometry (VIWO) system for ground robots, designed to deliver accurate and robust localization in long-term complex outdoor navigation scenarios. As an external sensor, the camera enhances localization performance by introducing visual constraints. However, obtaining a sufficient number of effective visual features is often challenging, particularly in dynamic or low texture environments. To address this issue, we incorporate the line features for additional geometric constraints. Unlike traditional approaches that treat point and line features independently, our method exploits the geometric relationships between points and lines in 2D images, enabling fast and robust line matching and triangulation. Additionally, we introduce Motion Consistency Check (MCC) to filter out potential dynamic points, ensuring the effectiveness of point feature updates. The proposed system was evaluated on publicly available datasets and benchmarked against state-of-the-art methods. Experimental results demonstrate superior performance in terms of accuracy, robustness, and efficiency. The source code is publicly available at: https://github.com/Happy-ZZX/PL-VIWO.
|
|
ThAT11 |
311A |
Reinforcement Learning 9 |
Regular Session |
Chair: Shi, Qing | Beijing Institute of Technology |
|
10:30-10:35, Paper ThAT11.1 | |
Deep Reinforcement Learning with Multiple Unrelated Rewards for AGV Mapless Navigation (I) |
|
Cai, Boliang | Cardiff University |
Wei, Changyun | Hohai University |
Ji, Ze | Cardiff University |
Keywords: Collision Avoidance, Autonomous Vehicle Navigation, Reinforcement Learning
Abstract: Mapless navigation for Automated Guided Vehicles (AGVs) via Deep Reinforcement Learning (DRL) algorithms has attracted rapidly growing attention in recent years. Collision avoidance with dynamic obstacles in unstructured environments, such as pedestrians and other vehicles, is one of the key challenges for mapless navigation. Autonomous navigation requires a policy that not only shortens the path distance towards the goal but also reduces the probability of collisions with obstacles. Typically, the reward for AGV navigation is calculated as a state-conditioned function that combines multiple reward terms for different purposes, such as encouraging the robot to move towards the goal or avoiding collisions. The combined reward, however, may lead to biased behaviours due to the empirically chosen weights, and dangerous situations can be misjudged. Therefore, this paper proposes a learning-based method with multiple unrelated rewards, each representing the evaluation of a different behaviour. The policy network, named Multi-Feature Policy Gradients (MFPG), is driven by two separate Q networks that are trained with two individual rewards, corresponding to goal distance shortening and collision avoidance, respectively. In addition, we propose an auto-tuning method, named Ada-MFPG, that allows the MFPG algorithm to automatically adjust the weights of the two separate policy gradients. For collision avoidance, we present a new social-norm-oriented continuous biased reward that encourages a specific social norm so as to reduce the probability of AGV collisions. By adding an offset gain to one of the reward functions, vehicles controlled by the proposed algorithm exhibit the predetermined behaviours. The work was tested in different simulation environments under multiple scenarios with a single robot or multiple robots. The proposed MFPG method is compared with standard Deep Deterministic Policy Gradient (DDPG), a modified DDPG, and SAC and TD3 with a social norm mechanism. MFPG significantly increases the success rate in robot navigation tasks compared with DDPG. Moreover, among all the benchmarked algorithms, the MFPG-based algorithms achieve the shortest task completion duration and lower variance than the baselines. The work has also been tested on real robots, and these experiments demonstrate the viability of the trained model in real-world scenarios. The learned model can be used for multi-robot mapless navigation in complex environments, such as warehouses, that require multi-robot cooperation.
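As a toy illustration of keeping two separate critics and combining their policy gradients with explicit weights, here is a minimal PyTorch sketch. The network sizes and fixed weights are placeholders; the Ada-MFPG auto-tuning and the social-norm reward are not reproduced.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_goal = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_avoid = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

obs = torch.randn(32, obs_dim)            # a batch of observations
act = actor(obs)
sa = torch.cat([obs, act], dim=-1)

# Two separate critics score the same action for two unrelated rewards
# (goal progress, collision avoidance); the actor loss combines both
# policy gradients with explicit weights, fixed here for illustration.
w_goal, w_avoid = 1.0, 0.5
actor_loss = -(w_goal * q_goal(sa).mean() + w_avoid * q_avoid(sa).mean())
opt.zero_grad()
actor_loss.backward()
opt.step()
```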
|
|
10:35-10:40, Paper ThAT11.2 | |
Robust Reinforcement Learning Based on Momentum Adversarial Training |
|
He, Li | Fudan University |
Liu, Hanchen | Fudan University |
Sheng, Junru | Fudan University |
Zhang, Lihua | Fudan University |
Dong, Zhiyan | Fudan University |
Keywords: AI-Enabled Robotics, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Reinforcement learning (RL) is a fundamental and pivotal algorithm in the advancement of Embodied Intelligence. The performance of RL directly influences the quality and efficiency of a robot's decision-making and execution during interactions with its environment. However, the robustness of RL remains a critical challenge that needs to be addressed. A promising approach to enhancing robustness is adversarial reinforcement learning. Existing methods primarily focus on perturbations in the state space, while perturbations in the action space have been relatively underexplored. Given that action space plays an equally crucial role as state space in Embodied Intelligence, action-space perturbations provide a more comprehensive evaluation of RL robustness. Therefore, investigating RL robustness under action-space perturbations is both necessary and valuable for the development of Embodied Intelligence. To this end, we propose an adversarial learning framework that employs momentum-based gradient descent to model perturbations in the action space, such as actuator disturbances. Furthermore, we introduce an improved optimization method that integrates historical gradient information into conventional Stochastic Gradient Descent (SGD). This approach enhances training stability and improves perturbation efficiency. The proposed method is evaluated through simulations in the MuJoCo environment and UAV control experiments in GymFC, demonstrating significant improvements in robustness and adaptability under action-space perturbations. Additionally, real-world UAV flight tests were conducted to further validate the effectiveness of the proposed framework. The results confirm that the Sim-to-Real transfer is successful, providing empirical evidence for the applicability of our method in real-world scenarios. This study establishes that enhancing RL robustness through action-space perturbations is both feasible and effective. More importantly, our findings contribute to the future development of Embodied Intelligence, particularly in improving its resilience to uncertainties and dynamic environments.
|
|
10:40-10:45, Paper ThAT11.3 | |
Learning Whole-Body Control for Small-Sized Quadruped Robots with a Flexible Spine |
|
Jiang, Dixuan | Beijing Institute of Technology |
Jia, Guanglu | Beijing Institute of Technology |
Dong, Changwen | Beijing Institute of Technology |
Su, Jiajun | Beijing Institute of Technology |
Yu, Zhiqiang | Beijing Institute of Technology |
Shi, Qing | Beijing Institute of Technology |
Keywords: Biologically-Inspired Robots, Legged Robots, Reinforcement Learning
Abstract: Improving the adaptability of small-sized quadruped robots has been a longstanding challenge in robotics. However, the weak whole-body coordination in existing small-sized quadruped robots limits their locomotion in many environments. In this work, we propose a teacher-student online learning framework for agile whole-body control of small-sized quadruped robots with a flexible spine. We first select a simple and effective gait pattern, the diagonal symmetrical sequence, using a dynamics model. Based on the reference motions provided by the gait pattern and combined with privileged information, we train a teacher policy to generate high-quality motion data. After setting the state space to match the actual robot's state space, we initialize the robot's initial state using the teacher data and train a student policy. Finally, we deploy the student policy on the SQuRo-Lite, a small-sized quadruped robot with a flexible spine, demonstrating that our approach can achieve stable yet dynamic locomotion for walking and turning. In the variable-spacing slalom experiment, the robot is able to flexibly adjust the motion patterns of its spine and legs based on commands, enabling dynamic changes in its turning radius. This further validates that our approach can achieve agile whole-body control for small-sized quadruped robots. This work helps broaden the application scenarios of small quadruped robots.
|
|
10:45-10:50, Paper ThAT11.4 | |
A Skill-Based Hierarchical Framework with Dangerous Action Masking for Autonomous Navigation of Jumping Robots |
|
Li, Gangyang | Beijing Institute of Technology |
Zhou, Qijie | Beijing Institute of Technology |
Xu, Yi | Beijing Institute of Technology |
Zhang, Weitao | Beijing Institute of Technology |
Shi, Qing | Beijing Institute of Technology |
Keywords: Biologically-Inspired Robots, Reinforcement Learning, Collision Avoidance
Abstract: Achieving autonomous navigation for biologically inspired jumping robots remains a long-standing challenge, due to the inherent instability of jumping motions and the limitations in onboard sensor capabilities. This paper proposes a skill-based hierarchical framework with dangerous action masking (SH-DAM) for autonomous navigation of jumping robots. The framework, based on hierarchical reinforcement learning, includes a low-level controller that learns locomotion skills (crawling, turning, and jumping) to overcome various obstacles. A high-level controller selects and coordinates these skills, while also incorporating curriculum learning to enhance the performance of navigation tasks. For safe navigation, we utilize dangerous action masking to suppress the probability of selecting jump motions in dangerous regions. We improved the locust-inspired jumping robot platform JumpBot-S by integrating a lightweight time-of-flight (ToF) sensor, and constructed a range of complex environments for experiments. Simulation results demonstrate that SH-DAM enables the robot to autonomously complete challenging navigation tasks. Compared to baseline algorithms, our method achieves a 12.57% increase in success rate, a 55.88% reduction in stuck rate, and a 57.89% reduction in rollover rate. Finally, we deployed our framework in real-world environments and conducted experiments in both normally lit and dimly lit conditions. This framework provides a new paradigm for jumping robot navigation in complex environments.
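The core of dangerous action masking can be illustrated with a few lines of PyTorch: suppress the jump action's logit before the softmax whenever the robot is in a dangerous region. The action indices and logits below are placeholders, not the paper's implementation.

```python
import torch

def masked_action_probs(logits, dangerous_region, jump_indices=(2,)):
    """Set the jump-action logits to -inf in dangerous regions so their
    selection probability collapses to zero after the softmax."""
    logits = logits.clone()
    if dangerous_region:
        logits[..., list(jump_indices)] = float("-inf")
    return torch.softmax(logits, dim=-1)

logits = torch.tensor([1.0, 0.5, 2.0])                        # crawl, turn, jump
print(masked_action_probs(logits, dangerous_region=True))     # jump prob -> 0
print(masked_action_probs(logits, dangerous_region=False))
```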
|
|
10:50-10:55, Paper ThAT11.5 | |
Risk-Aware Reinforcement Learning with Group Opinion for Autonomous Driving |
|
Zhao, Guanyi | City University of Hong Kong |
Xu, Meng | City University of Hong Kong |
Wen, Zihao | City University of Hong Kong |
Wang, Jianping | City University of Hong Kong |
Keywords: Autonomous Agents, Reinforcement Learning, Deep Learning Methods
Abstract: To avoid dangerous situations, such as collisions in dynamic environments, autonomous vehicles must predict the risks of the current scene to take safe actions. Traditional rule-based risk prediction methods and existing reinforcement learning (RL) approaches, which typically rely on manually designed driving decision rules or heuristic reward functions, often fail to capture the complexity of real-world dangerous scenarios, leading to suboptimal and unsafe driving decisions. To address this limitation, we develop a novel RL method, called Group Opinion Risk-Aware Reinforcement Learning (GORA-RL), for safer driving decisions that align with real-world conditions. Specifically, we first introduce surveys of human drivers to assess risk in real-world driving situations. Using these real group opinions as training data, we train a risk prediction model, referred to as the risk prediction model with a Transformer (RPT), that captures the crucial characteristics of these scenarios, resulting in more realistic and reliable risk predictions. This model is then integrated as a reward function to train an RL algorithm for making driving decisions in various scenarios. The experiments validate that our approach outperforms two state-of-the-art (SOTA) methods in challenging congested scenarios, such as merging and intersections, in terms of reward and several other metrics. Project site: https://github.com/naiyisiji/RPT.
|
|
10:55-11:00, Paper ThAT11.6 | |
Reward Training Wheels: Adaptive Auxiliary Rewards for Robotics Reinforcement Learning |
|
Wang, Linji | George Mason University |
Xu, Tong | George Mason University |
Lu, Yuanjie | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Autonomous Vehicle Navigation, Machine Learning for Robot Control, Reinforcement Learning
Abstract: Robotics Reinforcement Learning (RL) often relies on carefully engineered auxiliary rewards to supplement sparse primary learning objectives to compensate for the lack of large-scale, real-world, trial-and-error data. While these auxiliary rewards accelerate learning, they require significant engineering effort, may introduce human biases, and cannot adapt to the robot's evolving capabilities during training. In this paper, we introduce Reward Training Wheels (RTW), a teacher-student framework that automates auxiliary reward adaptation for robotics RL. To be specific, the RTW teacher dynamically adjusts auxiliary reward weights based on the student's evolving capabilities to determine which auxiliary reward aspects require more or less emphasis to improve the primary objective. We demonstrate RTW on two challenging robot tasks: navigation in highly constrained spaces and off-road vehicle mobility on vertically challenging terrain. In simulation, RTW outperforms expert-designed rewards by 2.35% in navigation success rate and improves off-road mobility performance by 122.62%, while achieving 35% and 3X faster training efficiency, respectively. Physical robot experiments further validate RTW's effectiveness, achieving a perfect success rate (5/5 trials vs. 2/5 for expert-designed rewards) and improving vehicle stability with up to 47.4% reduction in orientation angles.
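A toy sketch of a teacher that adjusts auxiliary reward weights based on whether the primary objective improved appears below. The actual RTW teacher is learned, so the simple update rule, reward names, and step size here are assumptions for illustration only.

```python
def update_aux_weights(weights, primary_return, prev_return, step=0.05):
    """Toy teacher update: if the primary objective improved, keep the current
    emphasis; otherwise shrink all auxiliary weights toward zero so the sparse
    primary reward dominates later in training."""
    factor = 1.0 if primary_return > prev_return else (1.0 - step)
    return {k: max(0.0, v * factor) for k, v in weights.items()}

def shaped_reward(primary, aux_terms, weights):
    """Primary reward plus weighted auxiliary terms (names are placeholders)."""
    return primary + sum(weights[k] * aux_terms[k] for k in aux_terms)

weights = {"progress": 0.5, "smoothness": 0.2}
weights = update_aux_weights(weights, primary_return=0.4, prev_return=0.6)
print(weights, shaped_reward(1.0, {"progress": 0.3, "smoothness": -0.1}, weights))
```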
|
|
11:00-11:05, Paper ThAT11.7 | |
Continuously Improved Reinforcement Learning for Automated Driving |
|
Yan, Xuerun | Tongji University |
Lian, Zhexi | Tongji University |
Hu, Jia | Tongji University |
Feng, Yongwei | Tongji University |
Song, Binyang | NTU |
Wang, Haoran | Tongji University |
Keywords: Autonomous Vehicle Navigation, Reinforcement Learning, Intelligent Transportation Systems
Abstract: Reinforcement Learning (RL) offers a promising solution to enable evolutionary automated driving. However, conventional RL methods often struggle with risk performance, as updated policies may fail to enhance performance or even lead to deterioration. To address this challenge, this research introduces a High Confidence Policy Improvement Reinforcement Learning-based (HCPI-RL) planner, designed to achieve the monotonic evolution of automated driving. The HCPI-RL planner features a novel RL policy update paradigm, ensuring that each newly learned policy outperforms previous policies and thereby achieving monotonic performance enhancement. Hence, the proposed HCPI-RL planner has the following features: i) Evolutionary automated driving with guaranteed monotonic performance enhancement; ii) Capability of handling emergency scenarios; iii) Enhanced decision-making optimality. Experimental results demonstrate that the proposed HCPI-RL planner enhances policy return by at least 20.1% and driving efficiency by at least 15.6%, compared to conventional RL-based planners.
|
|
11:05-11:10, Paper ThAT11.8 | |
Context-Aware Multi-Agent Trajectory Transformer |
|
Park, Jeongho | Seoul National University |
Oh, Songhwai | Seoul National University |
Keywords: AI-Based Methods, Multi-Robot Systems, Reinforcement Learning
Abstract: Transformer-based sequence models have proven effective in offline reinforcement learning for modeling agent trajectories using large-scale datasets. However, applying these models directly to multi-agent offline reinforcement learning introduces additional challenges, especially in managing complex inter-agent dynamics that arise as multiple agents interact with both their environment and each other. To overcome these issues, we propose the context-aware multi-agent trajectory transformer (COMAT), a novel model designed for offline multi-agent reinforcement learning tasks which predicts the future trajectory of each agent by incorporating the history of adjacent agents—referred to as context—into its sequence modeling. COMAT consists of three key modules: the transformer module to process input trajectories, the context encoder to extract relevant information from adjacent agents’ histories, and the context aggregator to integrate this information into the agent’s trajectory prediction process. Built upon these modules, COMAT predicts the agents’ future trajectories and actively leverages this capability as a tool for planning, enabling the search for optimal actions in multi-agent environments. We evaluate COMAT on multi-agent MuJoCo and StarCraft Multi-Agent Challenge tasks, on which it demonstrates superior performance compared to existing baselines.
|
|
ThAT12 |
311B |
Vision-Based Navigation 1 |
Regular Session |
|
10:30-10:35, Paper ThAT12.1 | |
Image-Goal Navigation Using Refined Feature Guidance and Scene Graph Enhancement |
|
Feng, Zhicheng | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Shi, Chenghao | NUDT |
Luo, Lun | Zhejiang University |
Chen, Zhichao | Jiangxi University of Science and Technology |
Liu, Yunhui | Chinese University of Hong Kong |
Lu, Huimin | National University of Defense Technology |
Keywords: Vision-Based Navigation, Deep Learning Methods
Abstract: In this paper, we introduce a novel image-goal navigation approach, named RFSG. Our focus lies in leveraging the fine-grained connections between goals, observations, and the environment within limited image data, all the while keeping the navigation architecture simple and lightweight. To this end, we propose the spatial-channel attention mechanism, enabling the network to learn the importance of multi-dimensional features to fuse the goal and observation features. In addition, a self-distillation mechanism is incorporated to further enhance the feature representation capabilities. Given that the navigation task needs surrounding environmental information for more efficient navigation, we propose an image scene graph to establish feature associations at both the image and object levels, effectively encoding the surrounding scene information. Cross-scene performance validation was conducted on the Gibson and HM3D datasets, and the proposed method achieved state-of-the-art results among mainstream methods, with a speed of up to 53.5 frames per second on an RTX3080. This contributes to the realization of end-to-end image-goal navigation in real-world scenarios. The implementation and model of our method have been released at: https://github.com/nubot-nudt/RFSG.
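For orientation, the PyTorch sketch below shows a generic spatial-channel attention block that fuses goal and observation feature maps; the layer sizes and exact fusion are assumptions, not RFSG's released implementation.

```python
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """Fuse goal and observation features with channel attention
    (squeeze-and-excite style) followed by spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(2 * channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, 2 * channels), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, goal_feat, obs_feat):
        x = torch.cat([goal_feat, obs_feat], dim=1)             # (B, 2C, H, W)
        w = self.channel_fc(x.mean(dim=(2, 3)))                  # channel weights
        x = x * w[:, :, None, None]
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)      # (B, 2, H, W)
        x = x * self.spatial_conv(s)                              # spatial weights
        return self.reduce(x)                                     # back to C channels

fusion = SpatialChannelFusion(channels=64)
out = fusion(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16))
print(out.shape)   # torch.Size([1, 64, 16, 16])
```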
|
|
10:35-10:40, Paper ThAT12.2 | |
Competency-Aware Planning for Probabilistically Safe Navigation under Perception Uncertainty |
|
Pohland, Sara | University of California, Berkeley |
Tomlin, Claire | UC Berkeley |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Probability and Statistical Methods
Abstract: Perception-based navigation systems are useful for unmanned ground vehicle (UGV) navigation in complex terrains, where traditional depth-based navigation schemes are insufficient. However, these data-driven methods are highly dependent on their training data and can fail in surprising and dramatic ways with little warning. To ensure the safety of the vehicle and the surrounding environment, it is imperative that the navigation system is able to recognize the predictive uncertainty of the perception model and respond safely and effectively in the face of uncertainty. In an effort to enable safe navigation under perception uncertainty, we develop a probabilistic and reconstruction-based competency estimation (PaRCE) method to estimate the model's level of familiarity with an input image as a whole and with specific regions in the image. We find that the overall competency score can accurately predict correctly classified, misclassified, and out-of-distribution (OOD) samples. We also confirm that the regional competency maps can accurately distinguish between familiar and unfamiliar regions across images. We then use this competency information to develop a planning and control scheme that enables effective navigation while maintaining a low probability of error. We find that the competency-aware scheme greatly reduces the number of collisions with unfamiliar obstacles, compared to a baseline controller with no competency awareness. Furthermore, the regional competency information is particularly valuable in enabling efficient navigation.
|
|
10:40-10:45, Paper ThAT12.3 | |
CVLN-Think: Causal Inference with Counterfactual Style Adaptation for Continuous Vision-And-Language Navigation |
|
Liu, Ruonan | Shanghai Jiao Tong University |
Wu, Shuai | Tianjin University |
Lin, Di | Tianjin University |
Zhang, Weidong | Shanghai JiaoTong University |
Keywords: Vision-Based Navigation, AI-Enabled Robotics, AI-Based Methods
Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) presents challenges due to environmental variations and domain shifts, making it difficult for agents to generalize beyond seen environments. Most existing methods rely on learning correlations between observations and actions from training data, which leads to spurious dependencies on environmental biases. To address this, we propose CVLN-Think (CVT), a novel navigation model that incorporates causal inference to enhance robustness and adaptability. Specifically, the Style Causal Adjuster (SCA) generates counterfactual style observations, enabling agents to learn invariant spatial structures rather than overfitting to dataset-specific visual patterns. Furthermore, the Thinking Cause Navigation Engine (TCNE) applies causal intervention to adjust navigation decisions by identifying and mitigating biases from prior experience. Unlike conventional approaches that passively learn from data distributions, our model actively thinks along the "observation-action" chain to make more reliable navigation predictions. Experimental results demonstrate that our approach achieves satisfactory performance on VLN-CE tasks. Further analysis indicates that our method possesses stronger generalization capabilities, highlighting the superiority of our proposed approach.
|
|
10:45-10:50, Paper ThAT12.4 | |
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation |
|
Xie, Yifan | Xi'an Jiaotong University |
Ou, Binkai | Boardware Information System Limited |
Ma, Fei | Guangdong Laboratory of Artificial Intelligence and Digital Econ |
Liu, Yaohua | Guangdong Institute of Intelligence Science and Technology |
Keywords: Vision-Based Navigation, Deep Learning Methods, Visual Learning
Abstract: Vision and Language Navigation (VLN) requires an agent to navigate through environments following natural language instructions. However, existing methods often struggle with effectively integrating visual observations and instruction details during navigation, leading to suboptimal path planning and limited success rates. In this paper, we propose OIKG (Observation-graph Interaction and Key-detail Guidance), a novel framework that addresses these limitations through two key components: (1) an observation-graph interaction module that decouples angular and visual information while strengthening edge representations in the navigation space, and (2) a key-detail guidance module that dynamically extracts and utilizes fine-grained location and object information from instructions. By enabling more precise cross-modal alignment and dynamic instruction interpretation, our approach significantly improves the agent's ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR datasets demonstrate that OIKG achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of our method in enhancing navigation precision through better observation-instruction alignment.
|
|
10:50-10:55, Paper ThAT12.5 | |
Socially-Aware Robot Navigation Enhanced by Bidirectional Natural Language Conversations Using Large Language Models |
|
Wen, Congcong | New York University Abu Dhabi |
Liu, Yifan | University of California, Los Angeles |
Bethala, Geeta Chandra Raju | New York University Abu Dhabi |
Yuan, Shuaihang | New York University |
Huang, Hao | New York University |
Hao, Yu | New York University |
Wang, Mengyu | Harvard University |
Liu, Yu-Shen | Tsinghua University |
Tzes, Anthony | New York University Abu Dhabi |
Fang, Yi | New York University |
Keywords: Vision-Based Navigation, Deep Learning Methods
Abstract: Robotic navigation plays a pivotal role in a wide range of real-world applications. While traditional navigation systems focus on efficiency and obstacle avoidance, their inability to model complex human behaviors in shared spaces has underscored the growing need for socially aware navigation. In this work, we explore a novel paradigm of socially aware robot navigation empowered by large language models (LLMs), and propose HSAC-LLM, a hybrid framework that seamlessly integrates deep reinforcement learning with the reasoning and communication capabilities of LLMs. Unlike prior approaches that passively predict pedestrian trajectories or issue pre-scripted alerts, HSAC-LLM enables bidirectional natural language interaction, allowing robots to proactively engage in dialogue with pedestrians to resolve potential conflicts and negotiate path decisions. Extensive evaluations across 2D simulations, Gazebo environments, and real-world deployments demonstrate that HSAC-LLM consistently outperforms state-of-the-art DRL baselines under our proposed socially aware navigation metric, which covers safety, efficiency, and human comfort. By bridging linguistic reasoning and interactive motion planning, our results highlight the potential of LLM-augmented agents for robust, adaptive, and human-aligned navigation in real-world settings. Project page: https://hsacllm.github.io/.
|
|
10:55-11:00, Paper ThAT12.6 | |
Vision-Language Navigation with Continual Learning for Unseen Environments |
|
Li, Zhiyuan | Institute of Automation, Chinese Academy of Sciences (CASIA) |
Lu, Yanfeng | Institute of Automation, Chinese Academy of Sciences |
Shang, Di | Institute of Automation,Chinese Academy of Sciences |
Tu, Ziqin | University of Chinese Academy of Sciences |
Qiao, Hong | Institute of Automation, Chinese Academy of Sciences |
Keywords: Vision-Based Navigation, Incremental Learning
Abstract: Vision-language navigation (VLN) is a pivotal area within embodied intelligence, where agents must navigate based on natural language instructions. While traditional VLN research has focused on enhancing environmental comprehension and decision-making policy, these methods often reveal substantial performance gaps when agents are deployed in novel environments. This issue primarily arises from the lack of diverse training data. Expanding datasets to encompass a broader range of environments is impractical and costly. To address this challenge, we propose Vision-Language Navigation with Continuous Learning (VLNCL), a framework that allows agents to learn from new environments while preserving previous knowledge incrementally. We introduce a novel dual-loop scenario replay method (Dual-SR) inspired by brain memory mechanisms integrated with VLN agents. This approach helps consolidate past experiences and improves generalization across novel tasks. As a result, the agent exhibits enhanced adaptability to new environments and mitigates catastrophic forgetting. Our experiment demonstrates that VLN agents with Dual-SR effectively resist forgetting and adapt to unfamiliar environments. Combining VLN with continual learning significantly boosts the performance of otherwise average models, achieving SOTA results.
|
|
11:00-11:05, Paper ThAT12.7 | |
Weakly-Supervised VLM-Guided Partial Contrastive Learning for Visual Language Navigation |
|
Wang, Ruoyu | University of New South Wales |
Yu, Tong | Adobe Research |
Wu, Junda | University of California San Diego |
Liu, Yao | Macquarie University |
McAuley, Julian | Australian National University/NICTA |
Yao, Lina | Csiro & Unsw |
Keywords: Vision-Based Navigation, Deep Learning Methods
Abstract: Visual Language Navigation (VLN) is a fundamental task within the field of Embodied AI, focusing on the ability of agents to navigate complex environments based on natural language instructions. Despite the progress made by existing methods, these methods often present some common challenges. First, they rely on pre-trained backbone models for visual perception, which struggle with the dynamic viewpoints in VLN scenarios. Second, the performance is limited when using pre-trained LLMs or VLMs without fine-tuning, due to the absence of VLN domain knowledge. Third, while fine-tuning LLMs and VLMs can improve results, their computational costs are higher than those without fine-tuning. To address these limitations, we propose Weakly-supervised Partial Contrastive Learning (WPCL), a method that enhances an agent's ability to identify objects from dynamic viewpoints in VLN scenarios by effectively integrating pre-trained VLM knowledge into the perception process, without requiring VLM fine-tuning. Our method enhances the agent's ability to interpret and respond to environmental cues while ensuring computational efficiency. Experimental results have shown that our method outperforms the baseline methods on multiple benchmarks, which validates the effectiveness, robustness, and generalizability of our method.
|
|
11:05-11:10, Paper ThAT12.8 | |
A Survey of Object Goal Navigation (I) |
|
Sun, Jingwen | Cardiff University |
Wu, Jing | Cardiff University |
Ji, Ze | Cardiff University |
Lai, Yu-Kun | Cardiff University |
Keywords: Vision-Based Navigation, Computer Vision for Automation
Abstract: Object Goal Navigation (ObjectNav) refers to an agent navigating to an object in an unseen environment, which is an ability often required in the accomplishment of complex tasks. Though it has drawn increasing attention from researchers in the Embodied AI community, there has not been a contemporary and comprehensive survey of ObjectNav. In this survey, we give an overview of this field by summarizing more than 70 recent papers. First, we give the preliminaries of the ObjectNav: the definition, the simulator, and the metrics. Then, we group the existing works into three categories: 1) end-to-end methods that directly map the observations to actions, 2) modular methods that consist of a mapping module, a policy module, and a path planning module, and 3) zero-shot methods that use zero-shot learning to do navigation. Finally, we summarize the performance of existing works and the main failure modes and discuss the challenges of ObjectNav. This survey would provide comprehensive information for researchers in this field to have a better understanding of ObjectNav.
|
|
ThAT13 |
311C |
Deep Learning for Visual Perception 9 |
Regular Session |
|
10:30-10:35, Paper ThAT13.1 | |
The Sampling-Gaussian for Stereo Matching |
|
Pan, Baiyu | University of Macau |
Yao, Bowen | Ubtech Corporation |
Jiao, Jichao | Beijing University of Posts and Telecommunications |
Pang, Jianxin | UBtech Robotics Corp |
Cheng, Jun | Shenzhen Institutes of Advanced Technology |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, AI-Based Methods
Abstract: The soft-argmax operation is widely adopted in neural network-based stereo matching methods to enable differentiable regression of disparity. However, networks trained with soft-argmax tend to predict multimodal probability distributions due to the absence of explicit constraints on the shape of the distribution. Previous methods leveraged Laplacian distributions and cross-entropy for training but failed to effectively improve accuracy and even increased the network's processing time. In this paper, we propose a novel method called Sampling-Gaussian as a substitute for soft-argmax. It improves accuracy without increasing inference time. We innovatively interpret the training process as minimizing the distance in vector space and propose a combined loss of L1 loss and cosine similarity loss. We leveraged the normalized discrete Gaussian distribution for supervision. Moreover, we identified two issues in previous methods and proposed extending the disparity range and employing bilinear interpolation as solutions. We have conducted comprehensive experiments to demonstrate the superior performance of our Sampling-Gaussian method. The experimental results prove that we have achieved better accuracy on five baseline methods across four datasets. Moreover, we have achieved significant improvements on small datasets and models with weaker generalization capabilities. Our method is easy to implement, and the code is available online.
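To make the recipe concrete, here is a minimal PyTorch sketch of soft-argmax disparity regression supervised with a normalized discrete Gaussian target and an L1 plus cosine-similarity loss, the general combination the abstract describes. The sigma, loss weighting, and disparity range are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def soft_argmax_disparity(cost_logits):
    """cost_logits: (B, D, H, W) matching scores over D disparity candidates.
    Returns the expected (sub-pixel) disparity per pixel and the probabilities."""
    prob = F.softmax(cost_logits, dim=1)
    disp_values = torch.arange(cost_logits.shape[1], dtype=prob.dtype).view(1, -1, 1, 1)
    return (prob * disp_values).sum(dim=1), prob

def discrete_gaussian_target(gt_disp, num_disp, sigma=1.0):
    """Normalized discrete Gaussian centered on the ground-truth disparity."""
    d = torch.arange(num_disp, dtype=gt_disp.dtype).view(1, -1, 1, 1)
    g = torch.exp(-0.5 * ((d - gt_disp.unsqueeze(1)) / sigma) ** 2)
    return g / g.sum(dim=1, keepdim=True)

def sampling_gaussian_loss(cost_logits, gt_disp, w_cos=1.0):
    pred_disp, prob = soft_argmax_disparity(cost_logits)
    target = discrete_gaussian_target(gt_disp, cost_logits.shape[1])
    l1 = F.l1_loss(pred_disp, gt_disp)
    cos = 1.0 - F.cosine_similarity(prob, target, dim=1).mean()
    return l1 + w_cos * cos

logits = torch.randn(2, 48, 8, 8)        # toy cost volume
gt = torch.rand(2, 8, 8) * 47            # toy ground-truth disparity
print(sampling_gaussian_loss(logits, gt))
```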
|
|
10:35-10:40, Paper ThAT13.2 | |
Complete Corruption-Aware Retinex Framework for Low-Light Image Enhancement |
|
Zhang, Yifei | Waseda University |
Sun, Honglin | Waseda University |
Gao, Yuyang | Waseda University |
Xie, Jianan | Waseda University |
Hashimoto, Kenji | Waseda University |
Keywords: Deep Learning for Visual Perception
Abstract: Retinex theory, which treats an image as a composition of illuminance and reflectance, has made significant progress in low-light image enhancement. Previous methods attempt to refine the impractical Retinex theory by introducing deviations in estimated illumination and reflectance to develop more practical and robust enhancement techniques. However, the fact that state-of-the-art approaches still produce inferior results suggests that some form of corruption may be overlooked. In this paper, we propose a novel Complete Corruption-Aware Retinex Framework (CCRF), which not only considers corruption in low-light imaging—such as high ISO or long exposure settings—but, more importantly, also accounts for corruption induced by the enhancement method itself. Guided by this framework, we propose a Robust Corruption-Aware Loss (RCL) that enables the model to be robust under extreme darkness and complex light-object interactions. Additionally, we propose a Light-Up Map Denoising (LMD) module, which further eliminates model-induced perturbations. With these two plug-and-play modules, downstream tasks (e.g., low-light object detection) can benefit significantly. Extensive experiments demonstrate that our methods can be seamlessly integrated into state-of-the-art approaches, resulting in significant performance improvements over these methods. Code will be available at github.com/eafi/ccrf
|
|
10:40-10:45, Paper ThAT13.3 | |
REOcc: Camera-Radar Fusion with Radar Feature Enrichment for 3D Occupancy Prediction |
|
Song, Chaehee | Korea Advanced Institute of Science and Technology |
Kim, Sanmin | Kookmin University |
Jeong, Hyeonjun | KAIST |
Shin, Juyeb | KAIST |
Lim, Joonhee | KAIST |
Kum, Dongsuk | KAIST |
Keywords: Deep Learning for Visual Perception, Semantic Scene Understanding, Sensor Fusion
Abstract: Vision-based 3D occupancy prediction has made significant advancements, but its reliance on cameras alone struggles in challenging environments. This limitation has driven the adoption of sensor fusion, among which camera-radar fusion stands out as a promising solution due to the complementary strengths of the two sensors. However, the sparsity and noise of the radar data limit its effectiveness, leading to suboptimal fusion performance. In this paper, we propose REOcc, a novel camera-radar fusion network designed to enrich radar feature representations for 3D occupancy prediction. Our approach introduces two main components, a Radar Densifier and a Radar Amplifier, which refine radar features by integrating spatial and contextual information, effectively enhancing spatial density and quality. Extensive experiments on the Occ3D-nuScenes benchmark demonstrate that REOcc achieves significant performance gains over the camera-only baseline model, particularly in dynamic object classes. These results underscore REOcc's capability to mitigate the sparsity and noise of the radar data. Consequently, radar complements camera data more effectively, unlocking the full potential of camera-radar fusion for robust and reliable 3D occupancy prediction.
|
|
10:45-10:50, Paper ThAT13.4 | |
RAG-6DPose: Retrieval-Augmented 6D Pose Estimation Via Leveraging CAD As Knowledge Base |
|
Wang, Kuanning | Fudan University |
Fu, Yuqian | INSAIT |
Wang, Tianyu | Fudan University |
Fu, Yanwei | Fudan University |
Liang, Longfei | Shanghai Neuhelium Neuromorphic Intelligence Tech. Co.Ltd |
Jiang, Yu-Gang | Fudan University |
Xue, Xiangyang | Fudan University |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Visual Learning
Abstract: Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. Our RAG-6DPose roughly contains three stages: 1) Building a Multi-Modal CAD Knowledge Base by extracting 2D visual features from multi-view CAD rendered images and also attaching 3D points; 2) Retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) Incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: https://sressers.github.io/RAG-6DPose. We will release codes and models upon acceptance.
|
|
10:50-10:55, Paper ThAT13.5 | |
TerraX: Visual Terrain Classification Enhanced by Vision-Language Models |
|
Li, Hongze | Peking University |
Huang, Xuchuan | Peking University |
Chang, Xinhai | Peking University |
Zhou, Jun | Peking University |
Zhao, Huijing | Peking University |
Keywords: Deep Learning for Visual Perception, Semantic Scene Understanding, Data Sets for Robotic Vision
Abstract: Visual Terrain Classification (VTC) plays a vital role in enabling unmanned ground vehicles to understand complex environments. Existing research relies on image-label pairs annotated by static label sets, where semantic ambiguity and high annotation costs constrain fine-grained terrain characterization. These limitations hinder the model’s adaptation to real-world terrain diversity and restrict its applicability. To address these issues, we propose TerraX, a vision-language learning framework that integrates multi-modal image-label-text data, unifying structured annotations with fine-grained natural language descriptions. The framework introduces a composite dataset TerraData, an evaluation benchmark suite TerraBench, and a CLIP-based visual terrain classification model TerraCLIP. TerraData aggregates multi-source terrain images from public and self-collected datasets, annotated through a VLM-based vision-language data annotation pipeline. TerraBench defines three evaluation benchmarks to systematically assess model robustness and adaptability in real-world terrain classification scenarios. Built on the CLIP model, TerraCLIP utilizes multi-granularity contrastive loss and LoRA fine-tuning to enhance understanding for terrain categories and attributes, and incorporates confidence-weighted inference for accurate predictions. Extensive experiments across benchmarks and real-world platforms demonstrate that our approach significantly enhances VTC performance, highlighting its potential for deployment in complex environments.
|
|
10:55-11:00, Paper ThAT13.6 | |
ROA-BEV: 2D Region-Oriented Attention for BEV-Based 3D Object Detection |
|
Chen, Jiwei | The Chinese University of Hong Kong, Shenzhen |
Sun, Yubao | Nanjing University of Information Science and Technology |
Ding, Laiyan | The Chinese University of Hong Kong, Shenzhen |
Huang, Rui | The Chinese University of Hong Kong, Shenzhen |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Vision-based BEV (Bird's-Eye-View) 3D object detection has recently become popular in autonomous driving. However, objects with a high similarity to the background from a camera perspective cannot be detected well by existing methods. In this paper, we propose a BEV-based 3D Object Detection Network with 2D Region-Oriented Attention (ROA-BEV), which enables the backbone to focus more on feature learning of the regions where objects exist. Moreover, our method further enhances the feature learning ability of ROA through multi-scale structures. Each block of ROA utilizes a large kernel to ensure that the receptive field is large enough to capture information about large objects. Experiments on nuScenes show that ROA-BEV improves performance over the BEVDepth baseline. The source code of this work will be available at https://github.com/DFLyan/ROA-BEV.
|
|
11:00-11:05, Paper ThAT13.7 | |
Uncertainty-Aware Knowledge Distillation for Compact and Efficient 6DoF Pose Estimation |
|
Ali Ousalah, Nassim | SnT, University of Luxembourg |
Kacem, Anis | University of Luxembourg |
Ghorbel, Enjie | CRISTAL, ENSI, University of Manouba |
Koumandakis, Emmanuel | Infinite Orbits SAS |
Aouada, Djamila | SnT, University of Luxembourg |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Computer Vision for Automation
Abstract: Compact and efficient 6DoF object pose estimation is crucial in applications such as robotics, augmented reality, and space autonomous navigation systems, where lightweight models are critical for real-time accurate performance. This paper introduces a novel uncertainty-aware end-to-end Knowledge Distillation (KD) framework focused on keypoint-based 6DoF pose estimation. Keypoints predicted by a large teacher model exhibit varying levels of uncertainty that can be exploited within the distillation process to enhance the accuracy of the student model while ensuring its compactness. To this end, we propose a distillation strategy that aligns the student and teacher predictions by adjusting the knowledge transfer based on the uncertainty associated with each teacher keypoint prediction. Additionally, the proposed KD leverages this uncertainty-aware alignment of keypoints to transfer the knowledge at key locations of their respective feature maps. Experiments on the widely-used LINEMOD benchmark demonstrate the effectiveness of our method, achieving superior 6DoF object pose estimation with lightweight models compared to state-of-the-art approaches. Further validation on the SPEED+ dataset for spacecraft pose estimation highlights the robustness of our approach under diverse 6DoF pose estimation scenarios.
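The uncertainty-aware weighting idea can be pictured with a minimal, generic sketch like the one below. It is an inverse-variance weighting of a keypoint distillation loss, not the paper's actual formulation; the tensor shapes, the weighting scheme, and the eps value are assumptions introduced for illustration.

    # Illustrative sketch only: uncertainty-weighted keypoint distillation loss.
    import torch

    def uncertainty_weighted_kd_loss(student_kpts, teacher_kpts, teacher_var, eps=1e-6):
        # student_kpts, teacher_kpts: (B, K, 2) predicted 2D keypoints
        # teacher_var: (B, K) per-keypoint predictive variance from the teacher
        weights = 1.0 / (teacher_var + eps)                    # confident keypoints weigh more
        weights = weights / weights.sum(dim=1, keepdim=True)   # normalize per sample
        per_kpt = ((student_kpts - teacher_kpts) ** 2).sum(dim=-1)  # squared error per keypoint
        return (weights * per_kpt).sum(dim=1).mean()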
|
|
11:05-11:10, Paper ThAT13.8 | |
SA-MVSNet: Spatial-Aware Multi-View Stereo Network with Attention Cost Volume |
|
Kong, Haoran | Hunan University |
Zeng, Fanzi | Hunan University |
Dai, Longbao | Hunan University |
Hu, Jingyang | University of Science and Technology of China |
Cai, Jianghao | Hunan University |
Chen, Jianxia | Hunan University |
Li, Ruihui | Hunan University |
Jiang, Hongbo | Hunan University |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Semantic Scene Understanding
Abstract: Deep learning-based multi-view stereo (MVS) methods enable dense point cloud reconstruction in texture-rich areas. However, existing methods incur significant computational costs to capture pixel dependencies for complete reconstruction in low-texture regions. Additionally, discrete depth layers in occluded environments hinder the cost volume's ability to model object information effectively. To address these issues, we propose a spatial-aware multi-view stereo network with attention cost volume, termed SA-MVSNet. The network introduces the pixel-driven spatial interaction (PDSI) module, which integrates the hierarchical spatial location enhancement mechanism (HSLE) and the spatial context aggregation mechanism (SCA). Leveraging an efficient parallel architecture, the PDSI module captures pixel-level spatial dependencies with the HSLE and strengthens global contextual information through the SCA. This design improves the network's ability to represent features in low-texture regions while maintaining high inference efficiency. Furthermore, SA-MVSNet incorporates an attention weight generation branch that refines the cost volume by aggregating multi-scale depth cues, effectively mitigating the impact of occlusion. Experiments on the DTU dataset and the Tanks and Temples dataset show that our method outperforms other learning-based methods, achieving superior performance and strong generalization ability.
|
|
ThAT14 |
311D |
Deep Learning Methods 6 |
Regular Session |
|
10:30-10:35, Paper ThAT14.1 | |
Mr. Virgil: Learning Multi-Robot Visual-Range Relative Localization |
|
Wang, Si | Zhejiang University |
Li, Zhehan | Zhejiang University |
Lu, Jiadong | Zhejiang University, Huzhou Institute of Zhejiang University |
Xiong, Rong | Zhejiang University |
Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Localization, Deep Learning Methods, Multi-Robot Systems
Abstract: Ultra-wideband (UWB)-vision fusion localization has achieved extensive applications in the domain of multi-agent relative localization. The challenging matching problem between robots and visual detections renders existing methods highly dependent on identity-encoded hardware or delicate tuning algorithms. Overconfident yet erroneous matches may bring about irreversible damage to the localization system. To address this issue, we introduce Mr. Virgil, an end-to-end learning multi-robot visual-range relative localization framework, consisting of a graph neural network for data association between UWB rangings and visual detections, and a differentiable pose graph optimization (PGO) back-end. The graph-based front-end supplies robust matching results, accurate initial position predictions, and credible uncertainty estimates, which are subsequently integrated into the PGO back-end to elevate the accuracy of the final pose estimation. Additionally, a decentralized system is implemented for real-world applications. Experiments spanning varying numbers of robots, simulated and real-world settings, and occluded and non-occluded conditions demonstrate the stability and accuracy of our approach across various scenes compared to conventional methods. Our code is available at: https://github.com/HiOnes/Mr-Virgil.
|
|
10:35-10:40, Paper ThAT14.2 | |
Interpretable Interaction Modeling for Trajectory Prediction Via Agent Selection and Physical Coefficient |
|
Huang, Shiji | Zhejiang University of Technology |
Ye, Lei | Zhejiang University of Technology |
Chen, Min | Zhejiang University of Technology |
Luo, Wenhai | Zhejiang University of Technology |
Wang, Dihong | Zhejiang University of Technology |
Xu, Chenqi | Zhejiang University of Technology |
Liang, Deyuan | Zhejiang University of Technology |
Keywords: Deep Learning Methods, Motion and Path Planning, Intelligent Transportation Systems
Abstract: A thorough understanding of the interaction between the target agent and surrounding agents is a prerequisite for accurate trajectory prediction. Although many methods have been explored, they assign correlation coefficients to surrounding agents in a purely learning-based manner. In this study, we present ASPILin, which manually selects interacting agents and replaces the attention scores in Transformer with a newly computed physical correlation coefficient, enhancing the interpretability of interaction modeling. Surprisingly, these simple modifications can significantly improve prediction performance and substantially reduce computational costs. We intentionally simplified our model in other aspects, such as map encoding. Remarkably, experiments conducted on the INTERACTION, highD, and CitySim datasets demonstrate that our method is efficient and straightforward, outperforming other state-of-the-art methods.
|
|
10:40-10:45, Paper ThAT14.3 | |
NaviFormer: A Deep Reinforcement Learning Transformer-Like Model to Holistically Solve the Navigation Problem |
|
Fuertes, Daniel | Universidad Politécnica De Madrid |
Cavallaro, Andrea | Idiap, EPFL |
del-Blanco, Carlos R. | Universidad Politécnica De Madrid |
Jaureguizar, Fernando | Universidad Politécnica De Madrid |
García, Narciso | Universidad Politécnica De Madrid |
Keywords: Deep Learning Methods, Motion and Path Planning, Reinforcement Learning
Abstract: Path planning is usually solved by addressing either the (high-level) route planning problem (waypoint sequencing to achieve the final goal) or the (low-level) path planning problem (trajectory prediction between two waypoints avoiding collisions). However, real-world problems usually require simultaneous solutions to the route and path planning subproblems with a holistic and efficient approach. In this paper, we introduce NaviFormer, a deep reinforcement learning model based on a Transformer architecture that solves the global navigation problem by predicting both high-level routes and low-level trajectories. To evaluate NaviFormer, several experiments have been conducted, including comparisons with other algorithms. Results show competitive accuracy from NaviFormer since it can understand the constraints and difficulties of each subproblem and act consequently to improve performance. Moreover, its superior computation speed proves its suitability for real-time missions.
|
|
10:45-10:50, Paper ThAT14.4 | |
Focusing on Projection-Stable Patch: Cross-View Localization with Geometric-Semantic Alignment |
|
Qin, Riyu | Nanjing University of Science and Technology |
Liu, Zhengyu | Nanjing University of Science and Technology |
Wang, Kaiyang | Nanjing University of Science and Technology |
Yuan, Xia | Nanjing University of Science and Technology |
Keywords: Localization, Deep Learning Methods, Visual Learning
Abstract: This paper presents a novel feature alignment strategy for cross-view geo-localization to address the large viewpoint differences between ground and satellite views. Existing methods for cross-view geo-localization often overlook factors such as occlusion and distortion errors caused by viewpoint transformation. These issues lead to reduced accuracy in complex scenes. To address these issues, we propose a framework comprising two novel components: a perspective-driven attention fusion (PDAF) module that aligns ground and satellite features through cross-view semantic correlation, effectively preserving structural consistency during view transformation; and a projection-stable patch-guided pose optimizer (PSPG) that enhances geometric reliability by selectively focusing on projection-stable patches to refine pose estimation. The PDAF module mitigates information loss through attention fusion between ground and bird's-eye-view (BEV) feature map representations, while the PSPG refines pose estimation by dynamically suppressing unstable features through geometrically unstable token merging. Comprehensive evaluations on KITTI and Ford Multi-AV datasets demonstrate our method's superiority in orientation estimation and competitive location accuracy compared to state-of-the-art approaches. Qualitative results further confirm the framework's robustness in complex localization scenarios.
|
|
10:50-10:55, Paper ThAT14.5 | |
Legged Robot State Estimation Using Invariant Neural-Augmented Kalman Filter with a Neural Compensator |
|
Lee, Seokju | KAIST (Korea Advanced Institute of Science and Technology) |
Kim, Hyun-Bin | KAIST |
Kim, Kyung-Soo | KAIST(Korea Advanced Institute of Science and Technology) |
Keywords: Legged Robots, Localization, Deep Learning Methods
Abstract: This paper presents an algorithm to improve state estimation for legged robots. Among existing model-based state estimation methods for legged robots, the contact-aided invariant extended Kalman filter defines the state on a Lie group to preserve invariance, thereby significantly accelerating convergence. It achieves more accurate state estimation by leveraging contact information as measurements for the update step. However, when the model exhibits strong nonlinearity, the estimation accuracy decreases. Such nonlinearities can cause initial errors to accumulate and lead to large drifts over time. To address this issue, we propose compensating for errors by augmenting the Kalman filter with an artificial neural network serving as a nonlinear function approximator. Furthermore, we design this neural network to respect the Lie group structure to ensure invariance, resulting in our proposed Invariant Neural-Augmented Kalman Filter (InNKF). The proposed algorithm offers improved state estimation performance by combining the strengths of model-based and learning-based approaches. Project webpage: https://seokju-lee.github.io/innkf_webpage
|
|
10:55-11:00, Paper ThAT14.6 | |
Absolute Localization through Vision Transformer Matching of Planetary Surface Perspective Imagery from a Digital Twin |
|
Ludivig, Philippe | University of Luxembourg |
Wu, Benjamin | National Astronomical Observatory of Japan |
Zurad, Maciej Marcin | University of Luxembourg |
Keywords: Localization, Deep Learning Methods, Vision-Based Navigation
Abstract: We present a novel machine learning framework and synthetic dataset for performing absolute localization on planetary surfaces where satellite navigation systems are unavailable. Current approaches involve manual surface-to-satellite image matching by human rover operators, limiting the rate of planetary exploration and scientific utilization. Our framework leverages deep neural networks to perform image similarity matching between a rover’s onboard cameras and corresponding ground-view images from a digital twin environment created from extracted satellite and elevation maps. The rover views, satellite, and elevation maps are taken from a photorealistic lunar environment simulated in a 3D graphics engine (Unreal Engine 4). The synthetic ground-view re-projections are generated using an open-source 3D graphics software (Blender). In total, we generate a dataset of 1.68 million images at 210,000 locations. The images and corresponding metadata are then used to train a DINOv2 vision transformer image similarity model through supervised fine-tuning to determine matching locations between the rover views and candidate re-projections. Through this method, our model is able to determine the ground truth location within 5 m using just 2.5% of the search space, outperforming other deep learning and classical image comparison benchmarks.
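As a rough illustration of the retrieval step (not the authors' pipeline, which additionally fine-tunes the model on the synthetic dataset), the sketch below ranks candidate ground-view re-projections against a rover image by cosine similarity of embeddings from the publicly released DINOv2 backbone; the preprocessing and input sizes are assumptions.

    # Illustrative sketch only: embedding-similarity matching with a DINOv2 backbone.
    import torch
    import torch.nn.functional as F

    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
    model.eval()

    @torch.no_grad()
    def best_match(rover_img, candidate_imgs):
        # rover_img: (1, 3, 224, 224); candidate_imgs: (N, 3, 224, 224), ImageNet-normalized
        query = F.normalize(model(rover_img), dim=-1)        # (1, D) embedding
        cands = F.normalize(model(candidate_imgs), dim=-1)   # (N, D) embeddings
        sims = cands @ query.T                               # cosine similarities, (N, 1)
        return sims.argmax().item(), sims.squeeze(1)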
|
|
11:00-11:05, Paper ThAT14.7 | |
AGCNet: Improving Inertial Odometry Via IMU Accelerometer and Gyroscope Online Compensation |
|
Min, Hongyuan | Xi'an Jiao Tong University |
Ding, Ning | Xi’an Jiaotong University |
Wan, Mingyang | Douyin |
Ma, Guojun | Douyin |
Jiang, Caigui | Xi'an Jiaotong University |
Keywords: Deep Learning Methods, Localization, Sensor Fusion
Abstract: This paper presents a learning-based online IMU compensation method (AGCNet) that can compensate for run time errors of the accelerometer and gyroscope to improve inertial odometry. AGCNet employs U-Net architecture with hybrid dilated convolutions to extract multiscale features. It also adopts skip connections and patch-based processing strategy to aggregate local and global information. The network is trained to minimize absolute errors between integration results derived from compensated IMU data and ground truth motion states. The network utilizes IMU measurements from the current time window to correct errors in the subsequent time window, enabling sparser computations. Experiments on two public visual inertial datasets show that AGCNet can accurately estimate the orientation from IMU measurements, outperforming existing learning-based methods. When applied to Open-VINS, AGCNet improves the accuracy of orientation estimation by an average of 29.8% and position estimation by an average of 37.3%.
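To make the phrase "integration results derived from compensated IMU data" concrete, here is a minimal, generic sketch (not the authors' code) of propagating an orientation quaternion from compensated gyroscope measurements; the state representation and step handling are assumptions.

    # Illustrative sketch only: quaternion integration of (compensated) gyro rates.
    import numpy as np

    def quat_multiply(q, r):
        w0, x0, y0, z0 = q
        w1, x1, y1, z1 = r
        return np.array([
            w0*w1 - x0*x1 - y0*y1 - z0*z1,
            w0*x1 + x0*w1 + y0*z1 - z0*y1,
            w0*y1 - x0*z1 + y0*w1 + z0*x1,
            w0*z1 + x0*y1 - y0*x1 + z0*w1,
        ])

    def integrate_gyro(quat, omega, dt):
        # quat: current orientation (w, x, y, z); omega: compensated angular rate (rad/s)
        rate = np.linalg.norm(omega)
        if rate * dt < 1e-12:
            return quat
        axis = omega / rate
        angle = rate * dt
        dq = np.concatenate([[np.cos(angle / 2)], np.sin(angle / 2) * axis])
        q = quat_multiply(quat, dq)
        return q / np.linalg.norm(q)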
|
|
11:05-11:10, Paper ThAT14.8 | |
UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation |
|
Shen, Yehui | NorthEast University |
Zhang, Lei | Shenyang SIASUN Robot & Automation Co., Ltd |
Li, Qingqiu | Fudan University |
Zhao, Xiongwei | Harbin Institute of Technology |
Wang, Yue | Zhejiang University |
Lu, Huimin | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Keywords: Localization, Deep Learning for Visual Perception
Abstract: Visual place recognition (VPR) is crucial for robots to identify previously visited locations, playing an important role in autonomous navigation in both indoor and outdoor environments. However, most existing VPR datasets are limited to single-viewpoint scenarios, leading to reduced recognition accuracy, particularly in multi-directional driving or feature-sparse scenes. Moreover, obtaining additional data to mitigate these limitations is often expensive. This paper introduces a novel training paradigm to improve the performance of existing VPR networks by enhancing multi-view diversity within current datasets through uncertainty estimation and NeRF-based data augmentation. Specifically, we initially train NeRF using the existing VPR dataset. Then, our devised self-supervised uncertainty estimation network identifies places with high uncertainty. The poses of these uncertain places are input into NeRF to generate new synthetic observations for further training of VPR networks. Additionally, we propose an improved storage method for efficient organization of augmented and original training data. We conducted extensive experiments on three datasets and tested three different VPR backbone networks. The results demonstrate that our proposed training paradigm significantly improves VPR performance by fully utilizing existing data, outperforming other training approaches. We further validated the effectiveness of our approach on self-recorded indoor and outdoor datasets, consistently demonstrating superior results. Our dataset and code have been released at https://anonymous.4open.science/r/UGNA-VPR-DDF0.
|
|
ThAT15 |
206 |
Telerobotics and Teleoperation 1 |
Regular Session |
Chair: Peternel, Luka | Delft University of Technology |
|
10:30-10:35, Paper ThAT15.1 | |
HACTS: A Human-As-Copilot Teleoperation System for Robot Learning |
|
Xu, Zhiyuan | Midea Group |
Zhao, Yinuo | Beijing Institute of Technology |
Wu, Kun | Syracuse University |
Liu, Ning | Beijing Innovation Center of Humanoid Robotics |
Che, Zhengping | X-Humanoid |
Junjie, Ji | Beijing Innovation Center of Humanoid Robotics |
Liu, Chi Harold | Beijing Institute of Technology |
Tang, Jian | Midea Group (Shanghai) Co., Ltd |
Keywords: Telerobotics and Teleoperation, AI-Enabled Robotics, Learning from Demonstration
Abstract: Teleoperation is crucial for autonomous robot learning, especially in manipulation tasks involving data collection for Imitation Learning (IL) and Reinforcement Learning (RL), such as Vision-Language-Action (VLA) and Human-In-The-Loop RL (HITL RL). Existing teleoperation systems typically provide unilateral control to the robot, lacking the ability to synchronize status from the robot. In this work, we introduce HACTS (Human-As-Copilot Teleoperation System), a novel system that enables bilateral, real-time joint synchronization between the robot arm and the teleoperation hardware. This simple yet effective feedback mechanism, akin to a steering wheel in an autonomous vehicle, allows the human copilot to intervene when necessary while simultaneously collecting action-correction data for future learning. The HACTS hardware is implemented using 3D-printed components and low-cost, off-the-shelf motors, demonstrating both accessibility and scalability. Our experiments show that HACTS significantly enhances the performance of representative IL and RL methods for robot manipulation. Specifically, action-correction data improves the recovery capabilities of VLA models and facilitates human-in-the-loop reinforcement learning. HACTS paves the way for more effective and interactive human-robot collaboration and data collection, advancing the capabilities of robot learning.
|
|
10:35-10:40, Paper ThAT15.2 | |
Six-DoF Hand-Based Teleoperation for Omnidirectional Aerial Robots |
|
Li, Jinjie | The University of Tokyo |
Li, Jiaxuan | Dalian University of Technology |
Kaneko, Kotaro | University of Tokyo |
Liu, Haokun | The University of Tokyo |
Shu, Liming | Dalian University of Technology |
Zhao, Moju | The University of Tokyo |
Keywords: Telerobotics and Teleoperation, Human Performance Augmentation, Aerial Systems: Applications
Abstract: Omnidirectional aerial robots offer full 6-DoF independent control over position and orientation, making them popular for aerial manipulation. Despite advancements in robotic autonomy, human operation remains essential in complex aerial environments. Existing teleoperation approaches for multirotors fail to fully leverage the additional DoFs provided by omnidirectional rotation. Additionally, the dexterity of human fingers should be exploited for more engaged interaction. In this work, we propose an aerial teleoperation system that brings the rotational flexibility of human hands into the unbounded aerial workspace. Our system includes two motion-tracking marker sets--one on the shoulder and one on the hand--along with a data glove to capture hand gestures. Using these inputs, we design four interaction modes for different tasks, including Spherical Mode and Cartesian Mode for long-range movement, Operation Mode for precise manipulation, as well as Locking Mode for temporary pauses, where the hand gestures are utilized for seamless mode switching. We evaluate our system on a vertically mounted valve-turning task in the real world, demonstrating how each mode contributes to effective aerial manipulation. This interaction framework bridges human dexterity with aerial robotics, paving the way for enhanced aerial teleoperation in unstructured environments.
|
|
10:40-10:45, Paper ThAT15.3 | |
Adaptive Motion Scaling in Teleoperated Robotic Surgery Based on Human Intention and Attention |
|
Zhai, Yiming | Shanghai Jiao Tong University |
Liu, Jingsong | Shanghai Jiao Tong University |
Luo, Yating | Shanghai Jiao Tong University |
Wang, Ziwei | Lancaster University |
Guo, Yao | Shanghai Jiao Tong University |
Keywords: Telerobotics and Teleoperation, Human Factors and Human-in-the-Loop, Physical Human-Robot Interaction
Abstract: In teleoperated surgery, the motion scaling factor directly influences both the operator's control precision of surgical instruments and operational comfort. Previous studies have revealed that the master manipulator state and the operator's gaze information can reflect the complexity of surgical operations and the operator's intention to some extent. Although these approaches enable real-time adjustment of scaling factors, they are limited by a narrow range of core parameters, and their results are significantly influenced by subjective factors. To tackle these challenges, this paper presents a multi-dimensional adaptive motion scaling strategy based on Bayesian optimization. The prediction of the operator's intention and attention is achieved by integrating multiple dimensional parameters, including master-slave manipulator states, gaze information, as well as pupillary data, all of which have been experimentally validated. Specifically, there exists a significant temporal synchronization between the Index of Pupillary Activity (IPA) and teleoperation tasks, which aligns with research on the correlation between IPA and attention levels. Furthermore, to evaluate the proposed adaptive scaling strategy, we combine subjective questionnaire surveys with objective metric assessments, effectively reducing the excessive influence of operators' personal conditions and proficiency levels on optimization results.
|
|
10:45-10:50, Paper ThAT15.4 | |
Teleoperated Teaching of Task and Impedance (TTTI): Multi-Modal Interface Extending Haptic Device for Robotic Skill Transfer |
|
Rots, Astrid | Delft University of Technology |
Peternel, Luka | Delft University of Technology |
Keywords: Telerobotics and Teleoperation, Learning from Demonstration, Haptics and Haptic Interfaces
Abstract: In this paper, we propose a concept of Teleoperated Teaching of Task and Impedance (TTTI) with a novel multi-modal interface that enables online teleoperated teaching of combined low-level impedance-regulation skills and high-level task decision-making skills using a single hand-held haptic device. To this end, we interactively switch the functionality of the haptic device for two modes of operation. To teach impedance-regulation low-level skills, we developed a novel stiffness command interface where the human operator uses the haptic device to manipulate the stiffness ellipsoid of the remote robotic arm endpoint in 3D space. For teaching high-level skills of how and when to employ low-level actions, we developed a GUI that enables a haptic device to remotely modify Behaviour Trees used to encode the robot's task decision-making process. The interface connects both teaching modes, where a newly demonstrated low-level skill appears in the Behaviour Tree at an operator-specified index. To demonstrate the main features of the proposed interface, we performed several proof-of-concept experiments on a teleoperation setup operating a remote shelf-stocker robot in a simulated supermarket environment. We examined the task of placing a product on a shelf that consists of several sub-tasks, where each involves different stiffness strategies, while the Behaviour Tree has to encode the task sequencing and decision-making process.
|
|
10:50-10:55, Paper ThAT15.5 | |
Visual-Haptic Model Mediated Teleoperation for Remote Ultrasound |
|
Black, David Gregory | University of British Columbia |
Tirindelli, Maria | ImFusion GmbH |
Salcudean, Septimiu E. | University of British Columbia |
Wein, Wolfgang | ImFusion GmbH |
Esposito, Marco | ImFusion GmbH |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Medical Robots and Systems
Abstract: Tele-ultrasound has the potential to greatly improve health equity for countless remote communities. However, practical scenarios involve potentially large time delays, which cause current implementations of telerobotic ultrasound (US) to fail. Using a local model of the remote environment to provide haptics to the expert operator can decrease teleoperation instability, but the delayed visual feedback remains problematic. This paper introduces a robotic tele-US system in which the local model is not only haptic, but also visual, by re-slicing and rendering a pre-acquired US sweep in real time to provide the operator with a preview of what the delayed image will resemble. A prototype system is presented and tested with 15 volunteer operators. It is found that visual-haptic model-mediated teleoperation (MMT) compensates completely for time delays up to 1000 ms round trip in terms of operator effort and completion time, while conventional MMT does not. Visual-haptic MMT also significantly outperforms MMT for longer time delays in terms of motion accuracy and force control. This proof-of-concept study suggests that visual-haptic MMT may facilitate remote robotic tele-US.
|
|
10:55-11:00, Paper ThAT15.6 | |
Online Imitation Learning for Manipulation Via Decaying Relative Correction through Teleoperation |
|
Pan, Cheng | Swiss Federal Institute of Technology Lausanne (EPFL) |
Cheng, Hung Hon | EPFL |
Hughes, Josie | EPFL |
Keywords: Telerobotics and Teleoperation, Imitation Learning, Human-Robot Collaboration
Abstract: Teleoperated robotic manipulators enable the collection of demonstration data, which can be used to train control policies through imitation learning. However, such methods can require significant amounts of training data to develop robust policies or adapt them to new and unseen tasks. While expert feedback can significantly enhance policy performance, providing continuous feedback can be cognitively demanding and time-consuming for experts. To address this challenge, we propose to use a cable-driven teleoperation system which can provide spatial corrections with 6 degrees of freedom to the trajectories generated by a policy model. Specifically, we propose a correction method termed Decaying Relative Correction (DRC), which applies the spatial offset vector provided by the expert as a temporary correction and thereby reduces the number of intervention steps required from the expert. Our results demonstrate that DRC reduces the required expert intervention rate by 30% compared to a standard absolute corrective method. Furthermore, we show that integrating DRC within an online imitation learning framework rapidly increases the success rate of manipulation tasks such as raspberry harvesting and cloth wiping.
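One plausible reading of a decaying relative correction is sketched below. This is not the authors' implementation; the exponential decay schedule, the 6-DoF offset representation, and the class interface are all assumptions introduced purely for illustration.

    # Illustrative sketch only: a temporary expert offset that fades out over steps.
    import numpy as np

    class DecayingCorrection:
        def __init__(self, decay=0.9):
            self.decay = decay                 # per-step decay factor (assumed)
            self.offset = np.zeros(6)          # 6-DoF offset: translation + rotation (axis-angle)

        def set_expert_offset(self, offset_6dof):
            # Called once when the expert intervenes with a relative correction.
            self.offset = np.asarray(offset_6dof, dtype=float)

        def apply(self, policy_action_6dof):
            # Add the current (decayed) offset to the policy's action, then shrink it.
            corrected = np.asarray(policy_action_6dof, dtype=float) + self.offset
            self.offset *= self.decay          # the correction "exists temporarily"
            return corrected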
|
|
11:00-11:05, Paper ThAT15.7 | |
A Monocular Vision-Based Robotic Arm Teleoperation Method for Human Arm Configuration Imitation |
|
Xiang, Jindong | Sun Yat-Sen University |
Pan, Zhijie | Shenzhen Campus of Sun Yat-Sen University |
Wang, Baichuan | Shenzhen Campus of Sun Yat-Sen University |
Xiang, Ruiqi | Sun Yat-Sen University |
Liu, Han | The Hong Kong Polytechnic University |
Li, Mengtang | Shenzhen Campus of Sun Yat-Sen University |
Keywords: Telerobotics and Teleoperation, Human Detection and Tracking, Imitation Learning
Abstract: Imitation-based teleoperation enables intuitive robot control in hazardous or hard-to-reach environments. Existing methods, however, lack an effective and quickly-deployable system that uses simple visual sensors to achieve end-effector control and human-like arm joint configuration imitation across various robotic arm structures. This paper therefore presents a teleoperation system that utilizes a single RGB camera and advanced computer vision techniques to capture human motion, coupled with a kinematic mapping method to transfer movements from human to robotic arms. The system generates robot motion that ensures both end-effector tracking and human-like joint configuration imitation, adaptable to diverse structures, including those with multiple offset links. Experiments demonstrate that the system produces robot arm poses more closely aligned with human configurations compared to traditional methods that overlook human pose. The performance of the end-effector tracking control and human arm shape imitation is evaluated, with no noticeable error observed when the robot completes its motion, and a maximum position error of 17.03% and a maximum orientation error of 0.0925 rad observed during motion, which are likely attributable to delays caused by filters and communication. Additionally, the system's ability to actively avoid obstacles via arm configuration imitation in specific scenarios is confirmed. Supplementary video is available.
|
|
11:05-11:10, Paper ThAT15.8 | |
AirTouch: A Low-Cost Versatile Visuotactile Feedback System for Enhanced Robotic Teleoperation |
|
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Li, Xingting | University of Science and Technology Beijing |
Huang, Yan | Tsinghua University |
Zheng, Ken Jiankun | University of California, Berkeley |
Yu, Ran | Tsinghua University |
Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School |
Ding, Wenbo | Tsinghua University |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Force and Tactile Sensing
Abstract: Vision-based teleoperation systems are widely used due to their cost-effectiveness and intuitive operation. However, these systems often suffer from challenges such as hand occlusions, environmental variability, and the lack of tactile feedback, limiting their precision and applicability in complex tasks. To address these limitations, we present AirTouch, a novel, low-cost visuotactile teleoperation system that integrates air pressure-based tactile feedback with lightweight hand pose estimation. AirTouch features an inflatable tactile bubble that provides adjustable feedback through closed-loop pneumatic control, enhancing the operator's sense of interaction with remote environments. The system's robust hand-tracking algorithm ensures accurate control even under dynamic and occlusion-prone conditions, while its hardware design eliminates the need for wearable devices, enabling intuitive operation. AirTouch supports a wide range of robotic end-effectors, including dexterous hands, parallel grippers, and suction cups, demonstrating versatility across multiple platforms. Extensive experiments validate AirTouch's performance, achieving high precision in hand pose estimation and a 91% success rate in complex teleoperation tasks, all with a hardware cost as low as 39. These results highlight AirTouch as a scalable and practical solution for enhancing robotic teleoperation across industrial, medical, and hazardous scenarios.
|
|
ThAT16 |
207 |
Task and Motion Planning 1 |
Regular Session |
Co-Chair: Ge, Yueguang | Institute of Automation, Chinese Academy of Sciences |
|
10:30-10:35, Paper ThAT16.1 | |
Double-Feedback: Enhancing Large Language Models Reasoning in Robotic Tasks by Knowledge Graphs |
|
Wang, Haitao | Institute of Automation, Chinese Academy of Sciences |
Zhang, Shaolin | Institute of Automation, Chinese Academy of Sciences |
Wang, Shuo | Chinese Academy of Sciences |
Jiang, Tianyu | Institute of Automation, Chinese Academy of Sciences |
Ge, Yueguang | Institute of Automation, Chinese Academy of Sciences |
Keywords: Task and Motion Planning, Task Planning, Representation Learning
Abstract: Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities. However, in real-world robotics tasks, LLMs face grounding issues and lack precise feedback, resulting in generated solutions that run counter to reality. In this paper, we present Double-Feedback, a method that enhances LLM reasoning through knowledge graphs (KGs). The KG plays three key roles in Double-Feedback: prompting LLMs to generate solutions, representing task scenarios, and validating solutions to provide feedback. We design structured knowledge prompts that convey the knowledge background of the task, example solutions, revision principles, and robotic tasks to the LLM. We also introduce distributed representations to quantify task scenarios with interpretability. Based on structured knowledge prompts and distributed representations, we utilize the KG to evaluate the feasibility of each step before execution and validate the effectiveness of the solution after the task is completed. LLMs can adapt and reprogram solutions based on KG feedback. Extensive experiments have shown that Double-Feedback is superior to previous work on the ALFRED benchmark. Additionally, ablation studies have shown that Double-Feedback guides LLMs in generating solutions that align with real-world robotic tasks.
|
|
10:35-10:40, Paper ThAT16.2 | |
Hierarchical Reinforcement Learning for Swarm Confrontation with High Uncertainty (I) |
|
Wu, Qizhen | Beihang University |
Liu, Kexin | Beihang University |
Chen, Lei | Beijing Institute of Technology |
Lv, Jinhu | Beihang University |
Keywords: Task and Motion Planning, Reinforcement Learning, Multi-Robot Systems
Abstract: In swarm robotics, confrontation, including the pursuit-evasion game, is a key scenario. High uncertainty caused by unknown opponents' strategies, dynamic obstacles, and insufficient training complicates the action space into a hybrid decision process. Although deep reinforcement learning is significant for swarm confrontation because it can handle various swarm sizes, as an end-to-end implementation it cannot deal with this hybrid process. Here, we propose a novel hierarchical reinforcement learning approach consisting of a target allocation layer, a path planning layer, and the underlying dynamic interaction mechanism between the two layers, which indicates the quantified uncertainty. It decouples the hybrid process into discrete allocation and continuous planning layers, with a probabilistic ensemble model to quantify the uncertainty and regulate the interaction frequency adaptively. Furthermore, to overcome the unstable training process introduced by the two layers, we design an integration training method including pre-training and cross-training, which enhances the training efficiency and stability. Experimental results from comparison, ablation, and real-robot studies validate the effectiveness and generalization performance of our proposed approach. In our defined experiments with twenty to forty agents, the win rate of the proposed method reaches around ninety percent, outperforming other traditional methods.
|
|
10:40-10:45, Paper ThAT16.3 | |
Hierarchy Coverage Path Planning with Proactive Extremum Prevention in Unknown Environments |
|
Li, Lin | National University of Defense Technology |
Shi, Dianxi | Defense Innovation Institute |
Jin, Songchang | Defense Innovation Institute |
Zhou, Xing | National University of Defense Technology |
Li, Yahui | Academy of Military Sciences |
Bai, Bin | He Bei University of Technology |
Keywords: Task and Motion Planning, Motion and Path Planning, Simulation and Animation
Abstract: Local extrema are a crucial factor affecting the efficiency of online coverage path planning (CPP). Most online CPP methods generate coverage motions point-by-point in unknown environments. However, these solutions ignore efficient global coverage and can result in local extrema. This paper presents a hierarchy coverage path planning approach (HCPP) with proactive extremum prevention. HCPP incrementally generates coverage tasks and produces coverage motions in a global-to-local planning manner. Global planning generates a traversal sequence of all coverage tasks, and local planning provides a route from one task to the next. By maintaining the connectivity of the uncovered area from both a global and local perspective, HCPP avoids the local extrema caused by separated areas. The effectiveness of HCPP was confirmed by multiple simulations and physical experiments in a laboratory setting on an Ackermann robot. Experimental results indicate that HCPP reduces coverage times by preventing local extrema while achieving complete coverage.
|
|
10:45-10:50, Paper ThAT16.4 | |
STAMP: Differentiable Task and Motion Planning Via Stein Variational Gradient Descent |
|
Lee, Yewon | University of Toronto |
Li, Andrew | University of Toronto |
Huang, Philip | Carnegie Mellon University |
Heiden, Eric | NVIDIA |
Jatavallabhula, Krishna Murthy | MIT |
Damken, Fabian | University of Twente |
Smith, Kevin | Massachusetts Institute of Technology |
Nowrouzezahrai, Derek | McGill University |
Ramos, Fabio | University of Sydney, NVIDIA |
Shkurti, Florian | University of Toronto |
Keywords: Task and Motion Planning, Probabilistic Inference
Abstract: Planning for sequential robotics tasks often requires integrated symbolic and geometric reasoning. Task and Motion Planning (TAMP) algorithms typically solve these problems by performing a tree search over high-level task sequences while checking for kinematic and dynamic feasibility. This can be inefficient because, typically, candidate task plans resulting from the tree search ignore geometric information. This often leads to motion planning failures that require expensive backtracking steps to find alternative task plans. We propose a novel approach to TAMP called Stein Task and Motion Planning (STAMP) that relaxes the hybrid optimization problem into a continuous domain. This allows us to leverage gradients from differentiable physics simulation to fully optimize discrete and continuous plan parameters for TAMP. In particular, we solve the optimization problem using a gradient-based variational inference algorithm called Stein Variational Gradient Descent. This allows us to find a distribution of solutions within a single optimization run. Furthermore, we use an off-the-shelf differentiable physics simulator that is parallelized on the GPU to run parallelized inference over diverse plan parameters. We demonstrate our method on a variety of problems and show that it can find multiple diverse plans in a single optimization run while also being significantly faster than existing approaches. https://rvl.cs.toronto.edu/stamp
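For readers unfamiliar with the inference algorithm named in the abstract, the sketch below shows one generic Stein Variational Gradient Descent update with an RBF kernel; it is not the authors' code, and the kernel bandwidth and step size are assumptions.

    # Illustrative sketch only: a single SVGD particle update with an RBF kernel.
    import numpy as np

    def rbf_kernel(x, h=1.0):
        # x: (n, d) particle positions; returns kernel matrix and grad_{x_j} k(x_j, x_i)
        diff = x[:, None, :] - x[None, :, :]          # diff[i, j] = x_i - x_j, shape (n, n, d)
        sq = (diff ** 2).sum(-1)                      # squared distances, (n, n)
        k = np.exp(-sq / (2 * h ** 2))
        grad_k = diff * k[..., None] / h ** 2         # grad_{x_j} k(x_j, x_i)
        return k, grad_k

    def svgd_step(x, grad_log_p, step=1e-2, h=1.0):
        # x: (n, d) particles; grad_log_p: (n, d) gradients of the log target density at x
        n = x.shape[0]
        k, grad_k = rbf_kernel(x, h)
        # phi(x_i) = (1/n) * sum_j [ k(x_j, x_i) * grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
        phi = (k @ grad_log_p + grad_k.sum(axis=1)) / n
        return x + step * phi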
|
|
10:50-10:55, Paper ThAT16.5 | |
Make a Donut: Hierarchical EMD-Space Planning for Zero-Shot Deformable Manipulation with Tools |
|
You, Yang | Stanford University |
Shen, William B. | Stanford University |
Deng, Congyue | Stanford |
Geng, Haoran | University of California, Berkeley |
Wei, Songlin | Soochow University |
Wang, He | Peking University |
Guibas, Leonidas | Stanford University |
Keywords: Task Planning, Manipulation Planning, Deep Learning Methods
Abstract: Deformable object manipulation stands as one of the most captivating yet formidable challenges in robotics. While previous techniques have predominantly relied on learning latent dynamics through demonstrations, typically represented as either particles or images, there exists a pertinent limitation: acquiring suitable demonstrations, especially for long-horizon tasks, can be elusive. Moreover, basing learning entirely on demonstrations can hamper the model's ability to generalize beyond the demonstrated tasks. In this work, we introduce a demonstration-free hierarchical planning approach capable of tackling intricate long-horizon deformable manipulation tasks without necessitating any training. We employ large language models (LLMs) to articulate a high-level, stage-by-stage plan corresponding to a specified task. For every individual stage, the LLM provides both the tool's name and the Python code to craft intermediate subgoal point clouds. With the tool and subgoal for a particular stage at our disposal, we present a granular closed-loop model predictive control strategy. This leverages Differentiable Physics with Point-to-Point correspondence (DiffPhysics-P2P) loss in the earth mover distance (EMD) space, applied iteratively. Experimental findings affirm that our technique surpasses multiple benchmarks in dough manipulation, spanning both short and long horizons. Remarkably, our model demonstrates robust generalization capabilities to novel and previously unencountered complex tasks without any preliminary demonstrations. We further substantiate our approach with experimental trials on real-world robotic platforms. Our project page: https://qq456cvb.github.io/projects/donut.
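As a small illustration of the EMD-space objective referenced above (not the authors' code), the sketch below computes the earth mover's distance between two equal-size, uniformly weighted point clouds as an optimal point-to-point assignment; equal cardinality and uniform weights are simplifying assumptions.

    # Illustrative sketch only: point-cloud EMD via optimal assignment.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def point_cloud_emd(source, target):
        # source, target: (N, 3) point clouds of equal size
        cost = cdist(source, target)              # pairwise Euclidean costs, (N, N)
        rows, cols = linear_sum_assignment(cost)  # optimal point-to-point matching
        return cost[rows, cols].mean(), cols      # mean transport cost and correspondence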
|
|
10:55-11:00, Paper ThAT16.6 | |
ReplanVLM: Replanning Robotic Tasks with Visual Language Models |
|
Mei, Aoran | Fudan University |
Zhu, Guo-Niu | Fudan Unversity |
Zhang, Huaxiang | Fudan University |
Gan, Zhongxue | Fudan University |
Keywords: Task Planning, Human-Robot Collaboration
Abstract: Large language models (LLMs) have gained increasing popularity in robotic task planning due to their exceptional abilities in text analytics and generation, as well as their broad knowledge of the world. However, they fall short in decoding visual cues. LLMs have limited direct perception of the world, which leads to a deficient grasp of the current state of the world. By contrast, the emergence of visual language models (VLMs) fills this gap by integrating visual perception modules, which can enhance the autonomy of robotic task planning. Despite these advancements, VLMs still face challenges, such as the potential for task execution errors, even when provided with accurate instructions. To address such issues, this letter proposes a ReplanVLM framework for robotic task planning. In this study, we focus on error correction interventions. An internal error correction mechanism and an external error correction mechanism are presented to correct errors in their corresponding phases. A replan strategy is developed to replan tasks or correct error codes when task execution fails. Experimental results on real robots and in simulation environments have demonstrated the superiority of the proposed framework, with higher success rates and robust error correction capabilities in open-world tasks.
|
|
11:00-11:05, Paper ThAT16.7 | |
Group-Aware Robot Navigation in Crowds Using Spatio-Temporal Graph Attention Network with Deep Reinforcement Learning |
|
Lu, Xiaojun | Jiangsu University of Science and Technology |
Faragasso, Angela | Finger Vision Inc |
Wang, Yongdong | The University of Tokyo |
Yamashita, Atsushi | The University of Tokyo |
Asama, Hajime | The University of Tokyo |
Keywords: Social HRI, Human-Aware Motion Planning, Collision Avoidance
Abstract: Robots are becoming essential in human environments, requiring them to behave in a socially compliant manner. Although previous learning-based methods have shown potential in social navigation, most have treated pedestrians as individuals, failing to account for group level interactions. Additionally, these methods have modeled pairwise interactions only in the spatial domain, overlooking the temporal evolution of relations among agents. In this letter, the above limitations are addressed by proposing a novel spatio-temporal graph attention network that explicitly models group level interactions in both spatial and temporal domains. Specifically, a novel group-awareness mechanism is designed to model group-aware behaviors, and a new network is proposed to capture spatio-temporal features of relations among agents while leveraging the model-free deep reinforcement learning to optimize the group-aware navigation policy. The test results show that our approach outperforms the baselines in all metrics in both simulation and real-world experiments. Furthermore, quantitative analysis of questionnaire responses further verifies the benefits of our method in group awareness and social compliance.
|
|
11:05-11:10, Paper ThAT16.8 | |
Somersaulting Jump of Wheeled Bipedal Robot: A Comprehensive Planning and Control Strategy |
|
Tang, Shuang | Nankai University |
Lu, Biao | Nankai University |
Cao, Haixin | Nankai University |
Fang, Yongchun | Institute of Robotics and Automatic Information System, College |
Keywords: Constrained Motion Planning, Dynamics, Underactuated Robots
Abstract: The wheeled bipedal robot (WBR) is a type of robot with strong athletic ability. It combines the advantages of wheeled and legged robots while possessing efficient motion performance and good terrain adaptability. However, research on the jumping motion of WBR is currently limited, mainly focusing on relatively simple jumping actions. This paper explores the WBR's somersaulting jump, expanding the degrees of freedom in jumping actions and further enhancing its mobility. A comprehensive planning and control strategy is proposed. Specifically, the paper first designs the entire process of somersaulting jump, including preparation, flight, and recovery, based on human motion. Building on this, the dynamic models for both phases are derived. Critical states at take-off and landing are carefully designed. Moreover, trajectory planning considers constraints such as system dynamics, joint limits, and torque restrictions, ensuring the rationality of the trajectory. Subsequently, effective controllers are constructed to ensure the stability of the somersaulting jump motion. Finally, the effectiveness and adaptability of the overall planning and control strategy are verified through physical simulation.
|
|
ThAT17 |
210A |
Field Robots 1 |
Regular Session |
Chair: Manoonpong, Poramate | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
|
10:30-10:35, Paper ThAT17.1 | |
Robotic Inspection and Data Analytics to Localize and Visualize the Structural Defects of Concrete Infrastructure (I) |
|
Feng, Jinglun | The City College of New York |
Shang, Bo | The City College of New York |
Hoxha, Ejup | The City College of New York |
Hernandez Montiel, Cesar Gilberto | City College of New York |
He, Yang | The City College of New York |
Wang, Weihan | Stevens Institute of Technology |
Xiao, Jizhong | The City College of New York |
Keywords: Climbing Robots, Computer Vision for Automation, Field Robots
Abstract: This paper presents an innovative robotic inspection system designed to enhance the detection and analysis of structural defects in concrete infrastructure. The proposed inspection system is comprised of three modules: a robotic data collection module, a visual inspection module, and a subsurface mapping module. The robotic data collection module features an omnidirectional robotic platform, designed to move sideways without spinning. It is equipped with Ground Penetrating Radar (GPR) and RGB-D cameras, facilitating systematic data collection across construction sites. The visual inspection module employs a learning-based method, InspectionNet++, to analyze the frames for surface defects such as cracks, spalls, and stains, providing high accuracy and metric measurements of the defects. The subsurface mapping module processes the GPR data to detect and visualize hidden defects, creating a comprehensive map that correlates these with visible surface anomalies. Field tests demonstrate the system’s ability to automate construction structural inspection with improved efficiency and precision. Additionally, the customized visualization software is introduced to enable intuitive and interactive exploration of the detected defects within a unified interface. By automating data collection and enhancing defect detection through learning algorithms, the system not only speeds up the inspection process but also increases the reliability of infrastructure evaluations, supporting more informed maintenance decisions.
|
|
10:35-10:40, Paper ThAT17.2 | |
Distance and Collision Probability Estimation from Gaussian Surface Models |
|
Goel, Kshitij | Carnegie Mellon University |
Tabib, Wennie | Carnegie Mellon University |
Keywords: Field Robots, Collision Avoidance, Reactive and Sensor-Based Planning
Abstract: This paper describes methodologies to estimate the collision probability, Euclidean distance and gradient between a robot and a surface, without explicitly constructing a free space representation. The robot is assumed to be an ellipsoid, which provides a tighter approximation for navigation in cluttered and narrow spaces compared to the commonly used spherical model. Instead of computing distances over point clouds or high-resolution occupancy grids, which is expensive, the environment is modeled using compact Gaussian mixture models and approximated via a set of ellipsoids. A parallelizable strategy to accelerate an existing ellipsoid-ellipsoid distance computation method is presented. Evaluation in 3D environments demonstrates improved performance over state-of-the-art methods. Execution times for the approach are within a few microseconds per ellipsoid pair using a single-thread on low-power embedded computers.
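A minimal sketch of the environment representation described above (not the authors' code): fit a Gaussian mixture to surface points and convert each component into a confidence ellipsoid. The number of components and the confidence level are assumptions.

    # Illustrative sketch only: GMM surface model approximated by a set of ellipsoids.
    import numpy as np
    from scipy.stats import chi2
    from sklearn.mixture import GaussianMixture

    def gmm_to_ellipsoids(points, n_components=8, confidence=0.95):
        # points: (N, 3) surface samples
        gmm = GaussianMixture(n_components=n_components, covariance_type="full").fit(points)
        scale = chi2.ppf(confidence, df=3)   # squared Mahalanobis radius for the confidence level
        ellipsoids = []
        for mean, cov in zip(gmm.means_, gmm.covariances_):
            eigvals, eigvecs = np.linalg.eigh(cov)
            semi_axes = np.sqrt(scale * eigvals)           # ellipsoid semi-axis lengths
            ellipsoids.append((mean, eigvecs, semi_axes))  # center, axis directions, lengths
        return ellipsoids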
|
|
10:40-10:45, Paper ThAT17.3 | |
An Online Motion Planning Framework for Navigating Torpedo-Shaped Autonomous Underwater Vehicles in Unknown Underwater Environments |
|
Yu, Tianyou | CSSC Intelligent Innovation Research Institute |
Dong, Zhaoxuan | CSSC Intelligent Innovation Research Institute |
Wu, Yu | CSSC Intelligent Innovation Research Institute |
Fu, Xingjie | China State Shipbuilding System Engineering Research Institute |
Keywords: Field Robots, Autonomous Vehicle Navigation, Motion and Path Planning
Abstract: Navigating unknown underwater environments is a significant challenge for autonomous underwater vehicles (AUVs), especially those with torpedo-like shapes. Lacking a prior map, these vehicles rely on real-time sensor data for perception. Although online motion planning addresses this challenge, many existing methods are primarily tested on more maneuverable robots, such as multicopters and ground vehicles, and do not account for the unique kinematics of torpedo-shaped AUVs, such as limited lateral movement, or the need for 3D motion planning. In this paper, we propose an online motion planning system specifically designed for torpedo-shaped AUVs to navigate 3D underwater terrain without prior environmental knowledge. The system employs a receding horizon planning framework to ensure safe navigation by replanning the trajectory when collisions are detected or the planning horizon is reached. For trajectory generation, a search-based method is used and utilizes a 3D Dubins curve heuristic to guide the generation of an optimal 3D trajectory that adheres to the AUV's kinematic constraints. To further enhance safety and smoothness, gradient-based optimization is applied to refine the trajectory. Experiments in simulated environments validate the proposed method, demonstrating its ability to generate safe trajectories for AUVs in complex and unknown environments. We release our code as an open-source package.
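The replan-and-execute structure described in this abstract can be sketched generically as below; `plan_fn` and `in_collision_fn` are placeholders for the paper's Dubins-guided search and sensor-based collision checks, and the horizon and goal tolerance are arbitrary illustrative values.

```python
import numpy as np

def receding_horizon_navigate(start, goal, plan_fn, in_collision_fn,
                              horizon=20, max_iters=500):
    """Skeleton of a receding-horizon planning loop (illustrative only).

    plan_fn(state, goal, horizon) -> list of states (a local trajectory)
    in_collision_fn(state)        -> True if newly sensed obstacles hit the state
    A real system would replace these with the search-based planner using a
    3D Dubins heuristic and the gradient-based refinement from the paper.
    """
    state = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for _ in range(max_iters):
        traj = plan_fn(state, goal, horizon)
        for next_state in traj:
            if in_collision_fn(next_state):
                break                      # trigger replanning from the current state
            state = np.asarray(next_state, dtype=float)
            if np.linalg.norm(state - goal) < 1.0:
                return True, state         # goal reached
        # horizon reached or collision detected -> loop replans
    return False, state

# toy usage: straight-line "planner" in free space
goal = np.array([30.0, 0.0, -5.0])
plan = lambda s, g, h: [s + (g - s) * (i + 1) / h for i in range(h)]
print(receding_horizon_navigate(np.zeros(3), goal, plan, lambda s: False))
```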
|
|
10:45-10:50, Paper ThAT17.4 | |
Automated UAV-Based Wind Turbine Blade Inspection: Blade Stop Angle Estimation and Blade Detail Prioritized Exposure Adjustment |
|
Shi, Yichuan | Sun Yat-Sen University |
Liu, Hao | Power China Zhongnan Engineering Corporation Limited |
Zheng, Haowen | Sun Yat-Sen University |
Yu, Haowen | Sun Yat-Sen University |
Liang, Xianqi | Sun Yat-Sen University |
Li, Jie | Powerchina Zhongnan Engineering Corporation |
Ma, Minmin | Power China Zhongnan Engineering Corporation Limited |
Lyu, Ximin | Sun Yat-Sen University |
Keywords: Field Robots
Abstract: Unmanned aerial vehicles (UAVs) are critical in the automated inspection of wind turbine blades. Nevertheless, several issues persist in this domain. Firstly, existing inspection platforms encounter challenges in meeting the demands of automated inspection tasks and scenarios. Moreover, current blade stop angle estimation methods are vulnerable to environmental factors, restricting their robustness. Additionally, there is an absence of real-time blade detail prioritized exposure adjustment during capture, where lost details cannot be restored through post-optimization. To address these challenges, we introduce a platform and two approaches. Initially, a UAV inspection platform is presented to meet the automated inspection requirements. Subsequently, a Fermat point based blade stop angle estimation approach is introduced, achieving higher precision and success rates. Finally, we propose a blade detail prioritized exposure adjustment approach to ensure appropriate brightness and preserve details during image capture. Extensive tests, comprising over 120 flights across 10 wind turbine models in 5 operational wind farms, validate the effectiveness of the proposed approaches in enhancing inspection autonomy.
|
|
10:50-10:55, Paper ThAT17.5 | |
REFINE-Bot: Furnace Cleaning Robot for Heat-Transfer Efficiency Improvement |
|
Punapanont, Sanpoom | Vidyasirimedhi Institute of Science and Technology |
Pairam, Thipawan | Vidyasirimedhi Institute of Science and Technology |
Ausrivong, Wasuthorn | VISTEC : Vidyasirimedhi Institute of Science and Technology |
Manoonpong, Poramate | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
Keywords: Service Robotics, Field Robots, Robotics in Hazardous Fields
Abstract: In the oil and gas industry, scale accumulation on radiant coils within furnaces significantly reduces heat-transfer efficiency, leading to increased energy consumption. This paper introduces the REFINE-bot, a robotic system developed to improve the descaling process and operational efficiency in fired heaters. Unlike existing solutions, which are mainly designed for specific tube sizes and positions and focused on inspection, the REFINE-bot integrates a clamping mechanism that adapts to both vertical and horizontal tubes of varying diameters (3"–8"), even in complex environments with narrow tube-to-tube and wall-to-tube gaps. We evaluate three different cleaning tools—a Knot End Brush, Wire Cup Brush, and Sandpaper—under simulated hard scale conditions in a lab environment, revealing cleaning tool limitations and optimal safety parameters to prevent tube damage. An adaptive force controller is also developed to adjust the position of the cleaning tool relative to the tube surface online, addressing uneven scale heights. Additionally, the robot is successfully deployed in a real furnace setting to test its clamping and cleaning mechanisms on actual scale deposits. The results demonstrate superior cleaning performance on the radiant coils of a furnace compared to traditional manual descaling methods, as evaluated by measured reductions in scale thickness and infrared thermal imaging.
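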
|
|
10:55-11:00, Paper ThAT17.6 | |
SEM-RRT*: Fast Risk Assessment and Path Planning in Uneven Terrain Using Statistical Elevation Map |
|
Dong, Xudong | Xi'an Jiaotong University |
Liu, Jianyi | Xi'an Jiaotong University |
Shi, Yuhong | Xi'an Jiaotong University |
Wang, Wenzhe | Xi'an Jiaotong University |
Keywords: Field Robots, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: Path planning in uneven terrain scenarios is one of the core capabilities of intelligent off-road robots and vehicles. The complex terrain undulations often cause bumpy motion and sharp turns along the planned path, making smooth and safe path planning challenging. Most path planning methods in this community rely on dense point cloud maps as direct inputs, which inevitably incur high computational overhead for map representation and terrain assessment. To address these problems, we propose a novel path planning method for uneven terrain, SEM-RRT*, which balances planning quality and computational efficiency. First, we propose a map representation, the Statistical Elevation Map (SEM), which is lightweight to store and compute. Then, to enable fast terrain risk assessment, a terrain risk filter with omnidirectional and multi-scale characteristics is designed. Finally, we incorporate multi-objective cost evaluation, backward search, and rolling optimization strategies into the Informed RRT* framework, leveraging its path optimality on large-scale maps. Extensive experiments in challenging terrain scenarios, such as hills, canyons, and volcanic landscapes, show that SEM-RRT* outperforms existing methods in both path quality and computational time.
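One lightweight way to realize a statistical elevation map of the kind named in this abstract is a 2D grid of running per-cell elevation statistics; the class below is an illustrative sketch (the cell layout, Welford-style updates, and the variance-as-risk reading are assumptions, not the paper's exact design).

```python
import numpy as np

class StatisticalElevationMap:
    """Minimal 2D grid that keeps per-cell elevation statistics.

    Each cell stores a running count, mean, and variance of the z-values of
    the points that fall into it (Welford's online update), which is one
    compact way to summarize terrain without a dense point cloud.
    """
    def __init__(self, size=(200, 200), resolution=0.5):
        self.res = resolution
        self.count = np.zeros(size)
        self.mean = np.zeros(size)
        self.m2 = np.zeros(size)              # sum of squared deviations

    def insert(self, points):
        # toy indexing: assumes non-negative x, y within the grid extent
        for x, y, z in points:
            i, j = int(x / self.res), int(y / self.res)
            self.count[i, j] += 1
            delta = z - self.mean[i, j]
            self.mean[i, j] += delta / self.count[i, j]
            self.m2[i, j] += delta * (z - self.mean[i, j])

    def roughness(self, i, j):
        """Per-cell elevation variance, usable as a simple terrain-risk term."""
        n = self.count[i, j]
        return self.m2[i, j] / (n - 1) if n > 1 else 0.0

sem = StatisticalElevationMap()
sem.insert([(1.0, 1.0, 0.2), (1.1, 1.2, 0.4), (1.2, 1.1, 0.1)])
print(sem.mean[2, 2], sem.roughness(2, 2))
```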
|
|
11:00-11:05, Paper ThAT17.7 | |
Nezha-T: A Bi-Floating State Lightweight Tail-Sitter HAUV |
|
Han, Xiqiao | Shanghai Jiao Tong University |
Bi, Yuanbo | Shanghai Jiao Tong University |
Zhang, Ziyang | Shanghai Jiao Tong University |
Zeng, Zheng | Shanghai Jiao Tong University |
Lian, Lian | Shanghai Jiaotong University |
Keywords: Field Robots, Marine Robotics, Aerial Systems: Applications
Abstract: Hybrid aerial-underwater vehicles (HAUVs) are attracting significant interest for their unique capability to operate across both air and water. However, achieving a lightweight design coupled with efficient cross-domain performance remains a formidable challenge. This paper introduces Nezha-T, a novel, ultra-lightweight HAUV featuring a dual-stable floating state capability. This is achieved through an innovative center-of-gravity (CoG) arrangement method, which enables seamless transitions between upright and horizontal floating postures. The ability to float stably in either orientation is crucial for stable water entry and exit, mitigating impact forces on the vehicle and its payload. Furthermore, to counteract the residual buoyancy inherent in this design, a zero-lift pitch angle is incorporated into the control system, improving depth-keeping and pitch control performance. The proposed design was rigorously validated through computational fluid dynamics (CFD) simulations, pool experiments, and open-water field tests. The results confirm the feasibility of the dual-stable floating states, demonstrate stable cross-domain traversal, verify the effectiveness of the depth-keeping control system, and validate the vehicle's fixed-wing flight capability.
|
|
11:05-11:10, Paper ThAT17.8 | |
LiDAR-IMU Fusion System with Adaptive Scanning for High-Resolution Deformation Monitoring of Underground Infrastructures |
|
Li, Menggang | China University of Mining and Technology |
Li, Zhuoqi | China University of Mining and Technology |
Hu, Kun | China University of Mining and Technology |
Hu, Eryi | Information Institute, Ministry of Emergency Management of the P |
Tang, Chaoquan | China University of Mining and Technology |
Zhou, Gongbo | China University of Mining and Technology |
Keywords: Field Robots, Robotics in Hazardous Fields, Mining Robotics
Abstract: A LiDAR-IMU fusion system utilizing adaptive scanning is developed for high-resolution deformation monitoring of underground coal mine infrastructure, such as sealed walls. The system integrates data from a LiDAR scanner and an IMU, employing a penalty function-based scanning strategy to optimize point cloud quality. Following feature extraction and state estimation, a 3D point cloud model of the sealed wall is constructed. Deformation monitoring is achieved through point cloud segmentation, registration, and error analysis across multiple time intervals. A methodology for optimizing equipment placement on walls of varying dimensions is proposed to efficiently capture deformation details. Two metrics, PATD and PARE, are introduced to evaluate system performance. Calibration experiments using standardized boards and blocks are designed to determine optimal monitoring parameters, including distance, height, and sampling frequency. Simulated deformation experiments under real-world conditions validate the system’s rationality and accuracy.
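The deformation-monitoring step ultimately compares registered point clouds across time intervals. The snippet below shows a crude cloud-to-cloud distance as a stand-in for the paper's segmentation, registration, and error analysis; it assumes the two scans are already expressed in a common frame, and the synthetic 1 cm bulge is purely illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def deformation_map(cloud_t0, cloud_t1):
    """Per-point deformation estimate between two registered scans.

    For every point in the later scan, return the distance to its nearest
    neighbour in the earlier scan. Registration (which the paper performs
    before this step) is assumed to have been done already.
    """
    tree = cKDTree(cloud_t0)
    dists, _ = tree.query(cloud_t1)
    return dists

rng = np.random.default_rng(0)
wall = rng.uniform(0, 1, size=(500, 3)) * [2.0, 2.0, 0.0]   # flat wall patch
bulged = wall + [0.0, 0.0, 0.01]                             # 1 cm deformation
print(deformation_map(wall, bulged).mean())
```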
|
|
ThAT18 |
210B |
Mapping 1 |
Regular Session |
Co-Chair: Duan, Ran | The Hong Kong Polytechnic University |
|
10:30-10:35, Paper ThAT18.1 | |
From Satellite to Street: Semantic and Depth Information for Enhanced Geo-Localization |
|
Zhu, Yilong | HKUST |
Jiao, Jianhao | University College London |
Wei, Hexiang | The Hong Kong University of Science and Technology |
Wu, Jin | University of Science and Technology Beijing |
Xue, Bohuan | South China Normal University |
Zhang, Shuyang | The Hong Kong University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: Mapping, Localization, Autonomous Vehicle Navigation
Abstract: Accurate positioning is essential for autonomous driving, but localization using 2D maps is challenging due to the domain gap between the perspective view and the 2D map, while GNSS accuracy is often limited by atmospheric effects, multipath, and signal blockages. We propose a novel positioning method that combines perspective view images with satellite images retrieved based on rough GNSS positions to achieve precise three-degree-of-freedom (3-DoF) pose estimation. Our method leverages the Swin Transformer for satellite image processing and semantic completion for monocular image analysis. By extracting depth and semantic information from monocular images, we convert these to overhead projections, effectively bridging the gap between different viewpoints. This cross-view transformation allows for precise alignment of features from monocular images onto semantically enriched satellite images. Additionally, we integrate a robust global position estimator using the semantic information from satellite images to further enhance accuracy and robustness. The experimental results demonstrate that our method excels in various complex scenarios; we improved the positioning accuracy to 80.67% within 1 m and the heading accuracy to 33.78% within 1°. However, longitudinal localization remains more challenging, with higher errors than lateral positioning.
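The overhead (bird's-eye-view) projection of per-pixel depth and semantics mentioned here can be illustrated with a plain pinhole-camera projection; the learned depth/semantic completion and the satellite branch are omitted, and the grid size, cell size, and camera height below are arbitrary assumptions.

```python
import numpy as np

def image_to_bev(depth, semantics, K, grid_size=100, cell=0.5, cam_height=1.6):
    """Project per-pixel depth and semantic labels onto an overhead (BEV) grid.

    Assumes a pinhole camera with intrinsics K looking along +z, with +y down.
    """
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx                          # right of camera
    y = (v - cy) * z / fy                          # below camera
    bev = np.zeros((grid_size, grid_size), dtype=np.int32)
    # keep points near the ground plane and in front of the camera
    mask = (z > 0) & (np.abs(y - cam_height) < 2.0)
    col = np.floor(x[mask] / cell + grid_size / 2).astype(int)   # lateral index
    row = np.floor(z[mask] / cell).astype(int)                   # forward index
    ok = (row >= 0) & (row < grid_size) & (col >= 0) & (col < grid_size)
    bev[row[ok], col[ok]] = semantics[mask][ok]
    return bev

# toy call: flat 5 m depth everywhere, a single semantic class
K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
bev = image_to_bev(np.full((480, 640), 5.0),
                   np.ones((480, 640), dtype=np.int32), K)
print(bev.shape, bev.sum())
```

Matching such a grid against the processed satellite image is then what yields the 3-DoF pose in the pipeline described above.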
|
|
10:35-10:40, Paper ThAT18.2 | |
ESFUSION: Enhanced LiDAR-Camera Fusion Architecture for HD Mapping at Intersection |
|
Yang, Suhui | Tianjin University |
Cui, Jingjing | Beijing Jiaotong University |
Keywords: Mapping, Sensor Fusion, Deep Learning for Visual Perception
Abstract: The construction of high-definition (HD) maps at intersections is crucial for autonomous driving and vehicle-to-infrastructure (V2I) collaboration. However, the semantic complexity of intersections poses significant challenges for HD mapping. Previous research has predominantly relied on traditional algorithms to process LiDAR or camera data, which often struggle with occlusion and inherent sensor limitations. To address these challenges, we propose a novel method, called ESFusion for Effective BEV Feature Selection and Fusion. To the best of our knowledge, this is the first work to leverage multi-modal data from intelligent roadside infrastructure, particularly LiDAR and cameras, for generating HD maps at intersections. To enhance multi-modal feature representation in Bird's Eye View (BEV), we design a Cross-modal Channel Exchange (CCE) module that creates multi-scale spatial features and facilitates LiDAR-camera information exchange across channels. Additionally, we introduce a Dynamic Feature Selection (DFS) module to adaptively select the most valuable information between modalities. Comprehensive evaluations on the DAIR-V2X dataset demonstrate that our method outperforms single-modal approaches and existing state-of-the-art fusion methods for vehicle-side applications. Moreover, experiments on the nuScenes dataset further highlight the high flexibility of our proposed module, showcasing its ability to be seamlessly integrated into existing multi-modal fusion workflows.
|
|
10:40-10:45, Paper ThAT18.3 | |
CODE: COllaborative Visual-UWB SLAM for Online Large-Scale Metric DEnse Mapping |
|
Chen, Lin | Northwestern Polytechnical University |
Jia, Xuan | Northwestern Polytechnical University |
Bu, Shuhui | Northwestern Polytechnical University |
Wang, Guangming | University of Cambridge |
Li, Kun | Northwestern Polytechnical University |
Xia, Zhenyu | Northwestern Polytechnical University |
Li, Xiaohan | Northwestern Polytechnical University |
Han, Pengcheng | Northwestern Polytechnical University |
Cao, Xuefeng | Information Engineering University |
Keywords: Mapping, SLAM
Abstract: This paper presents a novel collaborative online dense mapping system for multiple Unmanned Aerial Vehicles (UAVs). The system confers two primary benefits: it facilitates simultaneous UAVs co-localization and real-time dense map reconstruction, and it recovers the metric scale even in GNSS-denied conditions. To achieve these advantages, Ultra-wideband (UWB) measurements, monocular Visual Odometry (VO), and co-visibility observations are jointly employed to recover both relative positions and global UAV poses, thereby ensuring optimality at both local and global scales. In the proposed methodology, a two-stage optimization strategy is proposed to reduce optimization burden. Initially, relative Sim3 transformations among UAVs are swiftly estimated, with UWB measurements facilitating metric scale recovery in the absence of GNSS. Subsequently, a global pose optimization is performed to effectively mitigate cumulative drift. By integrating UWB, VO, and co-visibility data within this framework, both local geometric consistency and global pose accuracy are robustly maintained. Through comprehensive simulation and empirical real-world testing, we demonstrate that our system not only improves UAV positioning accuracy in challenging scenarios but also facilitates the high-quality, online integration of dense point clouds in large-scale areas. This research offers valuable contributions and practical techniques for precise, real-time map reconstruction using an autonomous UAV fleet, particularly in GNSS-denied environments.
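A minimal sketch of the UWB-based metric-scale recovery idea follows; it covers only the scale part of the Sim3 alignment, using a closed-form least-squares ratio in place of the paper's joint optimization, and the numbers are a made-up example.

```python
import numpy as np

def recover_metric_scale(vo_distances, uwb_ranges):
    """Estimate the metric scale of monocular VO from UWB ranges.

    vo_distances : inter-UAV distances computed from (scale-free) VO poses
    uwb_ranges   : corresponding UWB range measurements in metres
    Returns the least-squares scale s minimising ||s * vo - uwb||^2.
    """
    vo = np.asarray(vo_distances, dtype=float)
    uwb = np.asarray(uwb_ranges, dtype=float)
    return float(vo @ uwb / (vo @ vo))

# toy example: VO distances are all roughly a factor 2.5 too small
print(recover_metric_scale([1.0, 2.0, 1.5], [2.5, 5.1, 3.7]))
```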
|
|
10:45-10:50, Paper ThAT18.4 | |
GPGS: Geometric Priors for 3D Gaussian Splatting in Structural Environments |
|
Xu, Ziwei | Huawei Cloud Computing Technologies Co., Ltd |
Chen, Wen | Huawei Cloud Computing Technologies Co., Ltd |
Wang, Shilong | Huawei Cloud Computing Technologies Co., Ltd |
Ouyang, Zile | Huawei Cloud Computing Technologies Co., Ltd |
Bian, Shengwei | Huawei Cloud Computing Technologies Co., Ltd |
Zhou, Shunbo | Huawei |
Keywords: Mapping, SLAM, Sensor Fusion
Abstract: Recently, 3D Gaussian Splatting (3DGS) has garnered significant attention for its remarkable capacity to efficiently synthesize novel views with high fidelity. Nevertheless, 3DGS encounters challenges in accurately representing the geometry of real-world scenes. To address this issue, previous methods commonly utilize a depth-normal consistency term on 2D images to regulate the geometry of 3D Gaussians. However, these methods degrade in performance when dealing with low-texture surfaces or limited training views. In contrast, we present GPGS, a novel approach that directly regulates Gaussians in 3D space using Geometric Priors (GP). Given posed LiDAR scans and images, we organize the point clouds into a hierarchical voxel map. Each voxel contains occupancy information and explicitly reveals the internal planar or non-planar structure. We propose a novel divide-and-conquer strategy to separately regulate Gaussians in planar and non-planar voxels. For planar voxels, we design positional and rotational constraints to align Gaussians with the estimated plane. Considering the noisy ranging measurements of complex structures, we use depth-normal consistency to regularize Gaussians in non-planar voxels. Additionally, an occupancy-aware density control strategy is introduced to confine the densification process within occupied voxels, thus reducing artifacts. Extensive experiments on real-world datasets show that our proposed approach outperforms existing state-of-the-art methods in both geometric accuracy and visual quality.
|
|
10:50-10:55, Paper ThAT18.5 | |
3D Gaussian Splatting for Fine-Detailed Surface Reconstruction in Large-Scale Scene |
|
Chen, Shihan | The Hong Kong Polytechnic University |
Li, Zhaojin | Hong Kong Polytechnic University |
Chen, Zeyu | The Hong Kong Polytechnic University |
Yan, Qingsong | Wuhan University |
Shen, Gaoyang | Wuhan University |
Duan, Ran | The Hong Kong Polytechnic University |
Keywords: Mapping
Abstract: Recent developments in 3D Gaussian Splatting have made significant advances in surface reconstruction. However, scaling these methods to large-scale scenes remains challenging due to high computational demands and the complex dynamic appearances typical of outdoor environments. These challenges hinder the application in aerial surveying and autonomous driving. This paper proposes a novel solution to reconstruct large-scale surfaces with fine details, supervised by full-sized images. Firstly, we introduce a coarse-to-fine strategy to reconstruct a coarse model efficiently, followed by adaptive scene partitioning and sub-scene refining from image segments. Additionally, we integrate a decoupling appearance model to capture global appearance variations and a transient mask model to mitigate interference from moving objects. Finally, we expand the multi-view constraint and introduce a single-view regularization for texture-less areas. Our experiments were conducted on the publicly available dataset GauU-Scene V2, which was captured using unmanned aerial vehicles. To the best of our knowledge, our method outperforms existing NeRF-based and Gaussian-based methods, achieving high-fidelity visual results and accurate surface from full-size image optimization. Open-source code will be available on GitHub.
|
|
10:55-11:00, Paper ThAT18.6 | |
Adaptive Sliding Window Optimization for Multi-Modal LiDAR Inertial Odometry and Mapping |
|
Han, Guodong | Shanghaitech University |
Li, Wei | Institute of Computing Technology, Chinese Academy of Sciences |
Hu, Yu | Institute of Computing Technology Chinese Academy of Sciences |
Keywords: Mapping, SLAM, Localization
Abstract: Fixed-Lag smoothing is widely employed as a backend in localization tasks. Generally, increasing the window length leads to better accuracy, but demands more computational resources. Therefore, determining an appropriate window length and whether a fixed length should be maintained throughout the localization process are worth studying. Assuming independent and identically distributed noise based on the distance-independent characteristic of LiDAR ranging errors, we propose an uncertainty-based adaptive sliding window (ASW) strategy. Through mathematical derivation, the reference uncertainty is affected by the LiDAR feature distribution of each frame. Consequently, we develop a multi-modal LiDAR inertial odometry and mapping framework based on ASW, which integrates mechanical and solid-state LiDAR to enhance odometry accuracy and mapping density. By designing a joint matching module, our approach leverages the strengths of distinct scanning patterns. Additionally, we incorporate loop closure detection in the mapping process to minimize cumulative drift. Extensive experiments conducted on both public and self-collected datasets demonstrate the effectiveness of our method. Compared to the state-of-the-art method, our approach improves the average accuracy by 10.3%. We also provide an open-source implementation for further studies. https://github.com/wowhhhhgd/ASW-LIOM.
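The adaptive sliding window idea can be caricatured as a rule that grows or shrinks the fixed-lag window from an uncertainty cue; the thresholding logic, step size, and bounds below are illustrative assumptions, not the paper's derivation (which ties the reference uncertainty to the per-frame LiDAR feature distribution).

```python
def adapt_window(current_len, frame_uncertainty, reference_uncertainty,
                 min_len=5, max_len=30):
    """Grow or shrink a fixed-lag smoothing window from an uncertainty cue.

    A frame whose estimated uncertainty exceeds the reference suggests the
    window is too short (weak constraints), so it grows; otherwise it can
    shrink to save computation.
    """
    if frame_uncertainty > reference_uncertainty:
        return min(current_len + 1, max_len)
    return max(current_len - 1, min_len)

print(adapt_window(10, frame_uncertainty=0.8, reference_uncertainty=0.5))  # -> 11
```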
|
|
11:00-11:05, Paper ThAT18.7 | |
SLOOP: Aligned Coordinate System-Aided LiDAR LOOP Closure Detection Based on Semantic Node Graph Matching |
|
Tang, Yujie | Beijing Institute of Technology |
Wang, Meiling | Beijing Institute of Technology |
Lu, Haoyang | Beijing Institute of Technology |
Zhong, Jiagui | Beijing Institute of Technology |
Zuo, Sibo | Beijing Institute of Technology |
Deng, Yinan | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Mapping, Localization
Abstract: Loop closure detection and pose estimation play a significant role in correcting odometry trajectories and generating globally consistent point cloud maps. Geometric feature descriptor methods neglect object-level spatial topology features, resulting in inadequate performance in loop closure detection. Semantic graph-based loop closing methods improve upon this; however, they still follow the paradigm of "first generating descriptors, then comparing similarity, and finally achieving alignment (6D pose)". Specifically, they compare two semantic graphs that are not spatially aligned, which makes direct node correspondences impossible and necessitates extensive descriptor extraction and comparison. This decouples similarity comparison from 6D pose estimation, resulting in a cumbersome process that limits practicality and scalability. This paper proposes SLOOP, a novel descriptor-free semantic graph matching method that "aligns two graphs first, followed by efficient similarity comparison". Specifically, we first design a dedicated neighborhood semantic feature module to extract high-quality matched node pairs. Next, we seek the aligned coordinate systems for candidate loops based on the robust ground normal vectors and two suitable node pairs examined by the two-stage global geometric consistency metrics. Finally, the aligned coordinate systems enable efficient extraction and comparison of node spatial distributions. We conducted extensive outdoor loop detection experiments and compared SLOOP with various loop closure detection approaches, demonstrating its improved performance in loop closure detection and its practicality. The code and related materials are available at https://github.com/BIT-TYJ/SLOOP_c.
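The "align first, then compare" step hinges on constructing a coordinate frame from the ground normal and a matched node pair. The sketch below shows one such construction (a Gram-Schmidt frame); it illustrates the geometric idea only and is not the authors' exact procedure.

```python
import numpy as np

def aligned_frame(ground_normal, node_a, node_b):
    """Build a local coordinate frame from the ground normal and two nodes.

    z-axis: ground normal; x-axis: direction from node_a to node_b projected
    onto the ground plane; origin: node_a. Expressing both graphs in such a
    frame lets node positions be compared directly.
    """
    z = ground_normal / np.linalg.norm(ground_normal)
    d = node_b - node_a
    x = d - (d @ z) * z                      # project onto the ground plane
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z], axis=1)          # columns are the frame axes
    return R, node_a                         # world-from-local rotation, origin

def to_local(p, R, origin):
    """Express a world point in the aligned local frame."""
    return R.T @ (p - origin)

R, origin = aligned_frame(np.array([0.0, 0.0, 1.0]),
                          np.array([1.0, 2.0, 0.0]), np.array([4.0, 2.0, 0.0]))
print(to_local(np.array([4.0, 2.0, 0.0]), R, origin))   # -> [3. 0. 0.]
```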
|
|
11:05-11:10, Paper ThAT18.8 | |
PlanarMesh: Building Compact 3D Meshes from LiDAR Using Incremental Adaptive Resolution Reconstruction |
|
Wang, Jiahao | University of Oxford |
Chebrolu, Nived | University of Oxford |
Tao, Yifu | University of Oxford |
Zhang, Lintong | University of Oxford |
Kim, Ayoung | Seoul National University |
Fallon, Maurice | University of Oxford |
Keywords: Mapping, SLAM
Abstract: Building an online 3D LiDAR mapping system that produces a detailed surface reconstruction while remaining computationally efficient is a challenging task. In this paper, we present PlanarMesh, a novel incremental, mesh-based LiDAR reconstruction system that adaptively adjusts mesh resolution to achieve compact, detailed reconstructions in real-time. It introduces a new representation, planar-mesh, which combines plane modeling and meshing to capture both large surfaces and detailed geometry. The planar-mesh can be incrementally updated considering both local surface curvature and free-space information from sensor measurements. We employ a multi-threaded architecture with a Bounding Volume Hierarchy (BVH) for efficient data storage and fast search operations, enabling real-time performance. Experimental results show that our method achieves reconstruction accuracy on par with, or exceeding, state-of-the-art techniques—including truncated signed distance functions, occupancy mapping, and voxel-based meshing—while producing smaller output file sizes (10 times smaller than raw input and more than 5 times smaller than mesh-based methods) and maintaining real-time performance (around 2 Hz for a 64-beam sensor).
|
|
ThAT19 |
210C |
Biologically-Inspired Robots 5 |
Regular Session |
Chair: Hughes, Josie | EPFL |
|
10:30-10:35, Paper ThAT19.1 | |
Eagle-Scale Flapping-Wing Robot with Aggressive Roll Maneuverability: Bio-Inspired Actuation, Fluid-Structure Interaction Simulation and Flight Experiment |
|
Wang, Haoyu | Harbin Institute of Technology, Shenzhen |
Gong, Zhenkun | School of Mechanical Engineering and Automation Harbin Institute |
Pan, Erzhen | Harbin Institute of Technology, Shenzhen |
Xu, Wenfu | Harbin Institute of Technology, Shenzhen |
Keywords: Biologically-Inspired Robots, Biomimetics, Aerial Systems: Mechanics and Control
Abstract: Large flapping-wing aerial vehicles (FWAVs) face dual challenges in aerodynamic and structural design, with long-standing technical bottlenecks, particularly in roll maneuvers. In this study, by reverse-engineering the biomechanical mechanisms of raptor flight, we propose a bio-inspired wing-shoulder torsional mechanism and successfully developed an eagle-inspired flapping-wing aerial vehicle with a wingspan of 1.87m and a takeoff weight of 1,260g. A nonlinear explicit dynamics-lattice Boltzmann fluid-structure interaction (FSI) numerical model was innovatively established, comprehensively revealing the interaction mechanism between unsteady flapping flow fields and flexible wing deformations. Numerical simulations demonstrate that at a cruising speed of 8 m/s, the proposed mechanism generates a high-purity roll torque of 3.3 N·m (with a residual yaw torque of 0.2 N·m, torque purity ratio 16.5:1), while lift and thrust losses are below 1.5%. Flight experiments validate the exceptional performance of this mechanism in 3D maneuvers: a 360° barrel roll is completed in 2.6 seconds (average roll rate 136°/s). This study provides a theoretical framework and technological prototype for next-generation bio-inspired aerial vehicles that integrate efficient cruising with high maneuverability, marking the first instance where FWAVs surpass traditional aircraft in specific 3D maneuverability metrics.
|
|
10:35-10:40, Paper ThAT19.2 | |
Design and Performance Study of an Underwater Soft Snake-Like Robot |
|
Ma, Huichen | Beijing Institute of Technology |
Zhou, Junjie | Beijing Institute of Technology |
Tan, Gavril Yong En | National University of Singapore |
Zhou, Xuanyi | National University of Singapore |
Zhang, Xinzhi | National University of Singapore |
Yeow, Chen-Hua | National University of Singapore |
Keywords: Biologically-Inspired Robots, Hydraulic/Pneumatic Actuators, Motion Control
Abstract: In this paper, we propose a design of an underwater soft snake-like robot prototype that uses two actuators made of 3D-printed soft materials to build the robot body. Control signals with appropriate displacement phases and different voltages are used to control the water pump to drive the soft actuator to bend to generate a sine wave with increasing amplitude from the head to the tail of the robot body. We test customized tail materials, phase shifts, and voltage growth rate signals to observe the effects of different parameters on the movement of the snake robot in water. Experiments show that the movement speed is positively correlated with the swing amplitude of the snake robot's motion module. In addition, measured data show that swimming efficiency and movement speed are also affected by tail flexibility and movement gait. When the phase offset is 2/3π, the tail is made of harder PLA material, and the voltage growth rate is 1.2, the maximum underwater movement speed achieved by the snake robot is 4.464 cm/s (0.076 BL/s). We also found that when the phase offset increases, the snake motion speed and motion efficiency first increase and then decrease. The results obtained in this study will aid in the advancement of soft, slender swimming robots and improve the understanding of the swimming capabilities of both robots and sea snakes.
|
|
10:40-10:45, Paper ThAT19.3 | |
A Novel Aerial-Aquatic Locomotion Robot with Variable Stiffness Propulsion Module |
|
Hu, Junzhe | CMU |
Chen, Pengyu | Xi'an Jiaotong University |
Tianxiang, Feng | University of Cincinnati |
Yuxuan, Wen | University of Cincinnati |
Wu, Ke | MBZUAI |
Dong, Janet | University of Cincinnati |
Keywords: Biologically-Inspired Robots, Actuation and Joint Mechanisms, Soft Robot Applications
Abstract: In recent years, the development of robots capable of operating in both aerial and aquatic environments has gained significant attention. This study presents the design and fabrication of a novel aerial-aquatic locomotion robot (AALR). Inspired by the diving beetle, the AALR incorporates a biomimetic propulsion mechanism with power and recovery strokes. The variable stiffness propulsion module (VSPM) uses low melting point alloy (LMPA) and variable stiffness joints (VSJ) to achieve efficient aquatic locomotion while reducing harm to marine life. The AALR's innovative design integrates the VSPM into the arms of a traditional quadrotor, allowing for effective aerial-aquatic locomotion. The VSPM adjusts joint stiffness through temperature control, meeting locomotion requirements in both aerial and aquatic modes. A dynamic model for the VSPM was developed, with optimized dimensional parameters to increase propulsion force. Experiments focused on aquatic mode analysis and demonstrated the AALR's swimming capability, achieving a maximum swimming speed of 77 mm/s underwater. The results confirm the AALR's effective performance in water environment, highlighting its potential for versatile, eco-friendly operations.
|
|
10:45-10:50, Paper ThAT19.4 | |
SkB-Hand: A Skeleton Bionic Hand with Dual-Tendon for General Purpose Robotic Grasping Tasks |
|
Yang, Duan-Hong | National Taiwan University |
Nguyen, Dai-Dong | National Taiwan University |
Chuang, Ming-Yang | National Taiwan University |
Kuo, Yu-Cheng | National Taiwan University |
Kuo, Chung-Hsien | National Taiwan University |
Keywords: Biologically-Inspired Robots, Multifingered Hands
Abstract: The development of bionic hands is a crucial area of robotics research, aiming to achieve human-like dexterity, adaptability, and efficiency in manipulation tasks. This study presents the Skeleton Bionic Hand (SkB-Hand), which features a dual elastic-tendon mechanism for the extensor and volar plate within a single finger. The skeleton structure ensures the SkB-Hand is lightweight (< 600g) and low-cost (< 150USD) while maintaining performance in various grasping tasks. We evaluated the SkB-Hand through experiments such as the Kapandji test, GRASP Taxonomy, and dynamic scenarios. Results showed the SkB-Hand scored 7/10 in the Kapandji test and 33/33 in the GRASP Taxonomy test. The hand demonstrated resistance to deformation under external forces while ensuring flexibility. Integrated into the Techman Robot (TM-Robot), the SkB-Hand demonstrates its potential for general robotic grasping tasks in daily life.
|
|
10:50-10:55, Paper ThAT19.5 | |
High-Fidelity Model and Nonlinear Model Predictive Control for Flip Maneuvers of Tailless Flapping-Wing Robots |
|
Guo, Qingcheng | Shanghai Jiaotong University |
Wu, Chaofeng | Shanghai Jiao Tong University |
Lu, Junguo | Shanghai Jiaotong University |
Hughes, Josie | EPFL |
Keywords: Biologically-Inspired Robots, Aerial Systems: Mechanics and Control, Micro/Nano Robots
Abstract: Insects and hummingbirds exhibit remarkable agility, including full body flip maneuvers. Achieving similar performance in bio-inspired tailless flapping-wing robots (FWRs) is challenging due to the complex dynamics, inherent nonlinearities and control issues. This paper presents a nonlinear model predictive control (NMPC) algorithm designed to enable 360-degree flip maneuvers for the developed X-wing tailless FWR, which weighs 30.8 g and has a wingspan of 14.5 cm. We first introduce a high-fidelity model of the FWR, which incorporates the aerodynamics of the wings, the dynamics of the motors and servos, the body kinodynamic model, and the model of thrust and torque generation. This high-fidelity model allows for testing the FWR in simulation, thus reducing the cost of real-world flip-maneuver experiments. Based on this high-fidelity model, we propose an NMPC controller to compute state trajectories and corresponding control inputs offline, which are then used as feedforward control for the FWR during its 360-degree flip maneuvers. Next, we present an online basic feedback controller that integrates the feedforward control for the FWR's flip control. Experimental results demonstrate the successful execution of the flip maneuvers without any mechanical modifications, highlighting the effectiveness of the proposed control strategy.
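Combining an offline-optimized feedforward trajectory with a simple online feedback correction, as described above, typically looks like the following; the interpolation scheme, the gain matrix K, and the toy shapes are assumptions for illustration, not the controller used in the paper.

```python
import numpy as np

def flip_control(t, x, ff_times, ff_states, ff_inputs, K):
    """Feedforward + feedback command during a pre-computed maneuver.

    ff_times / ff_states / ff_inputs come from an offline trajectory
    optimiser; K is a feedback gain matrix applied to the tracking error.
    """
    u_ff = np.array([np.interp(t, ff_times, ff_inputs[:, i])
                     for i in range(ff_inputs.shape[1])])
    x_ref = np.array([np.interp(t, ff_times, ff_states[:, i])
                      for i in range(ff_states.shape[1])])
    return u_ff + K @ (x_ref - x)            # feedforward plus state feedback

# toy shapes: 2 states, 1 input, 3 knot points
times = np.array([0.0, 1.0, 2.0])
states = np.array([[0.0, 0.0], [3.14, 2.0], [6.28, 0.0]])
inputs = np.array([[0.1], [0.5], [0.1]])
print(flip_control(0.5, np.array([1.0, 0.5]), times, states, inputs,
                   K=np.array([[0.2, 0.05]])))
```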
|
|
10:55-11:00, Paper ThAT19.6 | |
Collective Motion of Magnetic Soft Cilia Controls Droplets and Nanozyme Catalysis |
|
Lyashchuk, Victoriia | ITMO University |
Chilikina, Alena | ITMO |
Saban, Evgeniia | ITMO |
Kladko, Daniil | ITMO University |
Keywords: Biologically-Inspired Robots, Biomimetics, Soft Sensors and Actuators
Abstract: Bioinspired magnetic cilia have attracted tremendous attention from researchers due to their flexible nature and remotely controlled manipulation of droplets and fluids. Nonetheless, controlling catalytic processes with magnetic cilia remains underexplored. Here, we present magnetic soft cilia carpets with different magnetizations to control enzyme-like chemical reactions. We demonstrate a methodology to optimize material, geometric, and magnetic field parameters to achieve high-frequency oscillations of the cilia under an alternating magnetic field at 15 Hz. We also show stable oscillation of water-based droplets on the cilia surface at a magnetic field frequency of 10 Hz without droplet detachment. Furthermore, we demonstrate enhanced droplet catalysis resulting from droplet oscillation on the cilia surface. As a concept for automated lab analysis, we demonstrate droplet transport by magnetic cilia onto a flexible pH sensor. Finally, as a proof-of-concept, we show control of nanozyme reaction rates by varying cilia magnetization angles, achieving a fourfold enhancement of the reaction rate under a rotating magnetic field compared to unmagnetized cilia. These findings offer a promising approach to remotely control enzyme-like reactions and droplet catalysis using magnetic cilia.
|
|
11:00-11:05, Paper ThAT19.7 | |
Online Brain-Inspired Adaptive Learning Control for Nonlinear Systems with Configuration Uncertainties (I) |
|
Zhang, Yanhui | Zhejiang University |
Dai, YiMing | Zhejiang University |
Chaoran, Wang | Zhejiang University |
Chen, Yong | Southwest Jiaotong University |
Zhao, Wenwen | Zhejiang University |
Chen, Weifang | Zhejiang University |
Keywords: Aerial Systems: Applications, Bioinspired Robot Learning, Machine Learning for Robot Control
Abstract: This paper develops a real-time brain-inspired learning control (RBiLC) strategy for adaptive tracking of quadcopters subject to nonlinear configuration uncertainties and time-varying uncertain dynamics. An online learning-evaluation-optimization mechanism addresses flight control law reconfiguration during unforeseen structural changes (e.g., propeller/motor faults). The RBiLC-based adaptive controller reduces design time and human intervention. Lyapunov-Krasovskii functions ensure stability under parametric uncertainties, while a signed sinusoidal perturbation estimate guides online learning direction and magnitude. Theoretical stability analysis under dynamic uncertainties proves the system converges to a compact set within finite time. Simulations and flight tests confirm the scheme achieves enhanced tracking performance and accelerated adaptation compared to existing methods.
|
|
11:05-11:10, Paper ThAT19.8 | |
Neural Signatures and Decoding of the Various Cognitive Processes Elicited by the Same Stimulus |
|
Ju, Jiawei | Lingang Laboratory |
Zou, Yongjie | Lingang Laboratory |
Keywords: Brain-Machine Interfaces, Cognitive Control Architectures, Intention Recognition
Abstract: Multi-task decoding from electroencephalogram (EEG) signals is of great value for brain-computer interaction (BCI) applications in natural scenes. Although most existing studies have concentrated on decoding significantly different tasks, only a few have explored the various cognitive processes that individuals may exhibit when elicited by the same stimulus. However, in practice, the diversity and complexity of individuals' cognitive responses when faced with the same stimulus cannot be ignored. In this paper, we aimed to construct a paradigm of the various cognitive processes elicited by the same stimulus, explore the neural signatures, and decode the multiple cognitive processes from EEG signals. Experimental results show that the regularized linear discriminant analysis (RLDA) classifier with event-related spectral perturbation (ERSP) features yielded a decoding accuracy of 96.30% ± 3.40% for the multiple cognitive processes. In-depth research on the signatures and decoding of various cognitive processes elicited by the same stimulus is of great significance for improving the naturalness and intelligence of BCIs.
|
|
ThAT20 |
210D |
Perception for Grasping and Manipulation 1 |
Regular Session |
|
10:30-10:35, Paper ThAT20.1 | |
Visual-Tactile Perception Based Control Strategy for Complex Robot Peg-In-Hole Process Via Topological and Geometric Reasoning |
|
Wang, Gaozhao | Northwestern Polytechnical University |
Liu, Xing | Northwestern Polytechnical University |
Liu, Zhengxiong | Northwestern Polytechnical University |
Huang, Panfeng | Northwestern Polytechnical University |
Yang, Yang | Northwestern Polytechnical University, Research Center of Intell |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, Contact Modeling
Abstract: Peg-hole-insertion processes of diverse shapes are typical contact-rich tasks, which require an accurate representation of the object's shape, pose, and peg-hole contact states. The visual-tactile sensor can perceive the relative moving trend between the gripper and the grasped object, which could be applied to the perception of the peg-hole contact states. In order to complete peg-hole insertion tasks, this manuscript proposes a method of using the visual-tactile sensor to estimate the relative position of the peg and hole. Furthermore, it introduces the theory of topological and geometric reasoning to characterize the insertion process, which can be used for various polygon-shaped pegs and holes. In experiments with five different shapes of pegs and holes, the errors of peg-hole relative position estimation using the proposed method are within about 5 degrees, which meets the needs of insertion tasks. Moreover, insertion processes become smoother when the topological and geometric reasoning is adopted, indicating the effectiveness of the reasoning process.
|
|
10:35-10:40, Paper ThAT20.2 | |
Neural Collision Detection for Constrained Grasp Pose Optimization in Cluttered Environments |
|
Lin, Longyuan | Fuzhou University |
Zhu, Weiwei | Fuzhou University |
Zhuang, Yixin | Fuzhou University |
Zheng, Qinghai | Fuzhou University |
Yu, Yuanlong | Fuzhou University |
Keywords: Perception for Grasping and Manipulation, Collision Avoidance, Deep Learning for Visual Perception
Abstract: Robust robotic grasping in cluttered environments presents a significant challenge, as existing methods often neglect the complex interactions between the gripper, objects, and obstacles, leading to collisions and grasping failures. To address this, we propose a framework that integrates collision avoidance as a core constraint within the grasp pose optimization process. Central to this framework is a Neural Collision Detection (NCD) network that takes scene configurations and grasp poses as inputs, producing a collision score that approximates traditional collision detection functions. The NCD network provides critical feedback for refining grasp predictions and demonstrates strong generalization across diverse environments, facilitating efficient collision detection and constrained grasp pose optimization. Additionally, we incorporate frictional force closure, geometric symmetry, and surface alignment as regularization terms within the optimization function, enhancing the physical stability and geometric plausibility of the generated grasps. Extensive experiments conducted in real-world environments show a significant improvement in grasp success rates, with robust generalization to previously unseen objects and scenarios. These results validate the efficacy of our framework, highlighting its potential for enabling reliable robotic manipulation in complex and cluttered environments.
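Constrained grasp pose optimization of this kind can be pictured as gradient descent on a pose under a learned collision score plus weighted regularizers. The sketch below uses placeholder differentiable terms in place of the NCD network and the force-closure, symmetry, and surface-alignment penalties; every name, weight, and the toy objective are illustrative assumptions.

```python
import torch

def optimize_grasp(pose_init, collision_score, regularizers, weights,
                   steps=100, lr=1e-2):
    """Gradient-based grasp pose refinement with a learned collision term.

    collision_score(pose) -> scalar tensor approximating a collision check
    regularizers          -> list of differentiable penalty functions
    """
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        loss = collision_score(pose)
        for w, reg in zip(weights, regularizers):
            loss = loss + w * reg(pose)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach()

# toy usage: pull a 6-DoF pose vector toward a "collision-free" target
target = torch.tensor([0.2, 0.0, 0.3, 0.0, 0.0, 0.0])
collision = lambda p: ((p[:3] - target[:3]) ** 2).sum()   # stand-in for the NCD net
reg = [lambda p: (p[3:] ** 2).sum()]                      # stand-in regularizer
print(optimize_grasp(torch.zeros(6), collision, reg, [0.1]))
```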
|
|
10:40-10:45, Paper ThAT20.3 | |
Tactile-Based Force Estimation for Interaction Control with Robot Fingers |
|
Chelly, Elie | Sorbonne Université - Institut Des Systèmes Intelligents Et Rob |
Cherubini, Andrea | LS2N - Ecole Centrale Nantes |
Fraisse, Philippe | LIRMM |
Ben Amar, Faiz | Université Pierre Et Marie Curie, Paris 6 |
Khoramshahi, Mahdi | Sorbonne University |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, Calibration and Identification
Abstract: Fine dexterous manipulation requires reactive control based on rich sensing of manipulator-object interactions. Tactile sensing arrays provide rich contact information across the manipulator's surface. However, their implementation faces two main challenges: accurate force estimation across complex surfaces like robotic hands, and integration of these estimates into reactive control loops. We present a data-efficient calibration method that enables rapid, full-array force estimation across varying geometries, providing online feedback that accounts for non-linearities and deformation effects. Our force estimation model serves as feedback in an online closed-loop control system for interaction force tracking. The accuracy of our estimates is independently validated against measurements from a calibrated force-torque sensor. Using the Allegro Hand equipped with Xela uSkin sensors, we demonstrate precise force application through an admittance control loop running at 100 Hz, achieving an error margin of 0.12 ± 0.08 N—results that show promising potential for dexterous manipulation.
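A 1-DoF admittance law of the general form used for interaction-force tracking is sketched below; the gains, the 100 Hz time step, and the crude contact-stiffness model in the rollout are placeholders, not the values or plant used in the paper.

```python
def admittance_step(v, f_measured, f_desired, M=0.5, D=8.0, dt=0.01):
    """One step of a 1-DoF admittance law for interaction-force tracking.

    M * dv/dt + D * v = f_desired - f_measured
    v > 0 means moving the fingertip further into the contact; at 100 Hz,
    dt = 0.01 s.
    """
    f_err = f_desired - f_measured
    dv = (f_err - D * v) / M
    return v + dv * dt

# toy rollout: press against a 50 N/m "contact spring" until force reaches 1 N
v, f = 0.0, 0.0
for _ in range(200):
    v = admittance_step(v, f_measured=f, f_desired=1.0)
    f = max(0.0, f + 50.0 * v * 0.01)        # crude contact-stiffness model
print(round(f, 2))                            # converges near the 1 N target
```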
|
|
10:45-10:50, Paper ThAT20.4 | |
GraspMamba: A Mamba-Based Language-Driven Grasp Detection Framework with Hierarchical Feature Learning |
|
Hoang Nguyen, Huy | Austrian Institute of Technology |
Vuong, An Dinh | MBZUAI |
Nguyen, Anh | University of Liverpool |
Reid, Ian | University of Adelaide |
Vu, Minh Nhat | TU Wien, Austria |
Keywords: Perception for Grasping and Manipulation, Grasping, Computer Vision for Automation
Abstract: Grasp detection is a fundamental robotic task critical to the success of many industrial applications. However, current language-driven models for this task often struggle with cluttered images, lengthy textual descriptions, or slow inference speed. We introduce GraspMamba, a new language-driven grasp detection method that employs hierarchical feature fusion with Mamba vision to tackle these challenges. By leveraging rich visual features of the Mamba-based backbone alongside textual information, our approach effectively enhances the fusion of multimodal features. GraspMamba represents the first Mamba-based grasp detection model to extract vision and language features at multiple scales, delivering robust performance and rapid inference time. Intensive experiments show that GraspMamba outperforms recent methods by a clear margin. We validate our approach through real-world robotic experiments, highlighting its fast inference speed.
|
|
10:50-10:55, Paper ThAT20.5 | |
SR3D: Unleashing Single-View 3D Reconstruction for Transparent and Specular Object Grasping |
|
Zhang, Mingxu | Beijing University of Posts and Telecommunications |
Li, Xiaoqi | Peking University |
Xu, Jiahui | Beijing University of Posts and Telecommunications |
Bae, Hojin | Peking University |
Xiong, Chuyan | Institution of Computer Technology, Chinese Academy of Science |
Shen, Yan | Peking University |
Zhou, Kaichen | University of Oxford |
Dong, Hao | Peking University |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Grasping
Abstract: Recent advancements in 3D robotic manipulation have improved grasping of everyday objects, but transparent and specular materials remain challenging due to depth sensing limitations. While several 3D reconstruction and depth completion approaches address these challenges, they suffer from setup complexity or limited use of the available observations. To address this, leveraging the power of single-view 3D object reconstruction approaches, we propose a training-free framework, SR3D, that enables robotic grasping of transparent and specular objects from a single-view observation. Specifically, given single-view RGB and depth images, SR3D first uses external visual models to generate a 3D reconstructed object mesh from the RGB image. Then, the key idea is to determine the 3D object's pose and scale to accurately localize the reconstructed object back into its original depth-corrupted 3D scene. Therefore, we propose view matching and keypoint matching mechanisms, which leverage the inherent semantic and geometric information of both the 2D and 3D observations to determine the object's 3D state within the scene, thereby reconstructing an accurate 3D depth map for effective grasp detection. Experiments in both simulation and the real world show the reconstruction effectiveness of SR3D.
|
|
10:55-11:00, Paper ThAT20.6 | |
JENGA: Object Selection and Pose Estimation for Robotic Grasping from a Stack |
|
Jeevanandam, Sai Srinivas | German Research Center for Artificial Intelligence (DFKI) |
Inuganti, Sandeep | RPTU Kaiserslautern |
Govil, Shreedhar | German Research Center for Artificial Intelligence |
Stricker, Didier | German Research Center for Artificial Intelligence |
Rambach, Jason | German Research Center for Artificial Intelligence |
Keywords: Perception for Grasping and Manipulation, Data Sets for Robotic Vision, Robotics and Automation in Construction
Abstract: Vision-based robotic object grasping is typically investigated in the context of isolated objects or unstructured object sets in bin picking scenarios. However, there are several settings, such as construction or warehouse automation, where a robot needs to interact with a structured object formation such as a stack. In this context, we define the problem of selecting suitable objects for grasping along with estimating an accurate 6DoF pose of these objects. To address this problem, we propose a camera-IMU based approach that prioritizes unobstructed objects on the higher layers of stacks and introduce a dataset for benchmarking and evaluation, along with a suitable evaluation metric that combines object selection with pose accuracy. Experimental results show that although our method can perform quite well, this is a challenging problem if an error-free solution is needed. Finally, we show results from the deployment of our method for a brick-picking application in a construction scenario.
|
|
11:00-11:05, Paper ThAT20.7 | |
Sequential Multi-Object Grasping with One Dexterous Hand |
|
He, Sicheng | University of Southern California |
Shangguan, Zeyu | University of Southern California |
Wang, Kuanning | Fudan University |
Gu, Yongchong | Fudan University |
Fu, Yuqian | INSAIT |
Fu, Yanwei | Fudan University |
Seita, Daniel | University of Southern California |
Keywords: Perception for Grasping and Manipulation, Multifingered Hands, AI-Based Methods
Abstract: Sequentially grasping multiple objects with multi-fingered hands is common in daily life, where humans can fully leverage the dexterity of their hands to enclose multiple objects. However, the diversity of object geometries and the complex contact interactions required for high-DOF hands to grasp one object while enclosing another make sequential multi-object grasping challenging for robots. In this paper, we propose SeqMultiGrasp, a system for sequentially grasping objects with a four-fingered Allegro Hand. We focus on sequentially grasping two objects, ensuring that the hand fully encloses one object before lifting it, and then grasps the second object without dropping the first. Our system first synthesizes single-object grasp candidates, where each grasp is constrained to use only a subset of the hand's links. These grasps are then validated in a physics simulator to ensure stability and feasibility. Next, we merge the validated single-object grasp poses to construct multi-object grasp configurations. For deployment, we train a diffusion model conditioned on point clouds to propose grasp poses, followed by a heuristic-based execution strategy for real-world grasping. We test our system using 8×8 object combinations in simulation and 6×3 object combinations in the real world. Our diffusion-based grasp model obtains an average success rate of 65.8% over 1,600 simulation trials and 56.7% over 90 real-world trials, suggesting that it is a promising approach for sequential multi-object grasping with multi-fingered hands. Supplementary material is available on our project website: https://hesic73.github.io/SeqMultiGrasp.
|
|
ThAT21 |
101 |
Machine Learning for Robot Control 1 |
Regular Session |
Chair: Lee, Dongheui | Technische Universität Wien (TU Wien) |
Co-Chair: Wan, Weiwei | Osaka University |
|
10:30-10:35, Paper ThAT21.1 | |
DL-Clip: Online D-Learning with Clipping Operation for Fast Model-Free Stabilizing Control |
|
Liu, Jingxuan | Beihang University |
Wang, Chenyu | Beihang University |
Shen, Zhaolong | Beihang University |
Quan, Quan | Beihang University |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Visual Servoing
Abstract: In this paper, we present DL-Clip, an innovative online learning approach for nonlinear stabilizing control that operates without prior knowledge of system dynamics or reward signals, while significantly improving training efficiency. DL-Clip introduces a novel integration of stabilizing control with efficient Reinforcement Learning (RL) training mechanisms. The algorithm uses Lyapunov functions to ensure system stability and employs clipping operations to optimize policy updates, achieving faster convergence. We evaluate the effectiveness of DL-Clip through experiments, including simulations of the inverted pendulum and the Image-Based Visual Servoing (IBVS) for multicopter position stabilization. In addition, we validate the approach through a real flight experiment based on the IBVS problem, demonstrating its practical applicability.
|
|
10:35-10:40, Paper ThAT21.2 | |
Celebi's Choice: Causality-Guided Skill Optimisation for Granular Manipulation Via Differentiable Simulation |
|
Wei, Minglun | Cardiff University |
Yang, Xintong | Cardiff University |
Yan, Junyu | University of Edinburgh |
Lai, Yu-Kun | Cardiff University |
Ji, Ze | Cardiff University |
Keywords: Machine Learning for Robot Control, AI-Based Methods, Deep Learning in Grasping and Manipulation
Abstract: Robotic soil manipulation is essential for automated farming, particularly in excavation and levelling tasks. However, the nonlinear dynamics of granular materials challenge traditional control methods, limiting stability and efficiency. We propose Celebi, a causality-enhanced optimisation method that integrates differentiable physics simulation with adaptive step-size adjustments based on causal inference. To enable gradient-based optimisation, we construct a differentiable simulation environment for granular material interactions. We further define skill parameters with a differentiable mapping to end-effector motions, facilitating efficient trajectory optimisation. By modelling causal effects between task-relevant features extracted from point cloud observations and skill parameters, Celebi selectively adjusts update step sizes to enhance optimisation stability and convergence efficiency. Experiments in both simulated and real-world environments validate Celebi’s effectiveness, demonstrating robust and reliable performance in robotic excavation and levelling tasks.
|
|
10:40-10:45, Paper ThAT21.3 | |
Dynamics As Prompts: In-Context Learning for Sim-To-Real System Identifications |
|
Zhang, Xilun | Stanford University |
Liu, Shiqi | Carnegie Mellon University |
Huang, Peide | Apple Inc |
Han, William | Carnegie Mellon University |
Lyu, Yiqi | Carnegie Mellon University |
Xu, Mengdi | Stanford University |
Zhao, Ding | Carnegie Mellon University |
Keywords: Machine Learning for Robot Control, AI-Based Methods, Reinforcement Learning
Abstract: Sim-to-real transfer remains a significant challenge in robotics due to the discrepancies between simulated and real-world dynamics. Traditional methods like Domain Randomization often fail to capture fine-grained dynamics, limiting their effectiveness for precise control tasks. In this work, we propose a novel approach that dynamically adjusts simulation environment parameters online using in-context learning. By leveraging past interaction histories as context, our method adapts the simulation environment dynamics to real-world dynamics without requiring gradient updates, resulting in faster and more accurate alignment between simulated and real-world performance. We validate our approach across two tasks: object scooping and table air hockey. In the sim-to-sim evaluations, our method significantly outperforms the baselines on environment parameter estimation by 80% and 42% in the object scooping and table air hockey setups, respectively. Furthermore, our method achieves at least 70% success rate in sim-to-real transfer on object scooping across three different objects. By incorporating historical interaction data, our approach delivers efficient and smooth system identification, advancing the deployment of robots in dynamic real-world scenarios. Demos are available on our project page: https://sim2real-capture.github.io/
|
|
10:45-10:50, Paper ThAT21.4 | |
A Passivity-Based Approach for Variable Stiffness Control with Dynamical Systems (I) |
|
Michel, Youssef | Technical University of Munich |
Saveriano, Matteo | University of Trento |
Lee, Dongheui | Technische Universität Wien (TU Wien) |
Keywords: Machine Learning for Robot Control, Compliance and Impedance Control, Learning from Demonstration
Abstract: In this paper, we present a controller that combines motion generation and control in one loop, to endow robots with reactivity and safety. In particular, we propose a control approach that enables to follow the motion plan of a first order Dynamical System (DS) with a variable stiffness profile, in a closed loop configuration where the controller is always aware of the current robot state. This allows the robot to follow a desired path with an interactive behavior dictated by the desired stiffness. We also present two solutions to enable a robot to follow the desired velocity profile, in a manner similar to trajectory tracking controllers, while maintaining the closed-loop configuration. Additionally, we exploit the concept of energy tanks in order to guarantee the passivity during interactions with the environment, as well as the asymptotic stability in free motion, of our closed-loop system. The developed approach is evaluated extensively in simulation, as well as in real robot experiments, in terms of performance and safety both in free motion and during the execution of physical interaction tasks. Note to Practitioners—The approach presented in this work allows for safe and reactive robot motions, as well as the capacity to shape the robot’s physical behavior during interactions. This becomes crucial for performing contact tasks that might require adaptability or for interactions with humans as in shared control or collaborative tasks. Furthermore, the reactive properties of our controller make it adequate for robots that operate in proximity to humans or in dynamic environments where potential collisions are likely to happen.
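The energy-tank mechanism mentioned above is commonly implemented as a scalar storage that is drained by the power the controller injects into the robot and that gates non-passive actions when nearly empty; the constants, the on/off gating, and the simple cap below are simplifications of the paper's formulation, for illustration only.

```python
def tank_update(E, p_ctrl, dt, E_min=0.1, E_max=10.0):
    """Update an energy-tank state and gate non-passive control actions.

    E      : current tank energy
    p_ctrl : power the variable-stiffness/DS controller would inject into
             the robot (positive = energy flowing out of the tank)
    Returns the new tank level and a scaling factor in [0, 1]; when the tank
    is nearly empty the energy-injecting part of the command is switched off.
    """
    alpha = 1.0 if E > E_min else 0.0      # simple on/off gate
    E = E - alpha * p_ctrl * dt            # drain (or refill if p_ctrl < 0)
    E = min(E, E_max)                      # cap the stored energy
    return E, alpha

E, alpha = tank_update(E=5.0, p_ctrl=2.0, dt=0.01)
print(E, alpha)                             # -> 4.98 1.0
```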
|
|
10:50-10:55, Paper ThAT21.5 | |
Disentangled Object-Centric Image Representation for Robotic Manipulation |
|
Emukpere, David | Naver Labs Europe |
Deffayet, Romain | Naver Labs Europe |
Wu, Bingbing | Naver Labs Europe |
Brégier, Romain | NAVER LABS Europe |
Niemaz, Michael | Naver Labs Europe |
Meunier, Jean-Luc | Naver Labs Europe |
Proux, Denys | Naver Labs Europe |
Renders, Jean-Michel | Naver Labs Europe |
Kim, Seungsu | Naver Labs Europe |
Keywords: Machine Learning for Robot Control, Deep Learning in Grasping and Manipulation, Reinforcement Learning
Abstract: Learning robotic manipulation skills from vision is a promising approach for developing robotics applications that can generalize broadly to real-world scenarios. As such, many approaches to enable this vision have been explored with fruitful results. Particularly, object-centric representation methods have been shown to provide better inductive biases for skill learning, leading to improved performance and generalization. Nonetheless, we show that object-centric methods can struggle to learn simple manipulation skills in multi-object environments. Thus, we propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments and generalizes at test time to changing objects of interest and distractors in the scene. Furthermore, we show its efficacy both in simulation and zero-shot transfer to the real world.
|
|
10:55-11:00, Paper ThAT21.6 | |
Meta-Reinforcement Learning with Evolving Gradient Regularization |
|
Chen, Jiaxing | National University of Defense Technology |
Ma, Ao | National University of Defence Technology |
Chen, Shaofei | National University of Defense Technology |
Yuan, Weilin | National University of Defense Technology |
Hu, Zhenzhen | National University of Defense Technology |
Li, Peng | National University of Defence Technology |
Keywords: Machine Learning for Robot Control, Evolutionary Robotics, Reinforcement Learning
Abstract: Deep reinforcement learning (DRL) typically requires reinitializing training for new tasks, limiting its generalization due to isolated knowledge transfer. Meta-reinforcement learning (Meta-RL) addresses this by enabling rapid adaptation through prior task experiences, yet existing gradient-based methods like MAML suffer from poor out-of-distribution performance due to overfitting narrow task distributions. To overcome this limitation, we propose Evolving Gradient Regularization MAML (ER-MAML). By integrating evolving gradient regularization into the MAML framework, ER-MAML optimizes meta-gradients while constraining adaptation directions via a regularization policy. This dual mechanism prevents overparameterization and enhances robustness across diverse task distributions. Experiments demonstrate ER-MAML outperforms state-of-the-art baselines by 14.6% in out-of-distribution success rates. It also achieves strong online adaptation performance in the MetaWorld benchmark. These results validate ER-MAML's effectiveness in improving meta-RL generalization under distribution shifts.
|
|
11:00-11:05, Paper ThAT21.7 | |
Preference Aligned Diffusion Planner for Quadrupedal Locomotion Control |
|
Yuan, Xinyi | Osaka University |
Shang, Zhiwei | The Hong Kong University of Science and Technology (Guangzhou) |
Wang, Zifan | The Hong Kong University of Science and Technology (Guangzhou) |
Wang, Chenkai | Southern University of Science and Technology |
Shan, Zhao | Tsinghua University |
Zhu, Meixin | Hong Kong University of Science and Technology (Guangzhou) |
Bai, Chenjia | Institute of Artificial Intelligence (TeleAI), China Telecom |
Wan, Weiwei | Osaka University |
Harada, Kensuke | Osaka University |
Li, Xuelong | Northwestern Polytechnical University |
Keywords: Machine Learning for Robot Control, Legged Robots, AI-Based Methods
Abstract: Diffusion models demonstrate superior performance in capturing complex distributions from large-scale datasets, providing a promising solution for quadrupedal locomotion control. However, the robustness of the diffusion planner is inherently dependent on the diversity of the pre-collected dataset. To mitigate this issue, we propose a two-stage learning framework to enhance the capability of the diffusion planner under a limited, reward-agnostic dataset. In the offline stage, the diffusion planner learns the joint distribution of state-action sequences from expert datasets without using reward labels. Subsequently, we perform online interaction in the simulation environment with the trained offline planner, which significantly diversifies the original behaviors and thus improves robustness. Specifically, we propose a novel weak preference labeling method that requires neither ground-truth rewards nor human preferences. The proposed method exhibits superior stability and velocity-tracking accuracy in pacing, trotting, and bounding gaits under different speeds and can perform zero-shot transfer to real Unitree Go1 robots.
|
|
11:05-11:10, Paper ThAT21.8 | |
BeeTLe: Blind Terrain-Aware Learned Locomotion |
|
Fransen, Rogier | University of Surrey |
Bowden, Richard | University of Surrey |
Hadfield, Simon | University of Surrey |
Keywords: Machine Learning for Robot Control, Legged Robots, Incremental Learning
Abstract: One of the largest challenges in deploying legged robots in the real world is deriving effective general gaits. In this paper, we present BeeTLe, a framework that enables terrain-aware locomotion without dedicated terrain sensors. BeeTLe is realised as a multi-expert-policy Reinforcement Learning (RL) algorithm, which enables multiple gaits, applicable to different surface types, to be stored and shared in a single policy. Sensor-free terrain awareness is incorporated using a Recurrent Neural Network (RNN) that infers the surface type purely from actuator positions over time; the RNN identifies the terrain with 94% accuracy out of 8 possible options. We demonstrate that BeeTLe outperforms the baselines across a series of challenges including the traversal of a flat plane, a tilted plane, a sequence of tilted planes, and geometry modelling a natural hilly terrain, despite not seeing the sequence of tilted planes and the natural hilly terrain during training.
|
|
ThAT22 |
102A |
Dual Arm Manipulation 1 |
Regular Session |
Co-Chair: Kudoh, Shunsuke | The University of Electro-Communications |
|
10:30-10:35, Paper ThAT22.1 | |
A Dual-Arm Shared Control Framework Integrating Sub-Goals and Predicted Trajectories for Asymmetric Tasks |
|
Wang, Zhixiong | School of Electrical Engineering, Guangxi University |
Li, Shaodong | Guangxi Key Laboratory of Intelligent Control and Maintenance Of |
Shuang, Feng | Guangxi University |
Gao, Fang | Guangxi University |
Keywords: Dual Arm Manipulation, Human-Robot Collaboration, Task and Motion Planning
Abstract: In robotic operation, asymmetric tasks requiring dual-arm cooperation are a highly challenging research direction. Autonomous operation generally has a low success rate or poor generalization because of its excessive dependence on the accuracy of the sub-goals of asymmetric tasks. Although teleoperation can significantly improve these aspects, during operation the operators are prone to neglect crucial intermediate sub-goals that are conducive to fine-grained dual-arm cooperation. Therefore, we propose a dual-arm shared control framework that first introduces a Sub-goal Generation module to concentrate on the intermediate states, improving fine-grained dual-arm cooperation and reducing the amount of adjustment during asymmetric task operation. We also integrate a Trajectory Prediction module that computes the future trajectory from historical movement information to enhance robot motion smoothness. Finally, by dynamically combining the Sub-goal Generation module, the Trajectory Prediction module, and the operator's movement in the shared control framework, we effectively decrease the sensitivity to sub-goal accuracy and thus significantly improve the success rate. In simulation, we conduct comparative experiments against autonomous operation and teleoperation on four common asymmetric tasks to validate the advantages of our shared control framework, and an ablation study verifies the effect of each element. The shared control framework can also be applied in real-world scenarios.
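As a rough illustration of combining the three signals named above (operator command, predicted trajectory, generated sub-goal), the sketch below uses a fixed convex weighting; the paper's dynamic combination rule is not reproduced, and all names and weights are hypothetical.

    import numpy as np

    def blend_command(op_cmd, predicted_wp, subgoal, w_op=0.5, w_pred=0.3, w_goal=0.2):
        """Convex combination of three Cartesian targets (3-vectors).

        op_cmd       : operator's teleoperated target
        predicted_wp : next waypoint from a trajectory-prediction module
        subgoal      : intermediate sub-goal proposed by a sub-goal generator
        """
        w = np.array([w_op, w_pred, w_goal], dtype=float)
        w /= w.sum()                      # keep the blend convex
        targets = np.vstack([op_cmd, predicted_wp, subgoal])
        return w @ targets

    print(blend_command(np.array([0.40, 0.10, 0.30]),
                        np.array([0.42, 0.12, 0.31]),
                        np.array([0.45, 0.15, 0.30])))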
|
|
10:35-10:40, Paper ThAT22.2 | |
A Planning Framework for Stable Robust Multi-Contact Manipulation |
|
Yang, Lin | Nanyang Technological University |
Turlapati, Sri Harsha | Nanyang Technological University |
Lu, Zhuoyi | Nanyang Technological University |
Lv, Chen | Nanyang Technological University |
Campolo, Domenico | Nanyang Technological University |
Keywords: Dual Arm Manipulation, Compliant Assembly, Compliance and Impedance Control
Abstract: Modeling multi-contact manipulation as a quasi-static mechanical process transitioning between different contact equilibria, we propose formulating it as a planning and optimization problem, explicitly evaluating (i) contact stability and (ii) robustness to sensor noise. Specifically, we conduct a comprehensive study on multi-manipulator control strategies, focusing on dual-arm execution in a planar peg-in-hole task and extending it to the Multi-Manipulator Multiple Peg-in-Hole (MMPiH) problem to explore increased task complexity. Our framework employs Dynamic Movement Primitives (DMPs) to parameterize desired trajectories and Black-Box Optimization (BBO) with a comprehensive cost function incorporating friction cone constraints, squeeze forces, and stability considerations. By integrating parallel scenario training, we enhance the robustness of the learned policies. To evaluate the friction cone cost in experiments, we test the optimal trajectories computed for various contact surfaces, i.e., with different coefficients of friction. The stability cost is analytically explained and its necessity is tested in simulation. The robustness is quantified through variations of hole pose and chamfer size in simulation and experiment. Results demonstrate that our approach achieves consistently high success rates in both the single peg-in-hole and multiple peg-in-hole tasks, confirming its effectiveness and generalizability. The video can be found at https://youtu.be/IU0pdnSd4tE.
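The friction cone term in the cost function above penalizes contact forces whose tangential component exceeds the Coulomb limit. Below is a minimal sketch of such a penalty, assuming known contact normals and a fixed friction coefficient; it is not the paper's full cost function, and the sample forces are made up.

    import numpy as np

    def friction_cone_violation(f_contact, normal, mu):
        """Return how far a contact force lies outside the Coulomb friction cone.

        f_contact : 3-vector contact force in the world frame
        normal    : unit surface normal at the contact
        mu        : coefficient of friction
        A zero return value means the contact satisfies |f_t| <= mu * f_n.
        """
        f_n = float(np.dot(f_contact, normal))          # normal component
        f_t = np.linalg.norm(f_contact - f_n * normal)  # tangential magnitude
        if f_n <= 0.0:
            return np.inf                               # pulling contact: always invalid
        return max(0.0, f_t - mu * f_n)

    # Cost term: sum of violations over all contacts of a candidate trajectory.
    forces = [np.array([1.0, 0.2, 5.0]), np.array([3.0, 0.0, 4.0])]
    normal = np.array([0.0, 0.0, 1.0])
    print(sum(friction_cone_violation(f, normal, mu=0.5) for f in forces))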
|
|
10:40-10:45, Paper ThAT22.3 | |
Coordination of Learned Decoupled Dual-Arm Tasks through Gaussian Belief Propagation |
|
Prados, Adrian | Universidad Carlos III De Madrid |
Espinoza, Gonzalo | Universidad Carlos III De Madrid |
Moreno, Luis | Carlos III University |
Barber, Ramon | Universidad Carlos III of Madrid |
Keywords: Dual Arm Manipulation, Manipulation Planning, Learning from Demonstration
Abstract: Robotic manipulation can involve multiple manipulators completing a task. In such cases, the complexity of performing the task in a coordinated manner increases, requiring coordinated planning while avoiding collisions between the robots and environmental elements. To address these challenges, we propose a robotic arm control algorithm based on Learning from Demonstration to independently learn the tasks of each arm, followed by a graph-based communication method using Gaussian Belief Propagation. Our method enables the resolution of decoupled dual-arm tasks learned independently without requiring coordinated planning. The algorithm generates smooth, collision-free solutions between arms and environmental obstacles while ensuring efficient movements without the need for constant replanning. Its efficiency has been validated through experiments and comparisons against another multi-robot control method in simulation using PyBullet with two opposing IIWA robots, as well as a mobile robot with two UR3 arms, which has also been used for real-world testing.
|
|
10:45-10:50, Paper ThAT22.4 | |
Grasping and Alignment of Stacked Fabrics by Robot Hands with Sticky Fingers |
|
Kondo, Kazuki | The University of Electro-Communications |
Yamazaki, Takuma | The University of Electro-Communications |
Kimura, Kohei | The University of Electro-Communications |
Kudoh, Shunsuke | The University of Electro-Communications |
Keywords: Dual Arm Manipulation, Grippers and Other End-Effectors, Grasping
Abstract: In this study, we propose and implement a method in which a dual-arm robot places only the topmost layer of stacked fabric at a target position. The proposed method employs a three-fingered robot hand consisting of one sticky finger and two non-sticky fingers to grasp only the topmost fabric layer. After grasping, feature matching is performed to obtain the rotation angle and translation vector required to align the fabric with the target position, after which the fabric is moved while being pressed by the fingers. The proposed method is implemented on a robot and validated in experiments using five types of fabric with different structures, surface materials, masses, and thicknesses, thereby confirming its effectiveness.
|
|
10:50-10:55, Paper ThAT22.5 | |
Imitation-Guided Bimanual Planning for Stable Manipulation under Changing External Forces |
|
Cai, Kuanqi | Technical University of Munich |
Wang, Chunfeng | Guangdong University of Technology |
Li, Zeqi | Technical University of Munich |
Yao, Haowen | Technical University of Munich |
Chen, Weinan | Guangdong University of Technology |
Figueredo, Luis | University of Nottingham (UoN) |
Billard, Aude | EPFL |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Dual Arm Manipulation, Human-Robot Collaboration, Motion and Path Planning
Abstract: Robotic manipulation in dynamic environments often requires seamless transitions between different grasp types to maintain stability and efficiency. However, achieving smooth and adaptive grasp transitions remains a challenge, particularly when dealing with external forces and complex motion constraints. Existing grasp transition strategies often fail to account for varying external forces and do not optimize motion performance effectively. In this work, we propose an Imitation-Guided Bimanual Planning Framework that integrates efficient grasp transition strategies and motion performance optimization to enhance stability and dexterity in robotic manipulation. Our approach introduces Strategies for Sampling Stable Intersections in Grasp Manifolds for seamless transitions between uni-manual and bi-manual grasps, reducing computational costs and regrasping inefficiencies. Additionally, a Hierarchical Dual-Stage Motion Architecture combines an Imitation Learning-based Global Path Generator with a Quadratic Programming-driven Local Planner to ensure real-time motion feasibility, obstacle avoidance, and superior manipulability. The proposed method is evaluated through a series of force-intensive tasks, demonstrating significant improvements in grasp transition efficiency and motion performance.
|
|
10:55-11:00, Paper ThAT22.6 | |
Local Path Optimization in the Latent Space Using Learned Distance Gradient |
|
Zhang, Jiawei | Harbin Institute of Technology |
Bai, Chengchao | Harbin Institute of Technology |
Pan, Wei | The University of Manchester |
Liu, Tianhang | Harbin Institute of Technology |
Guo, Jifeng | Harbin Institute of Technology |
Keywords: Dual Arm Manipulation, Motion and Path Planning
Abstract: Constrained motion planning is a common but challenging problem in robotic manipulation. In recent years, data-driven constrained motion planning algorithms have shown impressive planning speed and success rates. Among them, the latent motion method based on manifold approximation is the most efficient planning algorithm. Due to errors in manifold approximation and the difficulty of accurately identifying collision conflicts within the latent space, time-consuming path validity checks and path replanning are required. In this paper, we propose a method that trains a neural network to predict the minimum distance between the robot and obstacles using latent vectors as inputs. The learned distance gradient is then used to compute the direction of movement in the latent space that moves the robot away from obstacles. Based on this, a local path optimization algorithm in the latent space is proposed and integrated with the path validity checking process to reduce replanning time. The proposed method is compared with state-of-the-art algorithms in multiple planning scenarios, demonstrating the fastest planning speed.
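A minimal sketch of the core idea of moving latent waypoints along a learned distance gradient, using a numerical gradient and a toy distance function in place of the trained network; the function names, step size, and margin are assumptions rather than the paper's algorithm.

    import numpy as np

    def push_away_from_obstacles(z_path, distance_net, step=0.05, margin=0.10, iters=20):
        """Nudge latent waypoints along the learned distance gradient.

        z_path       : (N, d) array of latent-space waypoints
        distance_net : callable z -> predicted robot-obstacle clearance (scalar)
        The gradient is taken numerically here; a trained network would supply it
        analytically via back-propagation.
        """
        z_path = z_path.copy()
        eps = 1e-4
        for _ in range(iters):
            for i, z in enumerate(z_path):
                if distance_net(z) >= margin:
                    continue
                grad = np.array([(distance_net(z + eps * e) - distance_net(z - eps * e)) / (2 * eps)
                                 for e in np.eye(len(z))])
                z_path[i] = z + step * grad        # move toward larger clearance
        return z_path

    # Toy "distance network": clearance grows with the norm of the latent vector.
    toy_net = lambda z: float(np.linalg.norm(z)) - 0.2
    print(push_away_from_obstacles(np.zeros((3, 2)) + 0.05, toy_net))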
|
|
11:00-11:05, Paper ThAT22.7 | |
Adaptive Noise Rejection Strategy for Cooperative Motion Control of Dual-Arm Robots |
|
Zhang, Xiyuan | Hainan University |
Yu, Yilin | The School of Information and Communication Engineering, Hainan |
Cang, Naimeng | Hainan University |
Guo, Dongsheng | Hainan University |
Li, Shuai | University of Oulu |
Zhang, Weidong | Shanghai JiaoTong University |
Zheng, Jinrong | Institute of Deep-Sea Science and Engineering, Chinese Academy O |
Keywords: Dual Arm Manipulation, Robust/Adaptive Control, Motion Control
Abstract: Dual-arm robots possess exceptional collaborative capabilities and versatility, demonstrating broad application prospects across various fields. As a significant research area for dual-arm robots, the requirements for coordinated motion control are gradually increasing. In practical applications, robots inevitably encounter noise interference, which can lead to suboptimal performance in coordinated motion control. In this letter, cooperative motion control of dual-arm robots in the presence of harmonic noise is investigated. On the basis of the relative Jacobian method, an adaptive noise rejection strategy is proposed for cooperative motion control of dual-arm robots perturbed by harmonic noise. Such a strategy incorporates a compensator, which can simulate and suppress interference from harmonic noise. Theoretical analysis indicates that the Cartesian error generated by the proposed strategy converges. Simulation and experiment results on a dual-arm system consisting of two Panda robot manipulators further verify the noise resistance and applicability of the proposed strategy in the presence of harmonic noise.
|
|
11:05-11:10, Paper ThAT22.8 | |
Dual-Arm Fabric Manipulation Learning with Grasp Pose Constraints |
|
Zhu, Zhongpan | University of Shanghai for Science and Technology |
Guo, Zhaochen | Tongji University |
Hu, Qi | Qi |
Zhou, Yanmin | Tongji University |
He, Bin | Tongji University |
Keywords: Dual Arm Manipulation, Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Dual-arm manipulation of fabrics is an important and complex problem in embodied intelligence for robots. Previous work has mostly focused on extracting the geometric features of fabrics but has neglected the human-like knowledge- and experience-based constraints on grasp postures, which greatly affect the success rate of operations. Therefore, this paper proposes a dual-arm manipulation method for fabrics based on imitation learning, including an action-encoding learning algorithm based on dynamic motion primitives (DMPs), the establishment of a dual-arm motion primitive library that provides prior knowledge for operation planning, a segment-based dual-arm grasp method built on this prior knowledge, and the extraction of optimal grasp postures from the implicit geometric features of messy fabrics. Experiments show that the success rate of this method in the fabric unfolding task is 85.0%, with an average of 5.0 operations, and the success rate in the fabric folding task is 96.7%, both of which are significantly better than the two baseline methods, SpeedFolding and FlingBot.
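A minimal sketch of integrating a one-dimensional discrete dynamic movement primitive, the building block behind the motion-primitive library described above; the gains, basis placement, and zero-weight example are assumptions rather than the paper's implementation.

    import numpy as np

    def dmp_rollout(y0, g, weights, tau=1.0, dt=0.002, alpha=25.0, beta=6.25, alpha_x=3.0):
        """Integrate a 1-D discrete dynamic movement primitive.

        weights : amplitudes of Gaussian basis functions that shape the forcing
                  term learned from a demonstration.
        """
        n = len(weights)
        centers = np.exp(-alpha_x * np.linspace(0, 1, n))        # basis centers in phase space
        widths = 1.0 / (np.diff(centers, append=centers[-1] * 0.5) ** 2 + 1e-8)
        y, yd, x = y0, 0.0, 1.0
        traj = [y]
        while x > 1e-3:
            psi = np.exp(-widths * (x - centers) ** 2)
            forcing = x * (g - y0) * (psi @ weights) / (psi.sum() + 1e-10)
            ydd = (alpha * (beta * (g - y) - tau * yd) + forcing) / tau**2
            yd += ydd * dt
            y += yd * dt
            x += (-alpha_x * x / tau) * dt                        # canonical system
            traj.append(y)
        return np.array(traj)

    print(dmp_rollout(y0=0.0, g=0.1, weights=np.zeros(10))[-1])   # converges near the goal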
|
|
ThAT23 |
102B |
Force and Tactile Sensing 4 |
Regular Session |
|
10:30-10:35, Paper ThAT23.1 | |
UltraTac: Integrated Ultrasound-Augmented Visuotactile Sensor for Enhanced Robotic Perception |
|
Gong, Junhao | Tsinghua University |
Sou, Kit-Wa | Tsinghua University |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Guo, Changqing | Tsinghua University |
Huang, Yan | Tsinghua University |
Lyu, Chuqiao | Tsinghua Shenzhen International Graduate School |
Song, Ziwu | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Keywords: Force and Tactile Sensing, Sensor Fusion, Embedded Systems for Robotic and Automation
Abstract: Visuotactile sensors provide high-resolution tactile information but are incapable of perceiving the material features of objects. We present UltraTac, an integrated sensor that combines visuotactile imaging with ultrasound sensing through a coaxial optoacoustic architecture. The design shares structural components and achieves consistent sensing regions for both modalities. Additionally, we incorporate acoustic matching into the traditional visuotactile sensor structure, enabling the integration of the ultrasound sensing modality without compromising visuotactile performance. Through tactile feedback, we can dynamically adjust the operating state of the ultrasound module to achieve more flexible functional coordination. Systematic experiments demonstrate three key capabilities: proximity sensing in the 3–8 cm range (R² = 0.99), material classification (average accuracy: 99.20%), and texture-material dual-mode object recognition achieves 92.11% accuracy on a 15-class task. Finally, we integrate the sensor into a robotic manipulation system to concurrently detect container surface patterns and internal content, which verifies its promising potential for advanced human-machine interaction and precise robotic manipulation.
|
|
10:35-10:40, Paper ThAT23.2 | |
Vision-Based Force Feedback System Using Moiré Patterns |
|
Zuo, Jinrun | Hiroshima University |
Takaki, Takeshi | Hiroshima University |
Keywords: Force and Tactile Sensing, Visual Servoing, Force Control
Abstract: This paper proposes a novel force feedback system based on visual processing and Moiré patterns. The system uses a force sensor with a simple and efficient structure to eliminate the need for cables or other electronic components. A brass flexure plate was employed as the primary elastic element, leveraging the Moiré fringe principle for force measurements. High-speed cameras are used to capture real-time images that are then processed using advanced algorithms to accurately extract force data. Force feedback experiments were conducted to evaluate the performance of the proposed system, and the results were compared with those taken from conventional force sensors. The experimental results indicated that the proposed device consistently delivered precise target force outputs with outcomes that closely matched those obtained using traditional sensors. In addition, the system demonstrated robust performance in high-frequency measurements at 500 Hz, achieving an average force-feedback convergence time of approximately 0.1 s.
|
|
10:40-10:45, Paper ThAT23.3 | |
NailTact: Single-Camera Based Tactile Fingertip with Nail |
|
Zhou, Hao | Ritsumeikan University |
Miyazaki, Masahiro | Ritsumeikan University |
Shimonomura, Kazuhiro | Ritsumeikan University |
Keywords: Force and Tactile Sensing, Grasping
Abstract: Vision-based tactile sensing, an economical and widely utilized methodology, has the potential to offer crucial contact geometry information for localizing objects even in cases of visual occlusion. However, such fingertip sensors remain limited. When a person picks up a relatively small object placed on a flat surface with two fingers, they may not only use the pads of their fingers, depending on the size of the object, but also use their fingernails for small or thin objects. Fingers with nail structures have been shown to be effective for picking up such objects in robot hands as well. Moreover, in actual work, accidental contact between sensors and surrounding objects such as tables often occurs. Sensors with fingernails can avoid this situation in advance by having the fingernails touch the object before the fingertip does. In this work, we present NailTact, which can detect the force applied to both the fingertip part and the nail part from the same image using a single camera. Using the prototype robot finger, we verify the sensor's response characteristics to loads on the nail, its response when grasping an object with the nail, and its behavior when the finger makes contact with a table. We also present a simple model that illustrates the relationship between the force applied to the nail and the movement of the marker. In the card-grasping experiment, we not only successfully grasped a very thin object but also measured the grasping force.
|
|
10:45-10:50, Paper ThAT23.4 | |
Low-Fidelity Visuo-Tactile Pre-Training Improves Vision-Only Manipulation Performance |
|
Gano, Selam | Carnegie Mellon University |
George, Abraham | Carnegie Mellon University |
Barati Farimani, Amir | Carnegie Mellon University |
Keywords: Force and Tactile Sensing, Dexterous Manipulation, Imitation Learning
Abstract: Tactile perception is essential for real-world manipulation tasks, yet the high cost and fragility of tactile sensors can limit their practicality. In this work, we explore BeadSight (a low-cost, open-source tactile sensor) alongside a tactile pre-training approach, an alternative method to precise, pre-calibrated sensors. By pre-training with the tactile sensor and then disabling it during downstream tasks, we aim to enhance robustness and reduce costs in manipulation systems. We investigate whether tactile pre-training, even with a low-fidelity sensor like BeadSight, can improve the performance of an imitation learning agent on complex manipulation tasks. Through visuo-tactile pre-training on both similar and dissimilar tasks, we analyze its impact on a longer-horizon downstream task. Our experiments show that visuo-tactile pre-training improved performance on a USB cable plugging task by up to 65% with vision-only inference. Additionally, on a longer-horizon drawer pick-and-place task, pre-training — whether on a similar, dissimilar, or identical task — consistently improved performance, highlighting the potential for a large-scale visuo-tactile pre-trained encoder.
|
|
10:50-10:55, Paper ThAT23.5 | |
ThinTact: Thin Vision-Based Tactile Sensor by Lensless Imaging |
|
Xu, Jing | Tsinghua University |
Chen, Weihang | Tsinghua University |
Qian, Hongyu | Tsinghua University |
Wu, Dan | Tsinghua University |
Chen, Rui | Tsinghua University |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Lensless Imaging, Computer Vision for Other Robotic Applications
Abstract: Vision-based tactile sensors have drawn increasing interest in the robotics community. However, traditional lens-based designs impose minimum thickness constraints on these sensors, limiting their applicability in space-restricted settings. In this paper, we propose ThinTact, a novel lensless vision-based tactile sensor with a sensing field of over 200 mm² and a thickness of less than 10 mm. ThinTact utilizes the mask-based lensless imaging technique to map the contact information to CMOS signals. To ensure real-time tactile sensing, we propose a real-time lensless reconstruction algorithm that leverages a frequency-spatial-domain joint filter based on the discrete cosine transform (DCT). This algorithm achieves significantly faster computation than existing optimization-based methods. Additionally, to improve the sensing quality, we develop a mask optimization method based on the genetic algorithm and the corresponding system matrix calibration algorithm. We evaluate the performance of our proposed lensless reconstruction and tactile sensing through qualitative and quantitative experiments. Furthermore, we demonstrate ThinTact’s practical applicability in diverse applications.
|
|
10:55-11:00, Paper ThAT23.6 | |
IFEM2.0: Dense 3-D Contact Force Field Reconstruction and Assessment for Vision-Based Tactile Sensors |
|
Zhao, Can | Shanghai Jiao Tong University |
Liu, Jin | Shanghai Jiao Tong University |
Ma, Daolin | Shanghai Jiao Tong University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Perception for Grasping and Manipulation, Inverse Finite Element Method
Abstract: Vision-based tactile sensors offer rich tactile information through high-resolution tactile images, enabling the reconstruction of dense contact force fields on the sensor surface. However, accurately reconstructing the 3-D contact force distribution remains a challenge. In this article, we propose the multilayer inverse finite-element method (iFEM2.0) as a robust and generalized approach to reconstruct dense contact force distribution. We systematically analyze various parameters within the iFEM2.0 framework, and determine the appropriate parameter combinations through simulation and in situ mechanical calibration. Our approach incorporates multilayer mesh constraints and ridge regularization to enhance robustness. Furthermore, as no off-the-shelf measurement equipment or criterion metrics exist for 3-D contact force distribution perception, we present a benchmark covering accuracy, fidelity, and noise resistance that can serve as a cornerstone for other future force distribution reconstruction methods. The proposed iFEM2.0 demonstrates good performance in both simulation- and experiment-based evaluations. Such dense 3-D contact force information is critical for enabling dexterous robotic manipulation that handles both rigid and soft materials.
|
|
11:00-11:05, Paper ThAT23.7 | |
3D Vision-Tactile Reconstruction from Infrared and Visible Images for Robotic Fine-Grained Tactile Perception |
|
Lin, Yuankai | Huazhong University of Science and Technology |
Lu, Xiaofan | Huazhong University of Science and Technology |
Chen, Jiahui | Huazhong University of Science and Technology |
Yang, Hua | Huazhong University of Science and Technology |
Keywords: Force and Tactile Sensing, Haptics and Haptic Interfaces, Sensor Fusion
Abstract: To achieve human-like haptic perception in anthropomorphic grippers, the compliant sensing surfaces of vision tactile sensor (VTS) must evolve from conventional planar configurations to biomimetically curved topographies with continuous surface gradients. However, planar VTSs have challenges when extended to curved surfaces, including insufficient lighting of surfaces, blurring in reconstruction, and complex spatial boundary conditions for surface structures. With an end goal of constructing a human-like fingertip, our research (i) develops GelSplitter3D by expanding imaging channels with a prism and a near-infrared (NIR) camera, (ii) proposes a photometric stereo neural network with a CAD-based normal ground truth generation method to calibrate tactile geometry, and (iii) devises a normal integration method with boundary constraints of depth prior information to correcting the cumulative error of surface integrals. We demonstrate better tactile sensing performance, a 40% improvement in normal estimation accuracy, and the benefits of sensor shapes in grasping and manipulation tasks.
|
|
11:05-11:10, Paper ThAT23.8 | |
High Temperature Sterilization Resistant and Enclosed Three-Axial Force-Sensing Surgical Instrument Integrated with Step-Reduced FBG |
|
Li, Tianliang | Wuhan University of Technology |
Fan, Haolei | Wuhan University of Technology |
Zhao, Chen | Wuhan University of Technology |
Du, Mingchang | Wuhan University of Technology |
Tu, Houxin | Wuhan University of Technology |
Zhu, Siqi | Wuhan University of Technology |
Keywords: Force and Tactile Sensing, Surgical Robotics: Laparoscopy, Medical Robots and Systems
Abstract: The Fiber Bragg grating (FBG) three-axial force sensor provides force feedback for an endoscopic surgical robot, reducing operational difficulty and risks. However, the packaging method of the optical fiber sensor demonstrates limited adaptability to both high-temperature sterilization environments and the wet operative areas encountered during surgery. Based on this, this paper presents a step-reduced FBG and an enclosed three-axial force sensor. The sensor adopts an integrated design with a maximum outer diameter of 4.5 mm and can be seamlessly integrated into the end of the flexible endoscopic surgical robot. At the same time, a hydrofluoric acid corrosion process is introduced to obtain the twin reflection spectrum and realize the decoupling of three-axial forces and temperature. Static calibration demonstrates etched grating sensitivities of 227.78 pm/N (Fx), 242.63 pm/N (Fy), and 233.50 pm/N (Fz) via least-squares fitting. A force-temperature coupling experiment confirms that maximum full-scale force errors remain below 5% under temperature perturbation, verifying the reliability of the sensor. Finally, a high-temperature sterilization experiment at 180°C was conducted, demonstrating the designed sensor’s thermal stability under medical device sterilization protocols.
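As an illustration of the least-squares static calibration step mentioned above, the sketch below fits a linear map from Bragg-wavelength shifts to applied forces on synthetic data. The diagonal sensitivities reuse the magnitudes quoted in the abstract purely as plausible numbers; the real sensor's cross-coupling, temperature decoupling, and data are not reproduced.

    import numpy as np

    # Synthetic calibration data: forces (N) and resulting wavelength shifts (pm).
    rng = np.random.default_rng(0)
    true_sensitivity = np.diag([227.78, 242.63, 233.50])         # pm/N, diagonal for illustration
    forces = rng.uniform(-2.0, 2.0, size=(50, 3))                # applied (Fx, Fy, Fz) samples
    shifts = forces @ true_sensitivity.T + rng.normal(0, 2.0, (50, 3))

    # Least-squares estimate of the calibration matrix, then force recovery.
    C, *_ = np.linalg.lstsq(forces, shifts, rcond=None)          # shifts ≈ forces @ C
    estimated_force = np.linalg.solve(C.T, shifts[0])            # recover force from one reading
    print(np.round(C.T, 2))
    print(estimated_force, forces[0])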
|
|
ThAT24 |
102C |
Calibration and Identification 1 |
Regular Session |
|
10:30-10:35, Paper ThAT24.1 | |
Generative Adversarial Networks for Solving Hand-Eye Calibration without Data Correspondence |
|
Hong, IlKwon | Hyundai Motor Group |
Ha, Junhyoung | Korea Institute of Science and Technology |
Keywords: Calibration and Identification, Deep Learning Methods, Visual Servoing
Abstract: In this study, we rediscovered the framework of generative adversarial networks (GANs) as a solver for calibration problems without data correspondence. When data correspondence is not present or loosely established, the calibration problem becomes a parameter estimation problem that aligns the two data distributions. This procedure is conceptually identical to the underlying principle of GAN training in which networks are trained to match the generative distribution to the real data distribution. As a primary application, this idea is applied to the hand-eye calibration problem, demonstrating the proposed method's applicability and benefits in complicated calibration problems.
|
|
10:35-10:40, Paper ThAT24.2 | |
EF-Calib: Spatiotemporal Calibration of Event and Frame-Based Cameras Using Continuous-Time Trajectories |
|
Wang, Shaoan | Peking University |
Xin, Zhanhua | Peking University |
Hu, Yaoqing | Peking University |
Li, Dongyue | Peking University |
Zhu, Mingzhu | Fuzhou University |
Yu, Junzhi | Chinese Academy of Sciences |
Keywords: Calibration and Identification, Sensor Fusion, Computer Vision for Automation
Abstract: The event camera, a bio-inspired asynchronous triggered camera, offers promising prospects for fusion with frame-based cameras owing to its low latency and high dynamic range. However, calibrating stereo vision systems that incorporate both event- and frame-based cameras remains a significant challenge. In this letter, we present EF-Calib, a spatiotemporal calibration framework for event- and frame-based cameras using continuous-time trajectories. A novel calibration pattern applicable to both camera types and the corresponding event recognition algorithm are proposed. Leveraging the asynchronous nature of events, a derivable piece-wise B-spline to represent camera pose continuously is introduced, enabling calibration for intrinsic parameters, extrinsic parameters, and time offset, with analytical Jacobians provided. Various experiments are carried out to evaluate the calibration performance of EF-Calib, including calibration experiments for intrinsic parameters, extrinsic parameters, and time offset. Experimental results demonstrate that EF-Calib outperforms current SOTA methods by achieving the most accurate intrinsic parameters, comparable accuracy in extrinsic parameters to frame-based method, and precise time offset estimation. EF-Calib provides a convenient and accurate toolbox for calibrating the system that fuses events and frames. The code of this paper is open-sourced at: https://github.com/wsakobe/EF-Calib.
|
|
10:40-10:45, Paper ThAT24.3 | |
Active Learning for Exciting Motion Generation with Safety Constraint: Toward Reducing Model-Reality Gap in Inertial Parameters |
|
Mori, Kenya | The University of Tokyo |
Ayusawa, Ko | National Institute of Advanced Industrial Science and Technology |
Venture, Gentiane | The University of Tokyo |
Keywords: Calibration and Identification, Planning under Uncertainty, Constrained Motion Planning
Abstract: Inertial parameters should be estimated accurately for precise robot control and simulation. Exciting motions, motions that sufficiently excite all robot dynamics, must be generated to obtain these parameters accurately. However, this process requires a certain level of accuracy in the inertial parameters themselves, leading to a circular dependency, especially in cases with a significant model-reality discrepancy. To address this challenge, we propose a constrained method for generating exciting motions within an iterative data acquisition process (active learning). It optimizes the condition number of the regressor matrix and achieves sufficient excitation by combining several motions. We evaluate our method on a manipulator robot not fixed to the ground, in which case dynamic constraints should be considered to maintain balance during experiments. Despite a substantial gap between the initial model and the actual robot (simulated by adding a 0.575 kg payload at the end effector), our method continuously improved the condition number without motion execution failure—whereas conventional methods resulted in robot tipping. Furthermore, cross-validation analysis confirmed that our approach achieved the lowest root mean square error (RMSE) along with one conventional method. These results collectively demonstrate that our method provides the best performance for inertial parameter identification, proving to be particularly practical in applications where dynamic constraints are critical for motion planning. This method minimizes the need for a priori knowledge or expertise, as the generation process is fully automated and can easily be generalized to various robotic systems.
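A minimal sketch of the quantity being optimized when generating exciting motions: the condition number of the stacked identification regressor. The block shapes and random data below are placeholders; building the actual dynamics regressor of a specific robot is outside the scope of this sketch.

    import numpy as np

    def excitation_condition_number(regressor_blocks):
        """Condition number of the stacked identification regressor.

        regressor_blocks : list of (n_dof, n_params) regressor matrices, one per
                           sampled configuration of a candidate exciting motion.
        Lower values mean the motion excites all inertial parameters more evenly,
        which is what exciting-trajectory generation tries to minimize.
        """
        Y = np.vstack(regressor_blocks)
        return np.linalg.cond(Y)

    # Toy example with random blocks standing in for the robot's dynamics regressor.
    rng = np.random.default_rng(1)
    blocks = [rng.normal(size=(6, 10)) for _ in range(40)]
    print(excitation_condition_number(blocks))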
|
|
10:45-10:50, Paper ThAT24.4 | |
Multi-Kernel Correntropy Regression: Robustness, Optimality, and Application on Magnetometer Calibration (I) |
|
Li, Shilei | Beijing Institute of Technology |
Chen, Yihan | Harbin Institute of Technology Shenzhen, |
Lou, Yunjiang | Harbin Institute of Technology, Shenzhen |
Shi, Dawei | Beijing Institute of Technology |
Li, Lijing | China University of Mining and Technology |
Shi, Ling | The Hong Kong University of Science and Technology |
Keywords: Calibration and Identification, Optimization and Optimal Control, Sensor Fusion
Abstract: This paper investigates the robustness and optimality of the multi-kernel correntropy (MKC) on linear regression. We first derive an upper error bound for a scalar regression problem in the presence of arbitrarily large outliers. Then, we find that the proposed MKC is related to a specific heavy-tail distribution, where its head shape is consistent with the Gaussian distribution while its tail shape is heavy-tailed and the extent of heavy-tail is controlled by the kernel bandwidth. Interestingly, when the bandwidth is infinite, the MKC-induced distribution becomes a Gaussian distribution, enabling the MKC to address both Gaussian and non-Gaussian problems by appropriately selecting correntropy parameters. To automatically tune these parameters, an expectation-maximization-like (EM) algorithm is developed to estimate the parameter vectors and the correntropy parameters in an alternating manner. The results show that our algorithm can achieve equivalent performance compared with the traditional linear regression under Gaussian noise and significantly outperforms the conventional method under heavy-tailed noise. Both numerical simulations and experiments on a magnetometer calibration application verify the effectiveness of the proposed method.
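The multi-kernel correntropy with EM-based parameter tuning described above is more involved; the sketch below illustrates only the single-kernel maximum-correntropy idea, solved by iteratively reweighted least squares so that large residuals are down-weighted. The data and parameter values are arbitrary.

    import numpy as np

    def correntropy_regression(X, y, sigma=1.0, iters=30):
        """Linear regression under a (single-kernel) maximum correntropy criterion.

        Solved by iteratively reweighted least squares: each sample receives the
        weight exp(-r^2 / (2 sigma^2)), so large residuals (outliers) contribute little.
        """
        w = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary LS initialization
        for _ in range(iters):
            r = y - X @ w
            weights = np.exp(-r**2 / (2.0 * sigma**2))
            W = np.sqrt(weights)[:, None]
            w = np.linalg.lstsq(W * X, np.sqrt(weights) * y, rcond=None)[0]
        return w

    # Synthetic data with heavy outliers.
    rng = np.random.default_rng(3)
    X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
    y = X @ np.array([0.5, 2.0]) + rng.normal(0, 0.05, 100)
    y[:10] += 20.0                                        # corrupt 10% of the samples
    print(correntropy_regression(X, y, sigma=0.2))        # close to [0.5, 2.0]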
|
|
10:50-10:55, Paper ThAT24.5 | |
Multiscale and Uncertainty-Aware Targetless Hand-Eye Calibration Via the Gauss-Helmert Model |
|
Colakovic-Benceric, Marta | University of Zagreb |
Persic, Juraj | University of Zagreb |
Markovic, Ivan | University of Zagreb Faculty of Electrical Engineering and Compu |
Petrovic, Ivan | University of Zagreb |
Keywords: Calibration and Identification, Sensor Fusion, Localization, Optimization
Abstract: The operational reliability of an autonomous robot depends crucially on extrinsic sensor calibration as a prerequisite for precise and accurate data fusion. Exploring the calibration of unscaled sensors (e.g., monocular cameras) and the effective utilization of uncertainties are difficult and often overlooked. The development of a solution for the simultaneous calibration of hand-eye sensors and scale estimation based on the Gauss-Helmert model aims to utilize the valuable information contained in the uncertainty of odometry. In this work, we propose a versatile and robust solution for batch calibration based on the analytical on-manifold approach for estimation. The versatility of our method is demonstrated by its ability to calibrate multiple unscaled and metric-scaled sensors while dealing with odometry failures and reinitializations. Importantly, all estimated parameters are provided with their corresponding uncertainties. The validation of our method and its comparison with five competing state-of-the-art calibration methods in both simulations and real-world experiments show its superior accuracy, with particularly promising results observed in high-noise scenarios.
|
|
10:55-11:00, Paper ThAT24.6 | |
PLK-Calib: Single-Shot and Target-Less LiDAR-Camera Extrinsic Calibration Using Plücker Lines |
|
Zhang, Yanyu | University of California, Riverside |
Xu, Jie | University of California, Riverside |
Ren, Wei | University of California, Riverside |
Keywords: Calibration and Identification, Data Sets for Robotic Vision, Sensor Fusion
Abstract: Accurate LiDAR-Camera (LC) calibration is challenging but crucial for autonomous systems and robotics. In this paper, we propose two single-shot and target-less algorithms to estimate the calibration parameters between LiDAR and camera using line features. The first algorithm constructs line-to-line constraints by defining points-to-line projection errors and minimizes the projection error. The second algorithm (PLK-Calib) utilizes the co-perpendicular and co-parallel geometric properties of lines in Plücker (PLK) coordinates, and decouples the rotation and translation into two constraints, enabling more accurate estimates. Our degeneracy analysis and Monte Carlo simulation indicate that three nonparallel line pairs are the minimal requirement to estimate the extrinsic parameters. Furthermore, we collect an LC calibration dataset with varying extrinsics under three different scenarios and use it to evaluate the performance of our proposed algorithms.
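A minimal sketch of the Plücker line representation and the co-parallel / co-perpendicular tests mentioned above, under the usual (direction, moment) convention; this is textbook geometry, not the paper's calibration pipeline.

    import numpy as np

    def plucker_from_points(p1, p2):
        """Plücker coordinates (direction d, moment m) of the line through p1, p2."""
        d = p2 - p1
        d = d / np.linalg.norm(d)
        m = np.cross(p1, d)            # the moment is independent of the chosen point on the line
        return d, m

    def lines_parallel(d1, d2, tol=1e-6):
        return np.linalg.norm(np.cross(d1, d2)) < tol

    def lines_perpendicular(d1, d2, tol=1e-6):
        return abs(np.dot(d1, d2)) < tol

    d_a, m_a = plucker_from_points(np.array([0., 0., 0.]), np.array([1., 0., 0.]))
    d_b, m_b = plucker_from_points(np.array([0., 1., 0.]), np.array([0., 1., 1.]))
    print(lines_perpendicular(d_a, d_b), lines_parallel(d_a, d_b))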
|
|
11:00-11:05, Paper ThAT24.7 | |
Cal or No Cal? - Real-Time Miscalibration Detection of LiDAR and Camera Sensors |
|
Tahiraj, Ilir | Technical University of Munich |
Swadiryus, Jeremialie | Technical University of Munich |
Fent, Felix | Technical University of Munich |
Lienkamp, Markus | Technical University of Munich |
Keywords: Calibration and Identification, Sensor Fusion, Intelligent Transportation Systems
Abstract: The goal of extrinsic calibration is the alignment of sensor data to ensure an accurate representation of the surroundings and enable sensor fusion applications. From a safety perspective, sensor calibration is a key enabler of autonomous driving. In the current state of the art, a trend from target-based offline calibration towards targetless online calibration can be observed. However, online calibration is subject to strict real-time and resource constraints which are not met by state-of-the-art methods. This is mainly due to the high number of parameters to estimate, the reliance on geometric features, or the dependence on specific vehicle maneuvers. To meet these requirements and ensure the vehicle's safety at any time, we propose a miscalibration detection framework that shifts the focus from the direct regression of calibration parameters to a binary classification of the calibration state, i.e., calibrated or miscalibrated. Therefore, we propose a contrastive learning approach that compares embedded features in a latent space to classify the calibration state of two different sensor modalities. Moreover, we provide a comprehensive analysis of the feature embeddings and challenging calibration errors that highlight the performance of our approach. As a result, our method outperforms the current state-of-the-art in terms of detection performance, inference time, and resource demand. The code will be made available open-source.
|
|
11:05-11:10, Paper ThAT24.8 | |
Multi-Cali Anything: Dense Feature Multi-Frame Structure-From-Motion for Large-Scale Camera Array Calibration |
|
You, Jinjiang | Meta Platforms, Inc |
Wang, Hewei | Carnegie Mellon University |
Li, Yijie | Carnegie Mellon University |
Huo, Mingxiao | Carnegie Mellon University |
Tran Ha, Long Van | Symbotic |
Ma, Mingyuan | Harvard University |
Xu, Jinfeng | The University of Hong Kong |
Zhang, Jiayi | University of Nottingham Ningbo China |
Wu, Puzhen | Cornell University |
Garg, Shubham | Meta Platforms, Inc |
Pu, Wei | Carnegie Mellon University |
Keywords: Calibration and Identification, Computational Geometry, RGB-D Perception
Abstract: Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration method that refines intrinsics directly from scene data, eliminating the necessity for additional calibration captures. Our approach enhances traditional Structure-from-Motion (SfM) pipelines by introducing an extrinsics regularization term to progressively align estimated extrinsics with ground-truth values, a dense feature reprojection term to reduce keypoint errors by minimizing reprojection loss in the feature space, and an intrinsics variance term for joint optimization across multiple frames. Experiments on the Multiface dataset show that our method achieves nearly the same precision as dedicated calibration processes, and significantly enhances intrinsics and 3D reconstruction accuracy. Fully compatible with existing SfM pipelines, our method provides an efficient and practical plug-and-play solution for large-scale camera setups. Our code is publicly available at: https://github.com/YJJfish/Multi-Cali-Anything
|
|
ThAT25 |
103A |
Legged Robots 5 |
Regular Session |
Co-Chair: Ding, Liang | Harbin Institute of Technology |
|
10:30-10:35, Paper ThAT25.1 | |
Design of Biomimetic and Energy-Efficient Legs for a Humanoid Robot |
|
Wang, Junyang | Harbin Institute of Technology |
Li, XueAi | Harbin Institute of Technology |
Ni, Fenglei | State Key Laboratory of Robotics and System, Harbin Institute Of |
Cao, Baoshi | Harbin Institute of Technology |
Qi, Le | Harbin Institute of Technology |
Liu, Hong | Harbin Institute of Technology |
Wang, Xiangji | Harbin Institute of Technology |
Zhang, Teng | Harbin Institute of Technology |
Keywords: Legged Robots, Mechanism Design, Biomimetics
Abstract: The mobility and manipulation of bipedal humanoid robots always depend on their legs that account for balancing, which may substantially accelerate energy consumption especially when the lower body is expected to be stationary. To this end, this letter presents a biomimetic and energy-efficient design for bipedal robots’ legs and extensively demonstrates its performance on the developed prototype. From the biomimetic perspective, human walking data is captured at first, and the range of motion, speed, and coupling relationship of various joints is analyzed. The skeletal and muscular structure of each joint is then dissected and imitated by mechanism synthesis, where a novel locking mechanism inspired by the biological structure of the knee joint is integrated. In order to better evaluate the energy consumption capability of legged robots, we propose a new metric entitled the dynamic power of the system (DPoS) and experimentally prove its rationality. The effectiveness and superiority of our design are ultimately validated through the comparative experiments on both our prototype and the off-the-shelf counterpart.
|
|
10:35-10:40, Paper ThAT25.2 | |
Development of a 3-DOF Planar Monopod Piezoelectric Robot Actuated by Multidirectional Spatial Elliptical Trajectories (I) |
|
Zhao, Yuzhu | Harbin Institute of Technology |
Zhang, Shijing | Harbin Institute of Technology |
Deng, Jie | Harbin Institute of Technology |
Li, Jing | Harbin Institute of Technology |
Liu, Yingxiang | Harbin Institute of Technology |
Keywords: Legged Robots, Micro/Nano Robots
Abstract: A three degrees of freedom (3-DOF) planar monopod piezoelectric robot (MPR) is proposed in this work, in which five longitudinal vibration ultrasonic transducers are orthogonally set and gathered at a single driving foot. A prominent feature of the MPR is that the single foot can generate multidirectional spatial elliptical trajectories by the cooperation of the first-order longitudinal vibrations in three orthogonal directions to achieve flexible motions. The proposed MPR addresses the problems of complex structures, intricate control strategies, and motion inconsistency in existing multi-DOF piezoelectric robots employing multiple driving feet. By optimizing the distribution of the multidirectional spatial elliptical trajectories, the MPR realizes linear motions along X-axis and Y-axis, and rotary motion around Z-axis. In addition, it holds the capability of generating planar omnidirectional motions. The working principle is illustrated and validated via simulations and experiments. The experimental results demonstrate that the MPR is capable of achieving 3-DOF fast motions at speeds of up to 700 mm/s, 712 mm/s, and 919 °/s. The linear motion resolution reaches 0.49 µm. Furthermore, the MPR can carry a maximum load of 3 kg. The proposed MPR successfully achieves outstanding performances including agile motion, fast speed, high resolution, and strong load capacity.
|
|
10:40-10:45, Paper ThAT25.3 | |
Tensegrity-Based Legged Robot Generates Passive Walking, Skipping, and Crawling Gaits in Accordance with Environment (I) |
|
Zheng, Yanqiu | Ritsumeikan University |
Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Yan, Cong | Ritsumeikan University |
Li, Longchuan | Beijing University of Chemical Technology |
Tokuda, Isao T. | Ritsumeikan University |
Keywords: Legged Robots, Modeling and Simulating Humans, Passive Walking
Abstract: Legged locomotion animals produce various gaits, e.g., skipping, walking, running, and crawling, depending upon the environmental situation. Such an autonomous selection of distinct gait patterns is highly advantageous for energy efficiency, stress minimization, and postural stability. It should be of interest to introduce such a remarkably flexible mechanism to robotics research. This study addresses the mechanism of generating various gait patterns in a passive legged locomotion system by introducing a tensegrity structure. Building upon the classical model of the rimless wheel, we propose a novel model, called rimless wheel-like tensegrity walker (RTW). Numerical simulations show that the RTW system can generate skipping, walking, and crawling gaits depending upon the strength of the body-leg coupling, which controls independence level of the leg movements. Smooth gait transition can also be realized by a change in the body-leg coupling or the environmental parameter. An experimental study using physical models of the RTW confirmed the validity of the numerical results. The RTW may provide a minimal locomotion model to generate various gaits and to induce their transitions.
|
|
10:45-10:50, Paper ThAT25.4 | |
Workspace-Based Motion Planning for Quadrupedal Robots on Rough Terrain (I) |
|
Gu, Yaru | Memorial University of Newfoundland |
Zou, Ting | Memorial University |
Keywords: Legged Robots, Motion and Path Planning
Abstract: Legged robots have demonstrated high potential when dealing with rough terrain, for which an efficient motion planner becomes crucial. This article presents a novel approach for quadrupedal robot motion planning on rough terrain that is both conceptually straightforward and computationally efficient. Implementing the concept of workspace constitutes the cornerstone of this method: both body poses and swing-leg footholds are chosen within their corresponding workspace. A novel approach called the “cross-diagonal method” is developed to facilitate the search for new body poses. Based on the obtained body pose, the foothold for a swing leg selected within its foot workspace satisfies the reachability constraint automatically. The proposed motion planning scheme is integrated with an elevation mapping module and a state estimation module, enabling quadrupedal robots to travel through uneven terrains with high efficiency. The significance of this work is validated through simulation and physical experiments with a quadrupedal robot, which achieves high success rates in overcoming difficult terrains without prior knowledge of the environment. This approach offers the advantages of high computational efficiency, simplicity, and adaptability to different types of terrain, making it a promising solution for real-world applications.
|
|
10:50-10:55, Paper ThAT25.5 | |
Quadrupedal Locomotion with Parallel Compliance: E-Go Design, Modeling, and Control (I) |
|
Ding, Jiatao | University of Trento |
Posthoorn, Perry | Delft University of Technology |
Atanassov, Vassil | University of Oxford |
Boekel, Fabio García Medina | Delft University of Technology |
Kober, Jens | TU Delft |
Della Santina, Cosimo | TU Delft |
Keywords: Legged Robots, Multi-Contact Whole-Body Motion Planning and Control, Optimization and Optimal Control
Abstract: To promote the research in compliant quadrupedal locomotion, especially with parallel elasticity, we present Delft E-Go, which is an easily accessible quadruped that combines the Unitree Go1 with open-source mechanical add-ons and control architecture. Implementing this novel system required a combination of technical work and scientific innovation. First, a dedicated parallel spring with adjustable rest length is designed to strengthen each actuated joint. Then, a novel 3D dual spring-loaded inverted pendulum model is proposed to characterize the compliant locomotion dynamics, decoupling the actuation with parallel compliance. Based on this template model, trajectory optimization is employed to generate optimal explosive motion without requiring reference defined in advance. To complete the system, a torque controller with anticipatory compensation is adopted for motion tracking. Extensive hardware experiments in multiple scenarios, such as trotting across uneven terrains, efficient walking, and explosive pronking, demonstrate the system’s reliability, energy benefits of parallel compliance, and enhanced locomotion capability. Particularly, we demonstrate for the first time the controlled pronking of a quadruped with asymmetric legs with this novel system.
|
|
10:55-11:00, Paper ThAT25.6 | |
Online Hierarchical Planning for Multicontact Locomotion Control of Quadruped Robots (I) |
|
Sun, Hao | China Academy of Launch Vehicle Technology |
Yang, Junjie | HIT |
Jia, Yinghao | Harbin Institute of Technology |
Wang, Changhong | HIT |
Keywords: Legged Robots, Multi-Contact Whole-Body Motion Planning and Control, Whole-Body Motion Planning and Control
Abstract: Owing to challenges such as lengthy solving time and convergence issues, multicontact locomotion planning problems are often formulated with fixed contact schedules, greatly restricting the flexibility of quadrupedal robot behavior. This article presents a novel hierarchical planning framework designed for online multicontact locomotion control of quadruped robots. At the top level, we systematically explore the gait branches of a passive planar quadrupedal dynamic model via numerical continuation. In addition, we propose a gait assessment strategy at varying speeds, comprehensively considering both energy consumption and stability. At the middle level, we present an efficient strategy for contact-implicit optimization problems by integrating McCormick envelopes and alternating direction method of multipliers. Using the gait selection reference obtained from the top level as an initial guess can substantially reduce the solution space and bring the resulting solution closer to the global optimum. Based on the gait pattern and state trajectory reference acquired from the middle level, we adopt a hybrid kinodynamic model for application in model predictive control of quadrupedal locomotion. To validate the proposed hierarchical planning framework, we conduct comparative locomotion experiments on the quadrupedal robot SCIT-Dog under varying speeds. Experimental results demonstrate the effectiveness and superiority of the proposed algorithm compared to the impulse-based gait transition method and the predefined trot gait pattern. Moreover, the observed gaits align with those of quadrupedal animals, demonstrating the potential of the proposed framework to enhance adaptability and performance in multicontact locomotion planning for quadrupedal robots.
|
|
11:00-11:05, Paper ThAT25.7 | |
16 Ways to Gallop: Energetics and Body Dynamics of High-Speed Quadrupedal Gaits |
|
Alqaham, Yasser G. | Syracuse University |
Cheng, Jing | Syracuse University |
Gan, Zhenyu | Syracuse University |
Keywords: Legged Robots, Optimization and Optimal Control, Dynamics
Abstract: Galloping is a common high-speed gait in both animals and quadrupedal robots, yet its energetic characteristics remain insufficiently explored. This study systematically analyzes a large number of possible galloping gaits by categorizing them based on the number of flight phases per stride and the phase relationships between the front and rear legs, following Hildebrand’s framework for asymmetrical gaits. Using the A1 quadrupedal robot from Unitree, we model galloping dynamics as a hybrid dynamical system and employ trajectory optimization (TO) to minimize the cost of transport (CoT) across a range of speeds. Our results reveal that rotary and transverse gallop footfall sequences exhibit no fundamental energetic difference, despite variations in body yaw and roll motion. However, the number of flight phases significantly impacts energy efficiency: galloping with no flight phases is optimal at lower speeds, whereas galloping with two flight phases minimizes energy consumption at higher speeds. We validate these findings using a Quadratic Programming (QP)-based controller, developed in our previous work, in Gazebo simulations. These insights advance the understanding of quadrupedal locomotion energetics and may inform future legged robot designs for adaptive, energy-efficient gait transitions.
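The optimization objective above is the cost of transport (CoT), a dimensionless measure of energy used per unit weight and distance. A one-line sketch with illustrative numbers (not results from the paper) is:

```python
def cost_of_transport(energy_j, mass_kg, distance_m, g=9.81):
    """Dimensionless cost of transport: energy per unit weight per unit distance."""
    return energy_j / (mass_kg * g * distance_m)

# Illustrative numbers only (not from the paper): a ~12 kg robot spending
# 180 J of actuator energy to travel 10 m.
print(round(cost_of_transport(180.0, 12.0, 10.0), 3))   # -> 0.153
```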
|
|
11:05-11:10, Paper ThAT25.8 | |
An Online Terrain Classification Framework for Legged Robots Based on Fusion of Proprioceptive and Exteroceptive Sensors |
|
Ding, Weikai | Shandong University |
Meng, Jingui | Shandong University |
Zhu, Zhengguo | Shandong University |
Chen, Teng | Shandong University |
Zhang, Guoteng | Shandong University |
Keywords: Legged Robots, Perception-Action Coupling
Abstract: Terrain classification is crucial for assessing terrain traversability and supporting locomotion control of legged robots. By integrating multi-source sensor information, including exteroceptive sensors and proprioceptive sensors, legged robots can acquire terrain geometric features and surface cover types. However, single-sensor approaches exhibit inherent limitations, where exteroceptive sensors are susceptible to environmental interference while proprioceptive sensors struggle to identify surface cover types. To address these challenges, this paper proposes a robust terrain classification framework that overcomes the limitations of single-modal perception through fusion of exteroceptive and proprioceptive sensors. The framework comprises a Golden Sine Optimization Algorithm-based random forest model using proprioceptive sensors to determine optimal hyperparameter combinations based on classification requirements, and a YOLOv11 network integrated with an intersection-over-union object tracking algorithm to achieve stable image extraction during robot movement. Final terrain classification is accomplished through Kalman filter-based decision fusion. Experimental validation demonstrated classification accuracies of 94.4% for the proprioceptive module and 94.2% for the visual module in offline testing. In online fusion testing, the system achieved 95.9% overall classification accuracy, confirming the effectiveness and engineering practicality of the proposed method.
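The final decision fusion in the paper is Kalman-filter based. As a simplified stand-in that only conveys the idea of combining the proprioceptive and visual modules, the sketch below fuses two per-class probability vectors with a naive Bayes product rule; the terrain classes and probabilities are invented:

```python
import numpy as np

def fuse_class_probabilities(p_proprio, p_vision, prior=None):
    """Naive Bayes-style decision fusion of two classifier outputs.
    (The paper fuses decisions with a Kalman filter; this simpler product
    rule only illustrates combining the two modules.)"""
    p1, p2 = np.asarray(p_proprio, float), np.asarray(p_vision, float)
    prior = np.ones_like(p1) if prior is None else np.asarray(prior, float)
    fused = prior * p1 * p2
    return fused / fused.sum()

classes = ["grass", "gravel", "asphalt"]   # hypothetical terrain classes
p_leg = [0.50, 0.35, 0.15]                 # proprioceptive module output (made up)
p_cam = [0.30, 0.60, 0.10]                 # visual module output (made up)
fused = fuse_class_probabilities(p_leg, p_cam)
print(classes[int(np.argmax(fused))], fused.round(3))
```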
|
|
ThAT26 |
103B |
Localization 5 |
Regular Session |
|
10:30-10:35, Paper ThAT26.1 | |
PEnG: Pose-Enhanced Geo-Localisation |
|
Shore, Tavis | University of Surrey |
Mendez, Oscar | University of Surrey |
Hadfield, Simon | University of Surrey |
Keywords: Localization, Vision-Based Navigation, Computer Vision for Transportation
Abstract: Cross-view Geo-localisation is typically performed at a coarse granularity, because densely sampled satellite image patches overlap heavily. This heavy overlap would make disambiguating patches very challenging. However, by opting for sparsely sampled patches, prior work has placed an artificial upper bound on the localisation accuracy that is possible. Even a perfect oracle system cannot achieve accuracy greater than the average separation of the tiles. To solve this limitation, we propose combining cross-view geo-localisation and relative pose estimation to increase precision to a level practical for real-world application. We develop PEnG, a 2-stage system which first predicts the most likely edges from a city-scale graph representation upon which a query image lies. It then performs relative pose estimation within these edges to determine a precise position. PEnG presents the first technique to utilise both viewpoints available within cross-view geo-localisation datasets, referring to this as Multi-View Geo-Localisation (MVGL). This enhances accuracy to a sub-metre level, with some examples achieving centimetre level precision. Our proposed ensemble achieves state-of-the-art accuracy, with relative Top-5m retrieval improvements of 213% over previous works, and decreases the median Euclidean distance error by 96.90%, from the previous best of 734 m down to 22.77 m, when evaluating with 90° horizontal FOV images. Code will be made available: github.com/tavisshore/peng.
|
|
10:35-10:40, Paper ThAT26.2 | |
Floor Plan Based Active Global Localization and Navigation Aid for Persons with Blindness and Low Vision |
|
Goswami, Raktim | New York University |
Sinha, Harshit | New York University |
Palacherla, Venkata Amith | New York University |
Hari, Jagennath | New York University, Laudando & Assessment LLC |
Krishnamurthy, Prashanth | New York University Tandon School of Engineering |
Rizzo, John-Ross | NYU School of Medicine / NYU Tandon School of Engineering |
Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Localization, Vision-Based Navigation, Motion and Path Planning
Abstract: Navigation of an agent, such as a person with blindness or low vision, in an unfamiliar environment poses substantial difficulties, even in scenarios where prior maps, like floor plans, are available. It becomes essential first to determine the agent’s pose in the environment. The task’s complexity increases when the agent also needs directions for exploring the environment to reduce uncertainty in the agent’s position. This problem of active global localization typically involves finding a transformation to match the agent’s sensor-generated map to the floor plan while providing a series of point-to-point directions for effective exploration. Current methods fall into two categories: learning-based, requiring extensive training for each environment, or non-learning-based, which generally depend on prior knowledge of the agent’s initial position, or the use of floor plan maps created with the same sensor modality as the agent. Addressing these limitations, we introduce a novel system for real-time, active global localization and navigation for persons with blindness and low vision. By generating semantically informed real-time goals, our approach enables local exploration and the creation of a 2D semantic point cloud for effective global localization. Moreover, it dynamically corrects for odometry drift using the architectural floor plan, independent of the agent’s global position, and introduces a new method for real-time loop closure on reversal. Our approach’s effectiveness is validated through multiple real-world indoor experiments, also highlighting its adaptability and ease of extension to any mobile robot.
|
|
10:40-10:45, Paper ThAT26.3 | |
Generalized Maximum Likelihood Estimation for Perspective-N-Point Problem |
|
Zhan, Tian | Beijing Institute of Technology |
Xu, Chunfeng | Beijing Institute of Technology |
Zhang, Cheng | Beijing Institute of Technology |
Zhu, Ke | Beijing Institute of Technology |
Keywords: Localization, Vision-Based Navigation, Probability and Statistical Methods
Abstract: The Perspective-n-Point (PnP) problem has been widely studied in the literature and applied in various vision-based pose estimation scenarios. However, most existing methods ignore the anisotropic uncertainty of observations, as demonstrated in several real-world datasets in this paper. This oversight may lead to suboptimal and inaccurate estimation, particularly in the presence of noisy observations. To this end, we propose a generalized maximum likelihood PnP solver, named GMLPnP, that minimizes the determinant criterion by iterating the generalized least squares procedure to estimate the pose and uncertainty simultaneously. Further, the proposed method is decoupled from the camera model. Results of synthetic and real experiments show that our method achieves better accuracy in common pose estimation scenarios: GMLPnP improves rotation/translation accuracy by 4.7%/2.0% on TUM-RGBD and 18.6%/18.4% on the KITTI-360 dataset compared to the best baseline. It is more accurate under very noisy observations in a vision-based UAV localization task, outperforming the best baseline by 34.4% in translation estimation accuracy.
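GMLPnP alternates a generalized least squares step with a noise-covariance update. The sketch below applies the same alternation to a toy linear model with unknown anisotropic 2D observation noise; it is an analogy to the estimation principle only, not the PnP solver itself:

```python
import numpy as np

# Toy linear analogue: alternate a generalized least squares (GLS) solve with
# a covariance re-estimate. All data below are synthetic.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0])
Sigma_true = np.array([[0.20, 0.00], [0.00, 0.01]])   # anisotropic 2D noise

A = rng.normal(size=(200, 2, 2))                       # per-observation design matrices
y = np.einsum("nij,j->ni", A, beta_true) \
    + rng.multivariate_normal(np.zeros(2), Sigma_true, size=200)

Sigma = np.eye(2)
for _ in range(10):
    W = np.linalg.inv(Sigma)
    H = np.einsum("nji,jk,nkl->il", A, W, A)           # sum_i A_i^T W A_i
    b = np.einsum("nji,jk,nk->i", A, W, y)             # sum_i A_i^T W y_i
    beta = np.linalg.solve(H, b)
    r = y - np.einsum("nij,j->ni", A, beta)
    Sigma = r.T @ r / len(r)                           # re-estimate noise covariance

print(beta.round(3), np.diag(Sigma).round(3))          # near beta_true and Sigma_true
```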
|
|
10:45-10:50, Paper ThAT26.4 | |
SCORE: Saturated Consensus Relocalization in Semantic Line Maps |
|
Jiang, Haodong | The Chinese University of Hong Kong, Shenzhen |
Zheng, Xiang | The Chinese University of Hong Kong, Shenzhen |
Zhang, Yanglin | The Chinese University of Hong Kong, Shenzhen |
Zeng, Qingcheng | The Hong Kong University of Science and Technology (Guangzhou) |
Li, Yiqian | University of Pennsylvania |
Hong, Ziyang | Heriot-Watt University |
Wu, Junfeng | The Chinese University of Hong Kong, Shenzhen |
Keywords: Localization, Vision-Based Navigation
Abstract: We present SCORE, a visual relocalization system that achieves unprecedented map compactness by adopting semantically labeled 3D line maps. SCORE requires only 0.01%–0.1% of the storage needed by structure-based or learning-based baselines, while maintaining practical accuracy and comparable runtime. The key innovation is a novel robust estimation mechanism, Saturated Consensus Maximization (Sat-CM), which generalizes classical Consensus Maximization (CM) by assigning diminishing weights to inlier associations with probabilistic justification. Under extreme outlier ratios (up to 99.5%) arising from one-to-many ambiguity in semantic matching, Sat-CM enables accurate estimation when CM fails. To ensure computational efficiency, we propose an accelerating framework for globally solving Sat-CM formulations and specialize it for the Perspective-n-Lines problem at the core of SCORE.
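A toy illustration of the difference between classical consensus maximization and a saturated variant is sketched below: candidate matches are grouped by the landmark they claim, and each landmark's contribution is capped. The min-cap weighting used here is a simplification of the paper's diminishing-weight scheme, intended only to convey the idea:

```python
import numpy as np

def consensus_score(residuals, eps):
    """Classical consensus maximization: count every association within eps."""
    return int(np.sum(np.abs(residuals) < eps))

def saturated_consensus_score(residuals, landmark_ids, eps, cap=1.0):
    """Saturated variant: inliers are grouped by the landmark they claim, and
    each landmark contributes at most `cap` (a crude stand-in for the paper's
    diminishing weights)."""
    score = 0.0
    for lid in np.unique(landmark_ids):
        inliers = np.sum(np.abs(residuals[landmark_ids == lid]) < eps)
        score += min(cap, float(inliers))
    return score

# One semantic landmark spawns many ambiguous (one-to-many) candidate matches:
residuals = np.array([0.01, 0.02, 0.015, 0.012, 0.5, 0.03])
landmark_ids = np.array([7, 7, 7, 7, 3, 3])
print(consensus_score(residuals, eps=0.05))                            # 5
print(saturated_consensus_score(residuals, landmark_ids, eps=0.05))    # 2.0
```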
|
|
10:50-10:55, Paper ThAT26.5 | |
Self-TIO: Thermal-Inertial Odometry Via Self-Supervised 16-Bit Feature Extractor and Tracker |
|
Lee, Junwoon | University of Michigan |
Ando, Taisei | The University of Tokyo |
Shinozaki, Mitsuru | Technology Innovation R&D Dept.Ⅱ, Research & Development H |
Kitajima, Toshihiro | KUBOTA Corporation |
An, Qi | The University of Tokyo |
Yamashita, Atsushi | The University of Tokyo |
Keywords: Localization, Visual-Inertial SLAM
Abstract: In recent years, thermal odometry has gained significant attention in mobile robotics for addressing visually degraded scenes. To achieve reasonable robustness and accuracy of thermal odometry, a learning-based image feature extractor and tracker has been proposed. While learning-based methods generally provide better feature tracking results in thermal images compared to classical methods, they still require labeled data for training and struggle with real-time execution. To deal with these issues, this paper presents a robust and accurate thermal-inertial odometry (TIO) system, Self-TIO, equipped with a self-supervised feature extractor and tracker designed for the 16-bit radiometric image domain. Moreover, Self-TIO employs a hybrid tracker, combining the Kanade-Lucas-Tomasi (KLT) tracker and learning-based optical flow, to achieve high robustness and subpixel accuracy, even in scenes affected by non-uniformity correction (NUC) and aggressive motion. Experimental results demonstrate that our method outperforms state-of-the-art methods in both feature tracking and thermal-inertial odometry.
|
|
10:55-11:00, Paper ThAT26.6 | |
High-Order Regularization Dealing with Ill-Conditioned Robot Localization Problems |
|
Liu, Xinghua | University of Groningen |
Cao, Ming | University of Groningen |
Keywords: Localization, Wheeled Robots, Range Sensing, High-order Regularization
Abstract: In this work, we propose a high-order regularization method to solve the ill-conditioned problems in robot localization. Numerical solutions to robot localization problems are often unstable when the problems are ill-conditioned. A typical way to solve ill-conditioned problems is regularization, and a classical regularization method is the Tikhonov regularization. It is shown that the Tikhonov regularization is a low-order case of our method. We find that the proposed method is superior to the Tikhonov regularization in approximating some ill-conditioned inverse problems, such as some basic robot localization problems. The proposed method overcomes the over-smoothing problem in the Tikhonov regularization as it uses more than one term in the approximation of the matrix inverse, and an explanation for the over-smoothing of the Tikhonov regularization is given. Moreover, one a priori criterion, which improves the numerical stability of the ill-conditioned problem, is proposed to obtain an optimal regularization matrix. As most of the regularization solutions are biased, we also provide two bias-correction techniques for the proposed high-order regularization. The simulation and experimental results using an Ultra-Wideband sensor network in a 3D environment are discussed, demonstrating the performance of the proposed method.
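For reference, ordinary Tikhonov regularization solves x = (AᵀA + λI)⁻¹Aᵀb with a single regularized term. The sketch below contrasts it with iterated Tikhonov, a standard multi-term refinement used here only as an analogy for going beyond one term; it is not the paper's specific high-order formulation:

```python
import numpy as np

def tikhonov(A, b, lam):
    """Classical Tikhonov solution x = (A^T A + lam*I)^{-1} A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

def iterated_tikhonov(A, b, lam, iters=3):
    """Refine the estimate by re-regularizing the residual several times
    (a classical multi-term scheme, shown only as an analogy to using more
    than one term in the approximation of the inverse)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x + tikhonov(A, b - A @ x, lam)
    return x

# Mildly ill-conditioned synthetic problem
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(50, 50)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = U[:, :4] @ np.diag([1.0, 0.5, 1e-2, 1e-3]) @ V.T
x_true = np.array([1.0, -1.0, 2.0, 0.5])
b = A @ x_true + 1e-4 * rng.normal(size=50)

x_t = tikhonov(A, b, lam=1e-4)
x_it = iterated_tikhonov(A, b, lam=1e-4)
print(np.linalg.norm(x_t - x_true), np.linalg.norm(x_it - x_true))
```

Re-regularizing the residual reduces the over-smoothing bias of a single Tikhonov step, which is the same shortcoming the paper's high-order regularization targets.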
|
|
11:00-11:05, Paper ThAT26.7 | |
BEVDiffLoc: End-To-End LiDAR Global Localization in BEV View Based on Diffusion Model |
|
Wang, Ziyue | National University of Defense Technology |
Shi, Chenghao | NUDT |
Wang, Neng | National University of Defense Technology |
Yu, Qinghua | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Keywords: Localization
Abstract: Localization is one of the core components of modern robotics. Classic localization methods typically follow a retrieve-then-register paradigm, achieving remarkable success. Recently, the emergence of end-to-end localization approaches has offered distinct advantages, including a streamlined system architecture and the elimination of the need to store extensive map data. Although these methods have demonstrated promising results, current end-to-end localization approaches still face limitations in robustness and accuracy. The Bird’s-Eye-View (BEV) image is one of the most widely adopted data representations in autonomous driving. It significantly reduces data complexity while preserving spatial structure and scale consistency, making it an ideal representation for localization tasks. However, research on BEV-based end-to-end localization remains notably insufficient. To fill this gap, we propose BEVDiffLoc, a novel framework that formulates LiDAR localization as a conditional generation of poses. Leveraging the properties of BEV, we first introduce a specific data augmentation method to significantly enhance the diversity of input data. Then, the Maximum Feature Aggregation Module and Vision Transformer are employed to learn robust features while maintaining robustness against significant rotational view variations. Finally, we incorporate a diffusion model that iteratively refines the learned features to recover the absolute pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate that BEVDiffLoc outperforms baseline methods. Our code is available at https://github.com/nubot-nudt/BEVDiffLoc.
|
|
11:05-11:10, Paper ThAT26.8 | |
RING#: PR-By-PE Global Localization with Roto-Translation Equivariant Gram Learning |
|
Lu, Sha | Zhejiang University |
Xu, Xuecheng | Zhejiang University |
Zhang, Dongkun | Zhejiang University |
Wu, Yuxuan | Shanghai Jiao Tong University |
Lu, Haojian | Zhejiang University |
Chen, Xieyuanli | National University of Defense Technology |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Localization, SLAM, Autonomous Vehicle Navigation, Place Recognition
Abstract: Global localization using onboard perception sensors, such as cameras and LiDARs, is crucial in autonomous driving and robotics applications when GPS signals are unreliable. Most approaches achieve global localization by sequential place recognition (PR) and pose estimation (PE). Some methods train separate models for each task, while others employ a single model with dual heads, trained jointly with separate task-specific losses. However, the accuracy of localization heavily depends on the success of place recognition, which often fails in scenarios with significant changes in viewpoint or environmental appearance. Consequently, this renders the final pose estimation of localization ineffective. To address this, we introduce a new paradigm, PR-by-PE localization, which bypasses the need for separate place recognition by directly deriving it from pose estimation. We propose RING#, an end-to-end PR-by-PE localization network that operates in the bird's-eye-view (BEV) space, compatible with both vision and LiDAR sensors. RING# incorporates a novel design that learns two equivariant representations from BEV features, enabling globally convergent and computationally efficient pose estimation. Comprehensive experiments on the NCLT and Oxford datasets show that RING# outperforms state-of-the-art methods in both vision and LiDAR modalities, validating the effectiveness of the proposed approach. The code is available at https://github.com/lus6-Jenny/RINGSharp.
|
|
ThAT27 |
103C |
Energy and Environment-Aware Automation 1 |
Regular Session |
Co-Chair: Xu, Fan | Shanghai Jiao Tong University |
|
10:30-10:35, Paper ThAT27.1 | |
Non-Buoyant Microrobots Swimming with Near-Zero Angle of Attack |
|
Ligtenberg, Leendert-Jan Wouter | University of Twente |
Jongh, de, Luuc | University of Twente |
van der Kooij, Jaap | University of Twente |
Paul, Aniruddha | University of Twente |
Li, Chuang | Liaoning Technical University |
Goulas, Constantinos | University of Twente |
Mohanty, Sumit | AMOLF |
Khalil, Islam S.M. | University of Twente |
Keywords: Medical Robots and Systems, Micro/Nano Robots, Motion Control
Abstract: In the design of microrobots, a helical geometry is pivotal to overcome the time-reversal constraints of the scallop theorem. The helical geometry enables the microrobots to propel themselves forward in viscous fluids with a corkscrew-like motion when they are allowed to rotate. It is physically advantageous for microrobots to swim with near-zero angle of attack much like buoyant microorganisms, allowing high thrust for forward propulsion. This type of propulsion is not possible for non-buoyant microrobots, which drift downward due to gravity. Here, we analyze the stability problem of controlling magnetically driven helical microrobots to achieve bounded straight runs without drift in a low-Reynolds-number regime. We demonstrate periodic active suspension solutions that facilitate helical propulsion with minimal angle of attack and zero drift. We theoretically predict unique control inputs, for a given helical microrobot geometry and magnetic composition (i.e., 62% Ni and 24% Au Wt%), which can be generated with rotating field and field-gradient pulling. Using microrobots fabricated with a denser-than-water soft-magnetic body (4870 kg·m⁻³), we find that the microrobot is allowed to swim with a near-zero angle of attack of 8.3° ± 5.2° (mean ± s.d.), outperforming conventional gravity compensation methods.
|
|
10:35-10:40, Paper ThAT27.2 | |
U-Snake: A Small-Sized Smart Underwater Snake Robot |
|
Wang, Bowen | Tongji University |
Zuo, Haobo | University of Hong Kong |
Fu, Changhong | Tongji University |
Keywords: Marine Robotics, Actuation and Joint Mechanisms, Motion Control
Abstract: With the rapid development of AI chips, underwater snake robots hold significant promise for navigating complex underwater environments, offering unique advantages in exploration, monitoring, and inspection tasks due to their flexible body and high mobility. However, existing underwater snake robots predominantly employ bulky mechanical configurations with expensive manufacturing costs, resulting in excessive power consumption and limited operational endurance with standard batteries, which impede their widespread adoption and limit their operational flexibility. Moreover, most path following methods used in underwater snake robots inadequately account for the dynamic changes in path curvature, leading to serious tracking errors in scenarios involving sharp turns or complex environments, which does not address the demands of more intricate trajectories. To address the above issues, this work introduces the U-Snake, a small-sized smart underwater snake robot with a simple lightweight structure and a highly maneuverable controller, adapted to various complicated path following tasks. In particular, each joint of U-Snake is designed to be small and lightweight to achieve higher spatial utilization, which is covered by a convenient and efficient 3D printed waterproof casing to achieve robust water resistance. In addition, a path following method based on the curvature of the path is designed to achieve high performance in various complicated trajectories. Furthermore, an integrated controller combines the method with the kinematics and dynamics models of U-Snake, enabling precise following of straight and curved paths. The experimental results demonstrate that the proposed control structure effectively guides U-Snake to follow the desired path.
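The path following method is driven by the curvature of the reference path. A small sketch of estimating discrete curvature along a sampled 2D path with finite differences is given below; the circular test path is purely illustrative:

```python
import numpy as np

def path_curvature(xy):
    """Discrete curvature of a sampled 2D path via finite differences:
    kappa = (x' y'' - y' x'') / (x'^2 + y'^2)^(3/2)."""
    x, y = xy[:, 0], xy[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    return (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5

# Sanity check on a circular arc of radius 2 m: curvature should be ~0.5 1/m.
t = np.linspace(0.0, np.pi, 200)
path = np.stack([2.0 * np.cos(t), 2.0 * np.sin(t)], axis=1)
print(round(float(path_curvature(path)[5:-5].mean()), 3))   # ~0.5
```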
|
|
10:40-10:45, Paper ThAT27.3 | |
Origami-Inspired Pneumatic Continuum Module: Stiffness Modeling and Validation |
|
Li, Zhuowen | Shanghai Jiao Tong University |
Chen, Huaiyuan | Shanghai Jiao Tong University |
Xu, Chunshan | Shanghai Artificial Intelligence Research Institute |
Xu, Fan | Shanghai Jiao Tong University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Sensors and Actuators, Hydraulic/Pneumatic Actuators
Abstract: This paper establishes a stiffness model for an origami-inspired pneumatic continuum module (OPM) capable of large stretch ratios and active stiffness modulation. A kinematic model is first established, using the piecewise constant curvature assumption, in order to describe the end-effector's posture by configuration states. Subsequently, utilizing virtual work theory, the static model is derived, which integrates both pneumatic actuation and intrinsic elastic energy. Based on this foundation, a Cartesian compliance matrix is formulated to quantitatively predict 3D deformations under external loads. Experimental validation demonstrates accurate prediction of spatial deformations, with maximum errors of 2.00 mm (z-axis) and 2.04° (roll) under 500 g payloads. This study aims to bridge pressure-stiffness coupling and enable model-based stiffness-position control for adaptive tasks.
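The kinematic model rests on the piecewise constant curvature (PCC) assumption. Below is a minimal sketch of the standard single-segment PCC forward map from curvature, arc length, and bending-plane angle to an end-effector transform; the segment parameters are illustrative and not the OPM's:

```python
import numpy as np

def pcc_forward(kappa, ell, phi):
    """Homogeneous transform of one constant-curvature segment
    (curvature kappa [1/m], arc length ell [m], bending-plane angle phi [rad]),
    using the common convention T = Rz(phi) * T_bend(kappa, ell) * Rz(-phi)."""
    theta = kappa * ell
    if abs(kappa) < 1e-9:                    # straight-segment limit
        p = np.array([0.0, 0.0, ell])
        Ry = np.eye(3)
    else:
        p = np.array([(1 - np.cos(theta)) / kappa, 0.0, np.sin(theta) / kappa])
        c, s = np.cos(theta), np.sin(theta)
        Ry = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    c, s = np.cos(phi), np.sin(phi)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rz.T
    T[:3, 3] = Rz @ p
    return T

# A 0.1 m segment bent into a quarter circle in the x-z plane:
print(np.round(pcc_forward(kappa=np.pi / 0.2, ell=0.1, phi=0.0), 3))
```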
|
|
10:45-10:50, Paper ThAT27.4 | |
Computing Forward Statics from Tendon-Length in Flexible-Joint Hyper-Redundant Manipulators |
|
Feng, Weiting | The University of Edinburgh |
Walker, Kyle Liam | EPFL |
Yang, Yunjie | The University of Edinburgh |
Giorgio-Serchi, Francesco | University of Edinburgh |
Keywords: Modeling, Control, and Learning for Soft Robots, Tendon/Wire Mechanism, Compliant Joints and Mechanisms
Abstract: Hyper-redundant tendon-driven manipulators offer greater flexibility and compliance over traditional manipulators. A common way of controlling such manipulators relies on adjusting tendon lengths, which is an accessible control parameter. This approach works well when the kinematic configuration is representative of the real operational conditions. However, when dealing with manipulators of larger size subject to gravity, it becomes necessary to solve a static force problem, using tendon force as the input and employing a mapping from the configuration space to retrieve tendon length. Alternatively, measurements of the manipulator posture can be used to iteratively adjust tendon lengths to achieve a desired posture. Hence, either tension measurement or state estimation of the manipulator is required, both of which are not always accurately available. Here, we propose a solution by reconciling cable tension and length as the input for the solution of the system forward statics. We develop a screw-based formulation for a tendon-driven, multi-segment, hyper-redundant manipulator with elastic joints and introduce a forward statics iterative solution method that equivalently makes use of either tendon length or tension as the input. This strategy is experimentally validated using a traditional tension input first, subsequently showing the efficacy of the method when exclusively tendon lengths are used. The results confirm the possibility to perform open-loop control in static conditions using a kinematic input only, thus bypassing some of the practical problems with tension measurement and state estimation of hyper-redundant systems.
|
|
10:50-10:55, Paper ThAT27.5 | |
Intelligent Output-Feedback Speed Tracking System for Servo Drives Via Adaptive Error Stabilization and Order Reduction Approaches (I) |
|
Kim, Seok-Kyoon | Huzhou Institute of Zhejiang University, Huzhou, Zhejiang, China |
Lim, Sun | Korea Electronics Technology Institute |
Shi, Peng | The University of Adelaide |
Ahn, Choon Ki | Korea University |
Keywords: Motion Control, Robust/Adaptive Control
Abstract: This paper exhibits an innovative intelligent output-feedback solution for the speed tracking problem in servo drives, aimed at reducing peak current as well as the reliance on sensor measurements, system modeling, and load information. The proposed output-feedback system adopts a conventional multi-loop structure that operates without current measurements and offers three contributions: (a) a model-free speed observer based on a low-pass filter using position measurements, achieved through an order reduction technique; (b) outer loop intelligence that maintains critically damped tracking performance while lowering peak current levels; and (c) a simple adaptive proportional-derivative (PD) control for stabilizing inner-loop errors designed by an order reduction technique. Experimental validation of the system is conducted using a 500 W brushless DC motor prototype, demonstrating its effectiveness subject to the various load conditions.
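The inner loop described above is built around a PD law (made adaptive in the paper). As a baseline illustration only, the sketch below runs a fixed-gain PD speed loop on an invented first-order motor model, without the observer, adaptation, or order-reduction contributions:

```python
# Fixed-gain PD speed loop on an invented first-order motor model (forward
# Euler). The paper's controller is adaptive, observer-based, and current-free;
# this is only the textbook baseline it extends.
J, b, kt = 2e-4, 1e-4, 0.05     # inertia, viscous friction, torque constant (made up)
Kp, Kd = 0.8, 0.001             # PD gains (made up)
dt, omega_ref = 1e-3, 150.0     # step size [s], speed reference [rad/s]

omega, prev_err = 0.0, omega_ref
for _ in range(2000):
    err = omega_ref - omega
    u = Kp * err + Kd * (err - prev_err) / dt      # PD control input
    prev_err = err
    omega += dt * (kt * u - b * omega) / J         # plant update

print(round(omega, 1))   # settles close to the 150 rad/s reference
```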
|
|
10:55-11:00, Paper ThAT27.6 | |
Tele-GS: 3D Gaussian Scene Representation for Low-Bandwidth Teleoperation |
|
Zhao, Chunyang | Nanyang Technological University |
Zhou, Zeyu | Nanyang Technological University |
Liu, Haoran | The Hong Kong Polytechnic University |
Kircali, Dogan | Nanyang Technological University |
Yang, Huan | Nanyang Technological University |
Low, Chang Boon | Nanyang Technological University |
Wang, Yuanzhe | Shandong University |
Wang, Danwei | Nanyang Technological University |
Keywords: Engineering for Robotic Systems, Telerobotics and Teleoperation
Abstract: Video streaming based teleoperation often faces a trade-off between bandwidth consumption and the need for high-fidelity telepresence. Higher image resolution or a wider field of view (FOV) substantially increases bandwidth requirements. In this paper, we propose a novel telepresence model for teleoperated vehicles operating in bandwidth-constrained environments. Our approach employs a LiDAR-fused 3D Gaussian Splatting (3DGS) as a compact scene representation to efficiently generate remote views. Initially, a static point cloud map is constructed using LiDAR-based semantic mapping, which serves as the initial Gaussians for optimizing the 3DGS model. During teleoperation, the prebuilt 3DGS is then rendered on the teleoperation platform, while only safety-critical information, such as vehicle pose and dynamic objects, is transmitted from the vehicle to the teleoperator in real-time. The proposed telepresence model significantly reduces data transmission requirements while maintaining photorealistic telepresence, enabling reliable and effective teleoperation even under stringent bandwidth constraints. This capability ensures safe and efficient vehicle teleoperation under challenging environments without relying on traditional high-bandwidth communication, thereby broadening the applicability of teleoperation technology to more demanding and diverse operational scenarios. Real-world experimental results show that the developed system can provide immersive teleoperation experiences at Kbps-level bandwidth consumption.
|
|
11:00-11:05, Paper ThAT27.7 | |
Runtime Energy-Efficient Control Policy for Mobile Robots with Computing Workload and Battery Awareness |
|
Wu, Chen | University of Turku |
Haghbayan, Hashem | University of Turku |
Malik, Abdul | University of Turku |
Miele, Antonio | Politecnico Di Milano |
Plosila, Juha | University of Turku |
Keywords: Energy and Environment-Aware Automation, Embedded Systems for Robotic and Automation
Abstract: Energy efficiency is a fundamental goal in robotic control. Various components within a robot, such as mechanical systems, computational units, and sensors, consume energy, all powered by the battery unit. Each component features several actuators and individual controllers that optimize energy usage locally, often without regard to one another. In this paper, we highlight a significant phenomenon indicating a considerable dependency between the mechanical and computational parts of the robot as energy consumers and the battery state of charge (SOC) as the energy provider. We demonstrate that as the battery SOC fluctuates, the behavior of energy consumption also varies, necessitating a unified controller with awareness of this relationship. Motivated by this observation, we propose a battery-aware co-optimization strategy for the mechanical and computational units, leveraging configuration space exploration to optimize the motor speed and the CPU frequency under different environmental conditions and battery SOC levels. Experimental results demonstrate the effectiveness of our approach in extending the operational lifetime of a robot under varying battery SOC and workload conditions, enhancing the energy efficiency of a case study rover by up to 53.93% w.r.t. selected baselines and similar past approaches.
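The co-optimization explores a configuration space of motor speed and CPU frequency for a given battery SOC. The sketch below performs an exhaustive search over a small hypothetical grid with invented power, time, and battery-loss models; it only illustrates how the energy-optimal configuration can shift toward lower power when the SOC is low:

```python
import itertools

# Hypothetical power, time, and battery-loss models; every constant below is
# invented purely to illustrate configuration-space exploration.
def mission_energy_and_time(motor_speed, cpu_freq_ghz, soc):
    power = 20.0 + 20.0 * motor_speed**2 + 1.5 + 3.0 * cpu_freq_ghz**3    # W
    loss_factor = 1.0 + 0.05 * (1.0 - soc) * power     # crude low-SOC resistive losses
    time_s = 60.0 / motor_speed + 15.0 / cpu_freq_ghz  # travel + computation time
    return loss_factor * power * time_s, time_s

def best_configuration(soc, deadline_s=120.0):
    speeds = [0.4, 0.6, 0.8, 1.0]          # normalized motor speed settings
    freqs = [0.6, 1.0, 1.4, 1.8]           # CPU frequency settings [GHz]
    feasible = [(e, v, f)
                for v, f in itertools.product(speeds, freqs)
                for e, t in [mission_energy_and_time(v, f, soc)]
                if t <= deadline_s]
    return min(feasible)                    # lowest-energy feasible configuration

for soc in (0.9, 0.3):
    energy, v, f = best_configuration(soc)
    print(f"SOC={soc:.1f}: motor={v}, cpu={f} GHz, energy~{energy:.0f} J")
```

With these invented models the search picks a faster configuration at high SOC and falls back to a lower-power one when the SOC drops, which is the qualitative behavior the paper exploits.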
|
|
11:05-11:10, Paper ThAT27.8 | |
Semantic Risk Assessment in Visual Scenes for AUV-Assisted Marine Debris Removal |
|
Singh, Sakshi | University of Minnesota |
Sattar, Junaed | University of Minnesota |
Keywords: Marine Robotics, Object Detection, Segmentation and Categorization, Energy and Environment-Aware Automation
Abstract: Underwater debris is a significant and growing challenge that autonomous underwater vehicles (AUVs) can help alleviate, but robot-guided debris search and removal can also cause harm to the aquatic ecosystem or other humans engaged in cleanup missions if the AUVs are unable to assess the risks associated with their actions. We introduce a method for identifying such risks in an underwater scene in the context of AUV debris search and removal tasks. Our approach integrates a vision language model (VLM) with monocular depth estimation to effectively classify and localize objects in a marine scene, specifically submerged marine debris. We use the pixel distance and the depth difference from the monocular depth map to identify entities that are sensitive to harm in proximity to the debris. We collect and annotate a custom dataset containing images in three different marine and aquatic environments containing debris and other such sensitive entities, and compare classification performance for different types of prompts. We observe that the prompts describing the debris properties (e.g., “eroded trash”) demonstrate a significant increase in accuracy compared to the use of object names directly as prompts. Our method successfully identifies debris that is safe to remove in complex scenes and turbid water conditions, highlighting the potential of using VLMs for risk assessment in AUV operations in the diverse underwater domain.
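Risk is assessed from pixel distance and depth difference between debris and harm-sensitive detections. The sketch below applies that check to hypothetical bounding boxes and a synthetic depth map with made-up thresholds; the VLM detection and prompting stages are not reproduced:

```python
import numpy as np

def is_safe_to_remove(debris_box, sensitive_boxes, depth, px_thresh=80, d_thresh=0.5):
    """Flag debris as unsafe to grasp if any harm-sensitive detection is both
    close in the image (pixel distance) and close in range (depth difference).
    Boxes are (x0, y0, x1, y1); thresholds are illustrative."""
    def center_and_depth(box):
        x0, y0, x1, y1 = box
        center = np.array([(x0 + x1) // 2, (y0 + y1) // 2])
        return center, float(np.median(depth[y0:y1, x0:x1]))

    c_d, z_d = center_and_depth(debris_box)
    for box in sensitive_boxes:
        c_s, z_s = center_and_depth(box)
        if np.linalg.norm(c_d - c_s) < px_thresh and abs(z_d - z_s) < d_thresh:
            return False
    return True

# Hypothetical example: a 240x320 depth map, debris at ~2 m, a fish nearby at ~2.1 m.
depth = np.full((240, 320), 5.0)
depth[100:140, 150:190] = 2.0          # debris region
depth[110:150, 200:240] = 2.1          # harm-sensitive entity region
print(is_safe_to_remove((150, 100, 190, 140), [(200, 110, 240, 150)], depth))  # False
```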
|
|
ThAT28 |
104 |
Rehabilitation Robotics 1 |
Regular Session |
Chair: Yu, Ningbo | Nankai University |
|
10:30-10:35, Paper ThAT28.1 | |
A Multi-Layered Quantitative Assessment Approach for Hand Spasticity Based on a Cable-Actuated Hand Exoskeleton |
|
Yu, Hao | The University of Edinburgh |
Nelson, Alyson | Astley Ainslie Hospital |
Erden, Mustafa Suphi | Heriot-Watt University |
Keywords: Rehabilitation Robotics, Prosthetics and Exoskeletons, Wearable Robotics
Abstract: Over the past two decades, numerous hand exoskeletons have been developed for rehabilitation scenarios, yet very few have the capability to assess spasticity. This paper introduces a quantitative assessment approach for hand spasticity developed on a cable-driven hand exoskeleton that was specifically designed with the spastic hand in mind. The exoskeleton features an adaptive cable-linkage transmission mechanism equipped with bi-directional force transducers, enabling constant-velocity extension and flexion of individual finger joints while simultaneously recording joint torque and angle. Based on the exoskeleton, a multi-layered quantitative assessment approach is proposed to evaluate spasticity on individual finger joints. The basis of the approach is collecting joint resistance information automatically following the Modified Tardieu Scale, which serves as the basis for computing six middle-level parameters related to the properties of a spastic joint. The parameters include the range of motion, average resistance torque, joint stiffness, joint viscosity, catch angle, and reflex-induced resistance torque. The first four mechanical parameters are finally combined into a biomechanical metric, whereas the remaining two reflex-related parameters result in a neurological metric, to neatly describe spasticity level. The performance of the exoskeleton in measuring the parameters was tested on six healthy subjects. The assessment approach applied with the developed exoskeleton shows good reliability, repeatability, and validity to capture the features of spastic hands, providing strong evidence for further validation on real patients.
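Two of the middle-level parameters, joint stiffness and viscosity, can be obtained by fitting a torque model to the recorded joint angle and torque. A sketch with synthetic data and the simple model tau = k*theta + b*theta_dot + c is shown below; the paper's exact parameter definitions may differ:

```python
import numpy as np

# Synthetic joint angle/torque data (not from the paper) and a least-squares
# fit of tau = k*theta + b*theta_dot + c for stiffness k and viscosity b.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 400)
theta = 0.6 * np.sin(0.5 * np.pi * t)            # joint angle sweep [rad]
theta_dot = np.gradient(theta, t)                # joint angular velocity [rad/s]

k_true, b_true, c_true = 1.8, 0.25, 0.1          # stiffness, viscosity, offset
tau = k_true * theta + b_true * theta_dot + c_true + 0.02 * rng.normal(size=t.size)

X = np.column_stack([theta, theta_dot, np.ones_like(theta)])
k_hat, b_hat, c_hat = np.linalg.lstsq(X, tau, rcond=None)[0]
print(f"stiffness={k_hat:.2f} Nm/rad, viscosity={b_hat:.2f} Nms/rad, offset={c_hat:.2f} Nm")
```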
|
|
10:35-10:40, Paper ThAT28.2 | |
A Rehabilitation Robot System to Enhance Proprioception with Physical and Virtual Simulation of Multi-Terrain Scenarios |
|
Hao, Liziyi | Nankai University |
Zhou, Zhaocheng | Nankai University |
Zheng, Honghao | Nankai University |
Wang, Xiangyu | Nankai University |
Han, Jianda | Nankai University |
Yu, Ningbo | Nankai University |
Keywords: Rehabilitation Robotics, Human-Centered Robotics, Parallel Robots
Abstract: Increasing evidence highlights the role of proprioceptive deficits in falls, emphasizing the need for targeted rehabilitation in populations with functional movement disorders. Despite advances in rehabilitation robots, movement constraints still hinder active engagement of the lower limb muscles, thereby limiting the effectiveness of proprioceptive training. In this work, we developed a neuro-rehabilitation robotic platform to address this need by physically and virtually simulating multi-terrain scenarios. The robot introduces common perturbations, such as uneven mountain trails, sandy beaches, and bumpy bus rides, to assess user stability and recovery, thereby assisting in the design of individualized training programs. The platform enhances neuromuscular responses across multiple directions and facilitates targeted muscle contraction through motor tasks that combine proprioceptive and visual feedback. Preliminary studies demonstrated that the robot successfully facilitated a complete range of ankle rotational movements. Electromyographic analysis revealed increased activation of specific muscle groups and changes in muscle loading and contraction patterns, suggesting that the system recruits multiple muscle groups while enhancing proprioceptive input to peri-articular soft tissues. The proposed robot and control strategies established a feasible solution to enhance proprioception rehabilitation.
|
|
10:40-10:45, Paper ThAT28.3 | |
ULRVT Ⅱ: A Novel Upper Limb Rehabilitation Robot with Joint Synergy Control and Evaluation for Virtual Training |
|
Yang, Lei | Key Laboratory of Robotics and System, Harbin Institute of Techn |
Zhang, Fuhai | Harbin Institute of Technology |
Wu, Tianyang | State Key Laboratory of Robotics and System, Harbin Institute Of |
Jiang, Tongxin | Harbin Institute of Technology |
Fu, Yili | Harbin Institute of Technology |
Keywords: Rehabilitation Robotics
Abstract: Global population aging has led to a sharp increase in the number of patients with upper limb motor dysfunction. Robot-assisted virtual training, as a novel solution, can offer safe and precise assistance for upper limb rehabilitation. However, it remains a critical challenge to compensate for virtual interaction forces and realize joint synergy movement. In this paper, we design an upper limb rehabilitation robot for virtual training (ULRVT Ⅱ), which is a cable-driven exoskeleton with high compatibility controlled by a joint synergy method. Moreover, we establish a rehabilitation platform with a virtual training environment and evaluation system for experimental validation. Tests for the performance of joint synergy and virtual training are carried out to show the effectiveness of our robot.
|
|
10:45-10:50, Paper ThAT28.4 | |
An Improved Flexible Hand Exoskeleton with SEA for Finger Strength Estimation and Progressive Resistance Exercise |
|
Zheng, Honghao | Nankai University |
Zhou, Zhaocheng | Nankai University |
Hao, Liziyi | Nankai University |
Wang, Xiangyu | Nankai University |
Han, Jianda | Nankai University |
Yu, Ningbo | Nankai University |
Keywords: Rehabilitation Robotics, Wearable Robotics, Flexible Robotics
Abstract: Hand exoskeletons can recognize a user's intent and provide active resistance training to enhance finger strength in stroke patients. However, achieving fine human-robot interaction (HRI) while maintaining system simplicity for lightweight design remains a key challenge. In this work, we present an improved flexible hand exoskeleton with a series elastic actuator (SEA) for hand strength estimation and progressive resistance exercise. The SEA design allows the hand exoskeleton to have backdrivability to improve HRI performance. By combining the flexible linkage with the flex sensor, we propose a novel user interface that is able to sensitively acquire hand motion intent. An Extended Kalman Filter (EKF) based tracking-error estimation method is designed to evaluate the finger strength. The results of the finger strength estimation are used to adjust the parameters of the admittance model to provide small or large damping when the user's finger strength is low or high, achieving active admittance control based progressive resistance exercise. The feasibility has been demonstrated by two sets of experiments, and this work has established a hand exoskeleton solution for finger strength estimation and fine human-robot interaction.
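Progressive resistance is realized by an admittance model whose damping is adjusted with the estimated finger strength. The sketch below runs a one-degree-of-freedom discrete admittance update with an invented strength-to-damping mapping; the EKF-based strength estimator itself is not reproduced:

```python
def admittance_step(v, f_ext, strength, dt=0.01, M=0.05):
    """One step of a 1-DoF admittance model M*dv/dt + B*v = f_ext, where the
    damping B grows with the estimated finger strength to provide progressive
    resistance. The mapping strength -> damping is made up for illustration."""
    B = 0.5 + 2.0 * strength
    dv = (f_ext - B * v) / M
    return v + dt * dv

for strength in (0.2, 0.8):                 # weak vs strong finger (hypothetical)
    v = 0.0
    for _ in range(200):                    # constant 1 N interaction force
        v = admittance_step(v, f_ext=1.0, strength=strength)
    print(strength, round(v, 3))            # stronger finger -> more damping -> slower motion
```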
|
|
10:50-10:55, Paper ThAT28.5 | |
An Investigation into the Application of an RL-GA-Based Multi-Modal Motion Somatosensory Optimization Control Strategy for a Novel Rehabilitation Robot |
|
Wu, Junyu | Harbin Institute of Technology |
Wang, He | Harbin Engineering University |
Zhang, Gongzi | Chinese PLA General Hospital |
Liu, Yubin | Harbin Institute of Technology |
Zhao, Jie | Harbin Institute of Technology |
Cai, Hegao | Harbin Institute of Technology |
Keywords: Rehabilitation Robotics, Optimization and Optimal Control, Motion Control
Abstract: The increasing prevalence of balance disorders presents significant challenges to rehabilitation therapy, prompting the development of rehabilitation robots as effective tools for motion training. To enhance the patient experience and rehabilitation outcomes in human-robot collaborative training, a motion control strategy based on reinforcement learning and genetic algorithms (RL-GA) with somatosensory enhancement is proposed and implemented in a novel rehabilitation robot. This study introduces a mathematical model of the human sensory system to quantify somatosensory feedback related to motion and integrates the washout algorithm (WA) into the robot's control system, facilitating the reproduction of somatosensation and motion simulation. Three typical rehabilitation modes—level walking, stair climbing, and stair descending—are selected, with natural gait features extracted as predefined trajectories. The WA parameters for each mode are optimized using RL-GA in various rehabilitation scenarios. Simulation results demonstrate that the washout filtering optimization method using RL-GA reduces theoretical somatosensory error by approximately 10% to 20% across all three rehabilitation modes, compared to traditional GA optimization. The experimental results further confirm the reliability and feasibility of the proposed method. The proposed approach enhances the realism of robot-assisted motion, thereby theoretically improving training effectiveness and accelerating the rehabilitation process.
|
|
10:55-11:00, Paper ThAT28.6 | |
Policy Learning for Social Robot-Led Physiotherapy |
|
Bettosi, Carl | Heriot-Watt University |
Baillie, Lynne | Heriot-Watt University |
Shenkin, Susan | University of Edinburgh |
Romeo, Marta | Heriot-Watt University |
Keywords: Rehabilitation Robotics, Reinforcement Learning, Modeling and Simulating Humans
Abstract: Social robots offer a promising solution for autonomously guiding patients through physiotherapy exercise sessions, but effective deployment requires advanced decision-making to adapt to patient needs. A key challenge is the scarcity of patient behavior data for developing robust policies. To address this, we engaged 33 expert healthcare practitioners as patient proxies, using their interactions with our robot to inform a patient behavior model capable of generating exercise performance metrics and subjective scores on perceived exertion. We trained a reinforcement learning-based policy in simulation, demonstrating that it can adapt exercise instructions to individual exertion tolerances and fluctuating performance, while also being applicable to patients at different recovery stages with varying exercise plans.
|
|
11:00-11:05, Paper ThAT28.7 | |
Hybrid Learning-Based Balance Function Assessment of Stroke Patients with a Single Ear-Worn IMU |
|
Zhao, Tianshu | Shanghai Jiao Tong University |
Xu, Zhenye | Shanghai Jiao Tong University |
Wang, Pu | The Seventh Affilated Hospital, Sun Yat-Sen Univeristy |
Guo, Yao | Shanghai Jiao Tong University |
Keywords: Rehabilitation Robotics, Wearable Robotics
Abstract: Rehabilitation robotics has attracted increasing attention due to its ability to provide continuous, precise, and adaptive treatment programs for stroke patients during their recovery. Accurately assessing lower-limb motor function is crucial in effectively implementing robot-assisted rehabilitation. This study proposes a novel application of a hybrid learning framework that leverages a single-ear-worn inertial measurement unit (IMU) combined with deep learning techniques to predict the Berg Balance Scale (BBS) scores. Participants performed a 3-meter Timed Up and Go (TUG) test while wearing the e-AR sensor. The collected 6-axis IMU data were processed through a CNN-LSTM framework, where we integrated time-domain, frequency-domain, and static features to enhance the model's regression performance. Experimental results demonstrate that our proposed method achieves a mean absolute error (MAE) of 1.074, surpassing previous studies' reported results and outperforming traditional machine learning and conventional deep learning algorithms when applied to ear-worn sensor data. The proposed framework is simple to operate yet accurate, making it suitable for patients' self-assessment even in a home environment.
|
|
11:05-11:10, Paper ThAT28.8 | |
SEMG-Based Continuous Motion Prediction of Shoulder Exoskeleton Control Using the VGANet Model |
|
Jiang, Tongxin | Harbin Institute of Technology |
Zhang, Fuhai | Harbin Institute of Technology |
Yang, Lei | Key Laboratory of Robotics and System, Harbin Institute of Techn |
Wu, Tianyang | State Key Laboratory of Robotics and System, Harbin Institute Of |
Fu, Yili | Harbin Institute of Technology |
Keywords: Rehabilitation Robotics
Abstract: Wearable exoskeleton robots play a crucial role in promoting upper limb function recovery. To enhance human-robot interaction and achieve precise control, continuous prediction of limb joint angles is required. This paper proposes a decoupled network model (VGANet) based on Variable Graph Convolutional Networks (V-GCN) and Temporal External Attention (TEA) for motion prediction in upper limb rehabilitation training. By establishing a mapping relationship between surface electromyography (sEMG) signals and upper limb movements, the model can predict future joint angles based on real-time sEMG signals. Experimental results demonstrate that this method can achieve continuous motion prediction for the shoulder joint and has been successfully applied to the control system of exoskeleton robots, providing an effective solution for the intelligent development of rehabilitation exoskeletons.
|
|
ThAT29 |
105 |
Wearable Robotics 1 |
Regular Session |
|
10:30-10:35, Paper ThAT29.1 | |
Adaptive Human Movement Compensation Control of Supernumerary Robotic Limb for Overhead Support Task with Non-Zero-Sum Differential Game Theory (I) |
|
Zhang, Jianxi | SEU |
Zeng, Hong | Southeast University |
Liu, Jia | Nanjing University of Information Science & Technology |
Chen, Dapeng | Nanjing University of Information Science & Technology |
Song, Aiguo | Southeast University |
Keywords: Wearable Robotics, Physical Human-Robot Interaction
Abstract: The supernumerary robotic limb (SRL) mounted on the shoulder has been demonstrated to be able to serve as a third arm to assist a human in overhead support tasks. However, the mechanical connection between the wearer and the SRL means that the human operator’s movement will continually disturb the SRL and may lead to instability. Moreover, there may be physical conflicts between the SRL and the human operator due to their different intentions, which may potentially increase the load on the human operator. Therefore, it is necessary to control the SRL to ensure stable supporting and transparent interaction. Here, we propose an adaptive controller for human movement compensation. Firstly, we model the human-SRL coordinative behavior based on nonzero-sum differential game theory, aiming to enhance the support stability and reduce the operator’s load; this framework is capable of dynamically regulating the control strategies between two interacting agents. We then implement an adaptive control strategy that adjusts the SRL’s input optimally in response to the human operator’s input, in the sense of a Nash equilibrium, to meet predefined control objectives. From experimental results during the overhead support task, the proposed controller reduces the peak reaction force from 15.59 N to 4.50 N compared to the state-of-the-art controller that disregards human input while ensuring the stability of support, thereby providing an advantage for human-SRL coordination in overhead support tasks.
|
|
10:35-10:40, Paper ThAT29.2 | |
Development of a Sensor Suit for Gait Environment Detection Using Non-Contact Sensors and Integrated Model (I) |
|
Choi, Junhwan | Korea Advanced Institute of Science and Technology, (KAIST) |
Feng, Jirou | Korea Advanced Institute of Science and Technology |
Cho, Junhwi | KAIST |
Kim, Jung | KAIST |
Keywords: Wearable Robotics, Soft Sensors and Actuators, Multi-Modal Perception for HRI
Abstract: This study introduces a novel non-contact wearable sensor suit designed to classify various walking environments, integrating capacitive stretch sensors and pneumatic mechanomyography (pMMG) sensors. Unlike traditional surface electromyography (sEMG), which is prone to noise and electrode displacement, the proposed sensor suit, designed in pants, ensures consistent signal acquisition without requiring direct skin contact. Experiments with eight participants in five walking conditions—flat ground, ramps (ascent/descent), and stairs (ascent/descent)—demonstrated superior performance using long short-term memory (LSTM) models. In intra-subject experiments, where individual models were trained and tested on data from a single subject, the sensor suit achieved a classification accuracy of 96.1%, outperforming the 94.0% accuracy of traditional sEMG-based systems. In inter-subject experiments, where a general model was trained on multiple subjects and tested on unseen individuals, the suit achieved a classification accuracy of 91.0%, significantly surpassing the 76.4% accuracy of sEMG. Statistical analysis further validated the suit’s ability to generalize between users and environments, with significant p-values (p < 0.01) observed under multiple conditions. These findings emphasize the potential of the sensor suit for practical applications in wearable robotics and gait assist systems. Future research will focus on improving the system’s adaptability and extending its capabilities beyond classifying walking trials to detecting pathological patterns, thereby significantly enhancing its clinical relevance.
|
|
10:40-10:45, Paper ThAT29.3 | |
Development of a Suspension Backpack with Quasi-Zero Stiffness and Controllable Damping |
|
Ju, Haotian | Harbin Institute of Technology |
Zhao, Sikai | Harbin Institute of Technology |
Sui, Dongbao | Ji Hua Laboratory |
Li, Hongwu | Harbin Institute of Technology |
Guo, Songhao | Harbin Institute of Technology |
Liu, Junchen | Harbin Institute of Technology |
Wang, Ziqi | Harbin Institute of Technology |
Zhang, Qinghua | Harbin Institute of Technology |
Zhao, Jie | Harbin Institute of Technology |
Zhu, Yanhe | Harbin Institute of Technology |
Keywords: Wearable Robotics, Human-Centered Robotics, Machine Learning for Robot Control
Abstract: Previous research has shown that load-bearing with elastic suspension backpacks improves human biomechanics and reduces human energy expenditure. Constant-force suspension backpacks (CFSB) with zero stiffness were developed to minimize the inertial force of loads. However, there is a mismatch between the load and the constant force mechanism because the user’s waist is not fully upright while moving. The load will exert inertial forces on the human body, increasing energy expenditure. In this paper, a suspension backpack with quasi-zero stiffness and controlled damping (CQFB) is developed. The position of the load can be adjusted by the electromagnetic damping force from motors without consuming additional electrical energy. A Q-learning based variable damping controller is proposed to keep the load in the middle of the slide with a small external force. The results of the load disturbance rejection experiments show that the CQFB effectively prevents the load from hitting the travel limit after it is subjected to a bias force and reduces the net metabolism by 11.4% compared with the ordinary backpack (OB). The results of the peak accelerative vertical force experiments show that the CQFB can reduce the peak accelerative vertical force of the load at three speeds, with a maximum average reduction of 86.8%. The controllable damping device has no significant effect on the levitation effect.
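The variable damping controller is trained with Q-learning. A generic tabular Q-learning loop on an invented discretization of load-position error and damping levels is sketched below; states, transition model, and reward are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3            # discretized load-position error x damping levels
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1    # learning rate, discount, exploration rate

def step(state, action):
    """Hypothetical environment: the chosen damping level shifts the load,
    and the reward penalizes distance from the middle of the slide."""
    drift = rng.integers(-1, 2)
    next_state = int(np.clip(state + drift - (action - 1), 0, n_states - 1))
    reward = -abs(next_state - n_states // 2)
    return next_state, reward

state = 0
for _ in range(5000):
    action = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # Standard Q-learning temporal-difference update
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(np.argmax(Q, axis=1))   # learned damping choice per error bin
```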
|
|
10:45-10:50, Paper ThAT29.4 | |
ELabrador: A Wearable Navigation System for Visually Impaired Individuals (I) |
|
Kan, Meina | Institute of Computing Technology, Chinese Academy of Sciences |
Zhang, Lixuan | Institute of Computing Technology, Chinese Academy of Sciences |
Liang, Hao | Institute of Computing Technology, Chinese Academy of Sciences |
Zhang, Boyuan | Institute of Computing Technology, Chinese Academy of Sciences |
Fang, Minxue | Institute of Computing Technology, Chinese Academy of Sciences |
Liu, Dongyang | The Chinese University of Hongkong |
Shan, Shiguang | Chinese Academy of Sciences, Institute of Computing Technology |
Chen, Xilin | Institute of Computing Technology, Chinese Academy |
Keywords: Wearable Robotics, Vision-Based Navigation, Physically Assistive Devices
Abstract: Visually impaired individuals encounter significant challenges when walking and acting in unfamiliar environments, particularly in outdoor scenarios. The complexity of outdoor environments, characterized by diverse obstacles, traffic signals, and societal norms, poses substantial barriers to mobility of visually impaired individuals and makes long-distance walking especially arduous. Although GPS-based navigation systems can facilitate long-distance travel, they often suffer from location inaccuracies in urban areas and even completely fail indoors. Moreover, these systems lack the capability to provide detailed information about walkways and immediate surroundings, which are crucial for safe and efficient walking. To address these limitations, we introduce a proof-of-concept wearable navigation system named eLabrador, designed to assist visually impaired individuals in long-distance walking in unfamiliar outdoor environments. The eLabrador integrates public maps (e.g. Amap or Google Maps) and GPS for global route planning, while leveraging computational visual perception to provide precise and safe local guidance. This hybrid approach enables accurate and safe navigation for visually impaired individuals in outdoor scenarios. Specifically, the eLabrador utilizes a head-mounted RGB-D camera to capture environmental geometric terrain and objects in outdoor urban environments. These inputs are processed into a 3D semantic map, offering a detailed representation of the surrounding environment. The planning module then integrates this 3D semantic map with route information from the global map (i.e. Amap) to generate an optimized walking path. Finally, the interaction module utilizes the audio-haptic dual-channel to relay navigation instructions to visually impaired user. Together, these three modules work seamlessly to facilitate long-distance navigation for visually impaired individuals in outdoor environments. The eLabrador is evaluated with two real-world outdoor scenarios, involving 10 visually impaired and visually masked participants. The experiments show that eLabrador successfully guides visually impaired participants to their destinations in outdoor environments. Additionally, the eLabrador provides descriptive information about landmarks and other navigation cues, helping visually impaired users better understand their surroundings. Subjective evaluations further indicate that most participants felt a sense of safety and reported an acceptable cognitive load during navigation, indicating its usability and effectiveness.
|
|
10:50-10:55, Paper ThAT29.5 | |
Energy Reduction for Wearable Pneumatic Valve System with SINDy and Time-Variant Model Predictive Control (I) |
|
Lee, Hao | University of California, Los Angeles |
Ren, Ruoning | University of California, Los Angeles |
Qian, Yifei | University of California, Los Angeles |
Rosen, Jacob | University of California, Los Angeles |
Keywords: Wearable Robotics
Abstract: Pneumatic actuators are a popular choice for wearable robotics due to their high force-to-weight ratio and natural compliance, which allows them to absorb and reuse wasted energy during movement. However, traditional pneumatic control is energy inefficient and difficult to precisely control due to nonlinear dynamics, latency, and the challenge of quantifying mechanical properties. To address these issues, we developed a wearable pneumatic valve system with energy recycling capabilities and applied the sparse identification of nonlinear dynamics (SINDy) algorithm to generate a nonlinear delayed differential model from simple pressure measurements. Using first principles of thermal dynamics, SINDy was able to train time-variant delayed differential models of a solenoid valve-based pneumatic system and achieve good testing accuracy for two cases—increasing pressure and decreasing pressure, with training accuracies at 85.23% and 76.34% and testing accuracies at 87.66% and 77.66%, respectively. The generated model, when integrated with model predictive control (MPC), resulted in less than 5% error in pressure control. By using MPC for human assistive impedance control, the pneumatic actuator was able to output the desired force profile and recycle 85% of the energy used in negative work. These results demonstrate an energy-efficient and easily calibrated actuation scheme for designing assistive devices such as exoskeletons and orthoses.
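At the core of SINDy is a sparse regression over a library of candidate terms, commonly solved with sequentially thresholded least squares (STLSQ). The sketch below recovers a simple known linear ODE from sampled data with a plain NumPy STLSQ; it illustrates the algorithm class only, not the paper's delayed pressure model:

```python
import numpy as np

def stlsq(Theta, dXdt, threshold=0.1, iters=10):
    """Sequentially thresholded least squares, the core regression in SINDy."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dXdt[:, k], rcond=None)[0]
    return Xi

# Recover dx/dt = -0.5*x + 2*y and dy/dt = -2*x - 0.5*y from sampled trajectories.
t = np.linspace(0, 10, 2000)
x = np.exp(-0.5 * t) * np.cos(2 * t)
y = -np.exp(-0.5 * t) * np.sin(2 * t)
X = np.column_stack([x, y])
dXdt = np.gradient(X, t, axis=0)

# Candidate library: [1, x, y, x^2, x*y, y^2]
Theta = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
print(np.round(stlsq(Theta, dXdt), 2))
```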
|
|
10:55-11:00, Paper ThAT29.6 | |
Exosense: A Vision-Based Scene Understanding System for Exoskeletons |
|
Wang, Jianeng | University of Oxford |
Mattamala, Matias | University of Edinburgh |
Kassab, Christina | University of Oxford |
Burger, Guillaume | Wandercraft |
Elnecave Xavier, Fabio | MINES Paris / Wandercraft |
Zhang, Lintong | University of Oxford |
Petriaux, Marine | Wandercraft |
Fallon, Maurice | University of Oxford |
Keywords: Wearable Robotics, Prosthetics and Exoskeletons, RGB-D Perception
Abstract: Self-balancing exoskeletons are a key enabling technology for individuals with mobility impairments. While current challenges focus on human-compliant hardware and control, unlocking their use for daily activities requires a scene perception system. In this work, we present Exosense, a vision-centric scene understanding system for self-balancing exoskeletons. We introduce a multi-sensor visual-inertial mapping device as well as a navigation stack for state estimation, terrain mapping and long-term operation. We tested Exosense attached to both a human leg and Wandercraft's Personal Exoskeleton in real-world indoor scenarios. This enabled us to test the system during typical periodic walking gaits, as well as in multi-story environments representative of future use. We demonstrate that Exosense can achieve an odometry drift of about 4 cm per meter traveled, and construct terrain maps with under 1 cm average reconstruction error. It can also operate in a visual localization mode within a previously mapped environment, providing a step towards long-term operation of exoskeletons.
|
|
11:00-11:05, Paper ThAT29.7 | |
Leveraging Geometric Modeling-Based Computer Vision for Context Aware Control in a Hip Exosuit |
|
Tricomi, Enrica | Technical University of Munich |
Piccolo, Giuseppe | University of Naples Federico II |
Russo, Federica | University of Naples Federico II |
Zhang, Xiaohui | Heidelberg University |
Missiroli, Francesco | Heidelberg University |
Ferrari, Sandro | Technische Universität München |
Gionfrida, Letizia | King's College London |
Ficuciello, Fanny | Università Di Napoli Federico II |
Xiloyannis, Michele | Eidgenössische Technische Hochschule (ETH) Zürich |
Masia, Lorenzo | Technische Universität München (TUM) |
Keywords: Wearable Robots, Physically Assistive Devices, Modeling, Control, and Learning for Soft Robots, RGB-D Perception
Abstract: Human beings adapt their motor patterns in response to their surrounding context using visual inputs. This context-informed adaptive motor behavior has motivated growing interest in integrating computer vision algorithms into robotic assistive technologies for context-aware control. However, current methods mostly rely on data-driven approaches. In this study, we introduce a novel control framework for a hip exosuit that instead employs a physics-informed computer vision method, grounded in geometric modeling of the scene, to tune assistance during stair and level walking. Evaluating the controller with six subjects on a path comprising level walking and stairs, we achieved an overall detection accuracy of 93.0%. Computer vision-based assistance provided significantly greater metabolic benefits than non-vision-based assistance during stair ascent (-18.9% vs. -5.2±4.1%) and descent (-10.1% vs. -4.7%). The assistive torque showed a significant increase while ascending stairs (+33.95%) and a significant decrease while descending stairs (-17.38%), compared to a condition without vision-enabled assistance modulation.
|
|
11:05-11:10, Paper ThAT29.8 | |
A Variable Stiffness Supernumerary Robotic Limb with Pneumatic-Tendon Coupled Actuation |
|
Zhao, Mengcheng | Nanjing University of Aeronautics and Astronautics |
Xu, Jiajun | Nanjing University of Aeronautics and Astronautics |
Wang, Peixin | Nanjing University of Aeronautics and Astronautics, College of Me |
Zhou, Juanxia | Nanjing University of Aeronautics and Astronautics |
Zhang, Tianyi | Nanjing University of Aeronautics and Astronaut |
Huang, Kaizhen | Nanjing University of Aeronautics and Astronautics |
Ji, Aihong | Nanjing University of Aeronautics Ans Astronautics |
Hou, Xuyan | Harbin Institute of Technology |
Song, Guoli | Shenyang Institute of Automation, Chinese Academy of SciencesA |
Li, You-Fu | City University of Hong Kong |
Keywords: Wearable Robotics, Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: Supernumerary robotic limbs (SRLs) can assist humans in working efficiently and comfortably in daily life or industrial assembly scenarios. This requires SRLs to switch between rigidity and flexibility, performing compliant movements while also providing stable support that reduces fatigue from prolonged standing; existing SRLs struggle to achieve this transition. In this study, a variable stiffness supernumerary robotic limb (VSSRL) is implemented, capable of adjusting its position and stiffness through pneumatic-tendon coupled actuation. The position of the VSSRL is accurately modulated by tendons, while its stiffness is controlled by the coupled actuation: the tendons significantly increase the overall stiffness of the VSSRL, and the fiber-reinforced actuators (FRAs) dynamically adjust its stiffness in response to changes in dynamic loads. Furthermore, a kinematic model of the VSSRL and a stiffness model under the coupling of FRAs and tendons are developed. The trajectory and stiffness of the VSSRL during task execution are then assigned based on human motion, and a multi-objective control system for both position and stiffness is designed based on a reinforcement learning (RL) algorithm, achieving collaborative control of position and stiffness for the VSSRL. The accuracy of the control system is validated through experiments, which demonstrate that the load capacity of the VSSRL is significantly enhanced by the tendons and FRAs, and that the VSSRL can provide various modes of assistance for daily life activities.
|
|
ThAT30 |
106 |
Wheeled Robots 1 |
Regular Session |
|
10:30-10:35, Paper ThAT30.1 | |
A Robust Distributed Odometry for Mobile Robots with Steerable Wheels |
|
Xi, Wang | Shanghai Jiao Tong University |
Guo, Jiaming | Shanghai Jiao Tong University |
Wang, Chenyang | Shanghai Jiao Tong University |
Wu, Shukun | Shanghai Jiao Tong University |
He, Jianping | Shanghai Jiao Tong University |
Keywords: Wheeled Robots
Abstract: Odometry estimation remains a critical challenge for wheeled robots, as reducing its drift directly mitigates dependency on external localization systems. This paper proposes a distributed odometry framework for steerable wheels, named ICF-DO, which is applicable to both Steerable Wheeled Mobile Robots (SWMRs) and cooperative multi-single-wheel robot systems. The proposed method features low computational complexity and reduced drift, while demonstrating strong robustness in communication-restricted scenarios. Additionally, singularities can be handled in a distributed manner within the proposed framework. Experimental validation on a real physical SWMR platform demonstrates the effectiveness and practicality of the proposed method.
|
|
10:35-10:40, Paper ThAT30.2 | |
Motion Control of a Hybrid Self-Reconfigurable Wheel-Legged Dual-Arm Robot |
|
Zhang, Rui | Beijing Institute of Technology, School of Automation |
Du, Hong | Beijing Institute of Technology, School of Automation |
Qiu, Peng | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Song, Wenjie | Beijing Institute of Technology |
Keywords: Wheeled Robots, Multi-Robot Systems, Field Robots
Abstract: Current wheeled bipedal robots face significant mobility challenges when traversing discontinuous terrain such as gaps and step-like obstacles, and suffer from substantial dynamic inefficiencies. This paper presents a hybrid self-reconfigurable wheel-legged dual-arm robot equipped with an active docking mechanism, enabling transitions between wheeled bipedal and multi-wheel-legged configurations. Based on a self-developed robotic platform, this work addresses key control challenges in articulated multi-wheel-legged mode and proposes a novel distributed operation paradigm for wheeled bipedal robots. Each module utilizes its manipulators for stable grasping of elevated objects and collaborative tasks, while the multi-unit system achieves efficient, high-load, and stable locomotion. To manage the control complexities in multimodal operation, we develop a unified modular control architecture integrating Virtual Model Control (VMC) and Linear Quadratic Regulator (LQR). For the articulated multi-wheel-legged mode, a body-posture controller regulates global body configuration, and a turning controller adjusts the wheelbase and roll angle via distributed actuation to manage the passive degrees of freedom (DoF) at the articulation points. Experimental validation using a physical prototype confirms the effectiveness and practicality of the proposed approach.
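As an illustration of the LQR building block named above (not the robot's actual balance dynamics), the sketch below computes a discrete-time LQR gain for an assumed double-integrator model with placeholder cost weights.

```python
# Generic discrete-time LQR gain, illustrating the LQR component of such a controller.
# The double-integrator model and cost weights are placeholders, not the robot's dynamics.
import numpy as np
from scipy.linalg import solve_discrete_are

dt = 0.01
A = np.array([[1.0, dt], [0.0, 1.0]])   # state: [position, velocity]
B = np.array([[0.0], [dt]])             # input: acceleration command
Q = np.diag([10.0, 1.0])                # state cost
R = np.array([[0.1]])                   # input cost

P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # control law: u = -K @ x
print("LQR gain K:", K)
```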
|
|
10:40-10:45, Paper ThAT30.3 | |
Improbability Roller-2: A Hybrid Mobile Robot with Variable Diameter Transformable Wheels |
|
Moger, Gourav | Nazarbayev University |
Varol, Huseyin Atakan | Nazarbayev University |
Keywords: Wheeled Robots, Field Robots, Actuation and Joint Mechanisms
Abstract: Locomotion on unstructured terrain poses a significant challenge for wheeled mobile robots lacking reconfigurable mechanisms. Achieving both stability and agile motion in such environments requires a hybrid approach that leverages their adaptable nature to varying surface conditions while ensuring efficient mobility. In this research, we present Improbability Roller 2, a refined iteration of our hybrid mobile robot with variable-diameter wheels, simplifying the design while improving its maneuverability and adaptability on unstructured terrain. The new compliant outer wheel structure and its folding mechanism allow for a higher wheel size change ratio. With the combination of multimode steering, which integrates both differential drive control and steering based on wheel size disparity, the robot can now optimize locomotion on diverse terrain while maintaining traction. The robot was tested across various obstacles and multiple surface conditions to validate the effectiveness of the new wheel design and the dual steering strategy. Experiments, including slope and step climbing, confined space traversal, and locomotion on loose gravel and snow, demonstrated the robot's improved terrain adaptability, consistent traction, and control across varying surfaces.
|
|
10:45-10:50, Paper ThAT30.4 | |
M-Predictive Spliner: Enabling Spatiotemporal Multi-Opponent Overtaking for Autonomous Racing |
|
Imholz, Nadine | ETH |
Brunner, Maurice | ETH Zurich |
Baumann, Nicolas | ETH |
Ghignone, Edoardo | ETH |
Magno, Michele | ETH Zurich |
Keywords: Wheeled Robots, Motion and Path Planning, Field Robots
Abstract: Unrestricted multi-agent racing presents a significant research challenge, requiring decision-making at the limits of a robot's operational capabilities. While previous approaches have either ignored spatiotemporal information in the decision-making process or been restricted to single-opponent scenarios, this work enables arbitrary multi-opponent head-to-head racing while considering the opponents' future intent. The proposed method employs a Kalman Filter-based multi-opponent tracker to effectively perform opponent Re-Identification by associating them across observations. Simultaneously, spatial and velocity Gaussian Process Regression is performed on all observed opponent trajectories, providing predictive information to compute the overtaking maneuvers. This approach has been experimentally validated on a physical 1:10 scale autonomous racing car achieving an overtaking success rate of up to 91.65% and demonstrating an average 10.13%-point improvement in safety at the same speed as the previous State-of-the-Art. These results highlight its potential for high-performance autonomous racing.
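A minimal sketch of the Gaussian Process regression step described above, applied to an assumed toy trajectory rather than the paper's race data: the GP is fit on observed opponent progress over time and queried for future progress with uncertainty.

```python
# Gaussian Process regression over an observed opponent trajectory (illustrative sketch).
# The toy trajectory, kernel, and prediction horizon are assumptions for the example.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Observed opponent progress along the track centerline (time -> arc length s).
t_obs = np.linspace(0.0, 2.0, 20).reshape(-1, 1)
s_obs = 3.0 * t_obs.ravel() + 0.05 * rng.standard_normal(20)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-3),
                              normalize_y=True)
gp.fit(t_obs, s_obs)

# Predict future progress with uncertainty, usable for planning an overtaking maneuver.
t_future = np.linspace(2.0, 3.0, 10).reshape(-1, 1)
s_mean, s_std = gp.predict(t_future, return_std=True)
print(np.c_[t_future.ravel(), s_mean, s_std])
```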
|
|
10:50-10:55, Paper ThAT30.5 | |
Lywal-X: A Novel Wheel-Claw Quadruped Robot |
|
Shen, Hao | Tiangong University |
Yang, Yuxuan | Tiangong University |
Wang, Yiliang | School of Mechanical Engineering , Tiangong University |
Zuo, Xintian | Tiangong University |
Zhu, Hongwei | Tiangong University |
Wang, Jianming | Tiangong University |
Xiao, Xuan | Tiangong University |
Keywords: Dual Arm Manipulation, Wheeled Robots, Legged Robots
Abstract: This paper introduces a wheel-claw quadruped robot named Lywal-X, which is capable of omnidirectional movement as well as grasping actions. Firstly, the mechanical structure of Lywal-X is designed with a three-degree-of-freedom leg transformation mechanism and a two-degree-of-freedom wheel-claw structure. Then, movement strategies for different modes such as climbing and grasping are developed. Finally, the mobility performance of Lywal-X is analyzed, and physical experiments are conducted to verify the robot’s ability to pick up and transport target objects in both single-claw and double-claw modes.
|
|
10:55-11:00, Paper ThAT30.6 | |
Design and Control of SeparaTrek: A Hybrid Aerial-Ground Robot with Separable and Combinative Locomotion Parts |
|
Zhang, Yu | Beijing Institute of Technology |
Chen, Xuechao | Beijing Insititute of Technology |
Sun, Yanbo | Beijing Institute of Technology |
Keywords: Wheeled Robots, Aerial Systems: Mechanics and Control, Mechanism Design
Abstract: Hybrid aerial-ground robots combine ground mobility with aerial flight capability and are often designed for multi-terrain tasks. However, most existing hybrid aerial-ground robots integrate the ground and aerial functionality into a single monolithic system, leading to functional coupling that prevents full utilization of the multimodal locomotion capabilities. In this paper, we design a hybrid aerial-ground robot called SeparaTrek. SeparaTrek features separable and combinative ground and aerial locomotion parts joined by a free separation-and-combination structure, reducing the coupling between ground and aerial functionality. Furthermore, we design a multimodal locomotion controller based on an extended Kalman filter and adaptive sliding mode control, achieving stable locomotion of SeparaTrek on complex and variable terrain. Experiments demonstrate that SeparaTrek's design is sound and its motion is stable.
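For readers unfamiliar with the sliding-mode component mentioned above, here is a plain (non-adaptive) first-order sliding-mode tracking law on an assumed double-integrator plant with a bounded disturbance; the plant, gains, and boundary-layer smoothing are assumptions, not SeparaTrek's controller.

```python
# Plain sliding-mode tracking law (illustrative sketch, not the paper's adaptive version):
#   s = e_dot + lam * e,  u = u_eq - k * tanh(s / eps)
import numpy as np

lam, k, eps, dt = 4.0, 6.0, 0.05, 0.001
x, xd = 0.0, 0.0                                    # state: position, velocity
for step in range(3000):
    t = step * dt
    x_ref, xd_ref, xdd_ref = np.sin(t), np.cos(t), -np.sin(t)
    e, ed = x - x_ref, xd - xd_ref
    s = ed + lam * e                                 # sliding variable
    u = xdd_ref - lam * ed - k * np.tanh(s / eps)    # equivalent control + switching term
    xdd = u + 0.3 * np.sin(5 * t)                    # plant with bounded matched disturbance
    xd += xdd * dt
    x += xd * dt
print("final tracking error:", x - np.sin(3000 * dt))
```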
|
|
11:00-11:05, Paper ThAT30.7 | |
Steady-State Drifting Equilibrium Analysis of Single-Track Two-Wheeled Robots for Controller Design |
|
Jing, Feilong | Tsinghua University |
Deng, Yang | Tsinghua University |
Wang, Boyi | Tsinghua University |
Zheng, Xudong | Qiyuan Lab |
Sun, Yifan | Tsinghua University |
Chen, Zhang | Tsinghua University |
Liang, Bin | Tsinghua University |
Keywords: Wheeled Robots, Motion Control, Dynamics
Abstract: Drifting is an advanced driving technique where the wheeled robot's tire-ground interaction breaks the common non-holonomic pure rolling constraint. This allows high-maneuverability tasks like quick cornering, and steady-state drifting control enhances motion stability under lateral slip conditions. While drifting has been successfully achieved in four-wheeled robot systems, its application to single-track two-wheeled (STTW) robots, such as unmanned motorcycles or bicycles, has not been thoroughly studied. To bridge this gap, this paper extends the drifting equilibrium theory to STTW robots and reveals the mechanism behind the steady-state drifting maneuver. Notably, the counter-steering drifting technique used by skilled motorcyclists is explained through this theory. In addition, an analytical algorithm based on intrinsic geometric and kinematic relationships is proposed, reducing the computation time by four orders of magnitude while maintaining less than 6% error compared to numerical methods. Based on the equilibrium analysis, a model predictive controller (MPC) is designed to achieve steady-state drifting and transitions between equilibrium points, with its effectiveness and robustness validated through simulations.
|
|
ThBT1 |
401 |
Intention Recognition 2 |
Regular Session |
|
13:20-13:25, Paper ThBT1.1 | |
Landmark-Based Goal Recognition for Shared Autonomy: A Framework for Enhanced Teleoperation |
|
Lorthioir, Guillaume | AIST |
Benallegue, Mehdi | AIST Japan |
Cisneros Limon, Rafael | National Institute of Advanced Industrial Science and Technology |
Ramirez-Alpizar, Ixchel Georgina | National Institute of Advanced Industrial Science and Technology |
Keywords: Intention Recognition, Telerobotics and Teleoperation, Probabilistic Inference
Abstract: Shared autonomy is the future of teleoperation as it reduces the operator’s burden, enhances capabilities, and improves embodiment by offering seamless control of the robot. However, it remains rarely used, particularly with humanoid robots, as it faces numerous challenges. In this work, we introduce an innovative shared autonomy framework suitable for a wide range of robots, which we tested on a humanoid robot. This framework leverages Bayesian filtering over a Hidden Markov Model (HMM) to perform goal recognition, employing a landmark-based heuristic that minimizes computational demands while computing observation likelihoods without prior knowledge or a cost function. Once the operator’s goal is identified, the robot assists according to its confidence level in the goal prediction. Assistance is provided by guiding the robot's end-effector to reach a specified target position and orientation. In experiments with a diverse group of 10 teleoperators, conducted with video transmission delay, we achieved high accuracy in goal prediction and demonstrated significantly reduced teleoperation time with shared autonomy.
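The sketch below shows the recursive Bayesian goal-inference pattern underlying this kind of filter: a belief over a discrete goal set is updated each step from an observation likelihood. The motion-alignment likelihood, goal positions, and sharpness parameter are assumptions for illustration, not the paper's landmark-based heuristic.

```python
# Recursive Bayesian goal inference over a discrete goal set (illustrative sketch).
# The alignment-based likelihood and the goal positions below are assumptions.
import numpy as np

def update_belief(belief, ee_pos, ee_vel, goals, beta=2.0):
    """One Bayes update: p(g | motion) proportional to p(motion | g) * p(g)."""
    to_goals = goals - ee_pos                          # vectors toward each candidate goal
    to_goals /= np.linalg.norm(to_goals, axis=1, keepdims=True) + 1e-9
    v = ee_vel / (np.linalg.norm(ee_vel) + 1e-9)
    alignment = to_goals @ v                           # cosine similarity in [-1, 1]
    likelihood = np.exp(beta * alignment)              # softmax-style observation model
    posterior = likelihood * belief
    return posterior / posterior.sum()

goals = np.array([[0.5, 0.2, 0.3], [0.4, -0.3, 0.3], [0.6, 0.0, 0.5]])
belief = np.ones(3) / 3                                # uniform prior over goals
belief = update_belief(belief, ee_pos=np.array([0.0, 0.0, 0.3]),
                       ee_vel=np.array([0.05, 0.02, 0.0]), goals=goals)
print("posterior over goals:", belief)
```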
|
|
13:25-13:30, Paper ThBT1.2 | |
Recommendation Navigation Based on User Information Using VLM |
|
Kwak, DaeWon | Kyunghee.uni |
Rim, Hyunwoo | Kyung Hee University |
Kim, Hyunwoo | Kyung-Hee University |
Kim, Donghan | Kyung Hee University |
Keywords: Intention Recognition, Multi-Modal Perception for HRI, Service Robotics
Abstract: In this paper, we propose a novel recommendation-based path planning system that leverages VLM and LLM to interpret user intentions. The system infers user preferences through both conversational and behavioral data, thereby delivering personalized navigation and guidance services within complex consumer environments. The LLM component is designed to deduce user intent even in the absence of direct item references by utilizing higher-level conceptual cues, while the VLM component analyzes images of user behavior to extract contextual information. A virtual museum simulation was implemented using Isaac Sim, and a metadata dataset for exhibits was constructed to validate the system’s performance. Experimental results demonstrate that the proposed system effectively interprets user intent and generates optimized pathways. Future work will focus on extending the system to consumer spaces such as department stores and supermarkets—areas where conventional 2D semantic maps are inadequate—by exploring topology-based mapping solutions. Ultimately, this research aims to revolutionize user experience by enabling personalized robotic services in consumer environments.
|
|
13:30-13:35, Paper ThBT1.3 | |
Physics-Embedded Neural Networks for sEMG-Based Continuous Motion Estimation |
|
Heng, Wending | University of Manchester |
Liang, Chaoyuan | The University of Manchester |
Zhao, Yihui | Sichuan University |
Zhang, Zhiqiang | Imperial College London |
Cooper, Glen | University of Manchester |
Li, Zhenhong | University of Manchester |
Keywords: Intention Recognition, Neurorobotics, AI-Enabled Robotics
Abstract: Accurately decoding human motion intentions from surface electromyography (sEMG) is essential for myoelectric control and has wide applications in rehabilitation robotics and assistive technologies. However, existing sEMG-based motion estimation methods often rely on subject-specific musculoskeletal (MSK) models that are difficult to calibrate, or purely data-driven models that lack physiological constraints. This paper introduces a novel Physics-Embedded Neural Network (PENN) that combines interpretable MSK forward-dynamics with data-driven residual learning, thereby preserving physiological consistency while ensuring accurate motion estimation. The PENN employs a recursive temporal structure to propagate historical estimates and a lightweight convolutional neural network for residual correction, leading to robust and temporally coherent estimations. A two-phase training strategy is designed for PENN. Experimental evaluations on six healthy subjects show that PENN outperforms state-of-the-art baseline methods in both root mean square error (RMSE) and R² metrics.
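To illustrate the physics-plus-residual pattern in general terms (a sketch, not the authors' PENN architecture), the code below combines an assumed toy 1-DOF musculoskeletal-style forward step with a small learned residual correction; the "physics" term, channel count, and layer sizes are all assumptions.

```python
# Generic physics-plus-residual estimator sketch (not the paper's network):
# a toy forward-dynamics step corrected by a small neural residual.
import torch
import torch.nn as nn

def physics_step(q, qd, activation, dt=0.01):
    """Toy 1-DOF forward dynamics: activation drives torque against damping/stiffness."""
    tau = 5.0 * activation - 0.5 * qd - 2.0 * q
    qdd = tau / 0.1                       # assumed inertia
    qd_next = qd + qdd * dt
    return q + qd_next * dt, qd_next

class ResidualNet(nn.Module):
    def __init__(self, n_channels=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_channels + 2, 32), nn.ReLU(), nn.Linear(32, 1))
    def forward(self, emg, q, qd):
        return self.net(torch.cat([emg, q, qd], dim=-1)).squeeze(-1)

emg = torch.rand(1, 8)                    # one window of 8-channel sEMG features (assumed)
q, qd = torch.zeros(1), torch.zeros(1)
q_phys, qd_phys = physics_step(q, qd, activation=emg.mean(dim=-1))
residual = ResidualNet()(emg, q_phys.unsqueeze(-1), qd_phys.unsqueeze(-1))
q_est = q_phys + residual                 # physics prediction corrected by learned residual
print("estimated joint angle:", q_est.item())
```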
|
|
13:35-13:40, Paper ThBT1.4 | |
Edit Distance Based Intention Estimation for Teleoperated Assembly |
|
Xu, Aolin | HRI |
Li, Songpo | Honda Research Institute |
Baskaran, Prakash | Honda Research Institute |
Iba, Soshi | Honda Research Institute USA |
Dariush, Behzad | Honda Research Institute USA |
Keywords: Intention Recognition, Telerobotics and Teleoperation, Probability and Statistical Methods
Abstract: We address the problem of intention estimation in human-robot teleoperation, which involves identifying the task being completed and predicting the next actions. Our approach sequentially quantifies the similarity between the observed action sequence and nominal action sequences representing possible tasks using the edit distance metric, and employs the nearest neighbor rule for task estimation as well as action prediction. The key advantage of our approach lies in its robustness to deviations in operator actions and action recognition errors, which are frequently encountered in real-world teleoperation settings. Through extensive experiments with both real and simulated data, we demonstrate that our method largely outperforms alternative approaches, including a probabilistic graphical model-based method and transformer-based methods, particularly in scenarios with significant action deviations or action recognition errors. Additionally, we construct task distance matrices to analyze task similarities and potential confusion points, offering insights into when and where estimation errors are likely to occur. This analysis can guide the design of more distinctive task sequences and further improve the reliability of teleoperated robotic systems.
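A minimal sketch of edit-distance nearest-neighbor task estimation of the kind described above; the action alphabet and nominal sequences are invented placeholders, not the paper's assembly tasks.

```python
# Edit-distance (Levenshtein) nearest-neighbor task estimation (illustrative sketch).
def edit_distance(a, b):
    """Classic Levenshtein distance between two action sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

nominal_tasks = {                                   # placeholder nominal sequences
    "mount_bracket": ["pick_bracket", "align", "insert_bolt", "tighten"],
    "attach_panel":  ["pick_panel", "align", "clip", "inspect"],
}
observed = ["pick_bracket", "align", "tighten"]     # noisy / partial observation

task = min(nominal_tasks, key=lambda k: edit_distance(observed, nominal_tasks[k]))
print("estimated task:", task)
```

The next action can then be read off the matched nominal sequence, which is what makes the rule robust to skipped or misrecognized actions.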
|
|
13:40-13:45, Paper ThBT1.5 | |
EXplainable Intention Estimation in Teleoperated Manipulation Using Deep Dynamic Graph Neural Networks |
|
Baskaran, Prakash | Honda Research Institute |
Liu, Xiao | Arizona State University |
Li, Songpo | Honda Research Institute |
Iba, Soshi | Honda Research Institute USA |
Keywords: Intention Recognition, Deep Learning Methods, Telerobotics and Teleoperation
Abstract: Shared autonomy can improve teleoperating robotic systems in complex manufacturing and assembly tasks by combining human decision-making and robotic capabilities. A key aspect of seamless collaboration and trust in shared autonomy is the robot’s ability to interpret human intentions in a consistent and explainable manner. To achieve this, a graph neural network-based intention estimation framework is introduced, which generates dynamic graphs that capture spatial relationships evolving over time. The framework predicts human intentions at two hierarchical levels: low-level actions and high-level tasks. Furthermore, we empirically and anecdotally verify the correctness and consistency of the predictions using explainability metrics. The algorithm is demonstrated by teleoperating a bi-manual robot to assemble various block structures in a virtual reality simulation environment.
|
|
13:45-13:50, Paper ThBT1.6 | |
A Probabilistic Programming Approach to Intention Estimation in Human-Robot Teleoperated Assembly Tasks |
|
Xu, Aolin | HRI |
Li, Songpo | Honda Research Institute |
Baskaran, Prakash | Honda Research Institute |
Patel, Karankumar | Honda Research Institute |
Iba, Soshi | Honda Research Institute USA |
Dariush, Behzad | Honda Research Institute USA |
Keywords: Telerobotics and Teleoperation, Intention Recognition, Probability and Statistical Methods
Abstract: We propose a new approach to solving the problem of intention estimation in human-robot teleoperation for assembly tasks, which includes task estimation and action prediction. Our approach uses probabilistic graphical models to represent the joint distribution of the task and the actions to be taken to complete the task. Both model learning and inference are implemented with Pyro, a state-of-the-art probabilistic programming language. The distinctive feature relative to traditional hidden Markov model-style probabilistic methods is that our model takes time information into account and explicitly models the individual distributions of all the variables under consideration. By doing this, we fully utilize the power of probabilistic programming and achieve accurate distribution, and hence uncertainty, estimates. Working with a pretrained action recognition module, the proposed model can be trained solely on a tiny instruction manual of the assembly tasks and can be retrained with minimal overhead whenever the manual is changed or augmented, avoiding the costly data re-annotation and retraining required by end-to-end learning-based methods. We also compare our method with a transformer-based model trained directly on the instruction manual, and our method shows superior accuracy in both intention estimation and its distribution estimates. We additionally identify failure cases of both our method and the transformer-based method, and envision methods for improvement.
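The sketch below gives a minimal Pyro generative model in the spirit of a joint task/action distribution (requires the pyro-ppl package); the task names, probabilities, and sequence length are assumptions, and posterior inference (e.g. via Pyro's conditioning and inference utilities) is omitted for brevity. It is not the authors' model.

```python
# Minimal Pyro sketch of a task prior with task-conditioned action emissions (assumed values).
import torch
import pyro
import pyro.distributions as dist

action_probs = torch.tensor([
    [0.7, 0.2, 0.1],   # task 0: mostly emits action 0
    [0.1, 0.7, 0.2],   # task 1: mostly emits action 1
])

def assembly_model(num_steps=4):
    task = pyro.sample("task", dist.Categorical(torch.tensor([0.5, 0.5])))
    actions = []
    for t in range(num_steps):
        a = pyro.sample(f"action_{t}", dist.Categorical(action_probs[task]))
        actions.append(int(a))
    return int(task), actions

print(assembly_model())   # draw one (task, action sequence) from the generative model
```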
|
|
13:50-13:55, Paper ThBT1.7 | |
EEG-Based Motor Imagery Classification with Tuned Heuristic Fusion Graph Convolutional Network for Rehabilitation Training (I) |
|
Shi, Kecheng | The School of Automation Engineering, University of Electronic S |
Huang, Rui | University of Electronic Science and Technology of China |
Lyu, Jianzhi | University of Hamburg |
Li, Zhe | University of Electronic Science and Technology of China |
Mu, Fengjun | University of Electronic Science and Technology of China |
Peng, Zhinan | Unversity of Electronic Science and Tehcnology of China |
Zou, Chaobin | University of Electronic Science and Technology of China |
Cheng, Hong | University of Electronic Science and Technology |
Zhang, Jianwei | University of Hamburg |
Ghosh, Bijoy | Texas Tech University |
Keywords: Intention Recognition, Brain-Machine Interfaces, Rehabilitation Robotics
Abstract: Motor imagery-based brain–computer interfaces (MI-BCIs) hold significant promise for rehabilitation training in individuals with neurological impairments such as stroke and spinal cord injury (SCI). Achieving precise and robust lower limb movement prediction for each patient is crucial. However, the variability in MI response frequencies and brain activation patterns among subjects presents a great challenge to the generalizability of MI-BCIs. This paper proposes a Tuned Heuristic Fusion Graph Convolutional Network (THFGCN) for limb movement prediction in rehabilitation scenarios. THFGCN introduces a learnable EEG frequency-band tuning module and a heuristic spatial topology module. These two modules allow for the intricate extraction of both frequency and spatial topological features, utilizing graph adjacency matrices that encapsulate channel correlations and spatial relationships, hence fostering individualized analysis and enhanced generalizability across subjects. Furthermore, a spatio-temporal convolution module paired with a feature map attention mechanism is proposed to extract the critical spatio-temporal features of electroencephalogram (EEG) data. Validation experiments on the PhysioNet and LLM-BCImotion datasets against six mainstream methods demonstrate that THFGCN outperforms state-of-the-art methods, achieving 88.41% and 82.82% accuracy in the within-subject case, and 65.93% and 60.56% accuracy in the cross-subject case, respectively. Detailed frequency-band weight and t-distributed Stochastic Neighbor Embedding visualizations validate the effectiveness of the proposed modules. Furthermore, feature interpretability analysis confirms the extracted features' strong MI-task relevance, underlining THFGCN's exceptional interpretability.
|
|
ThBT2 |
402 |
Industrial Robots and Actuators 2 |
Regular Session |
|
13:20-13:25, Paper ThBT2.1 | |
DuLoc: Life-Long Dual-Layer Localization in Changing and Dynamic Expansive Scenarios |
|
Jiang, Haoxuan | The Hong Kong University of Science and Technology (Guangzhou) |
Qian, Peicong | Unity Drive Innovation Technology |
Xie, Yusen | The Hong Kong University of Science and Technology (Guangzhou) |
Li, Xiaocong | Eastern Institute of Technology, Ningbo |
Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Industrial Robots, Localization, SLAM
Abstract: LiDAR-based localization serves as a critical component in autonomous systems, yet existing approaches face persistent challenges in balancing repeatability, accuracy, and environmental adaptability. Traditional point cloud registration methods relying solely on offline maps often exhibit limited robustness against long-term environmental changes, leading to localization drift and reliability degradation in dynamic real-world scenarios. To address these challenges, this paper proposes DuLoc, a robust and accurate localization method that tightly couples LiDAR-inertial odometry with offline map-based localization, incorporating a constant-velocity motion model to mitigate outlier noise in real-world scenarios. Specifically, we develop a LiDAR-based localization framework that seamlessly integrates a prior global map with dynamic real-time local maps, enabling robust localization in unbounded and changing environments. Extensive real-world experiments in an expansive, unbounded port environment, involving 2,856 hours of operational data across 32 Intelligent Guided Vehicles (IGVs), are conducted and reported in this study. The results attained demonstrate that our system outperforms other state-of-the-art LiDAR localization systems in large-scale changing outdoor environments.
|
|
13:25-13:30, Paper ThBT2.2 | |
Analysis and Experiment of a Pneumatic Linear Actuator Actuated by Both Positive and Negative Pressures |
|
Ni, Weijian | University of Science and Technology Beijing |
Hao, Yufei | University of Science and Technology Beijing |
Bao, Lei | Beijing Soft Robot Tech Co., Ltd |
Shan, Xuemei | Beijing Soft Robot Tech Co., Ltd |
Zhang, Jianhua | University of Science and Technology Beijing |
Keywords: Hydraulic/Pneumatic Actuators
Abstract: This paper establishes an analytical model for a dual-pressure-actuated pneumatic linear actuator, investigating the relationship between the output force of the linear actuator and both the pressure differential and displacement. Experiments were designed to validate the model. The maximum output force of the linear actuator under negative pressure (-40 kPa) is 100 N, while under hybrid air pressure (negative pressure -40 kPa combined with positive pressure 40 kPa), the maximum output force significantly increases to approximately 210 N, demonstrating that dual-pressure driving can substantially enhance output performance. The analytical results exhibit excellent agreement with experimental data under low-pressure conditions, with a maximum relative error of only 5%. Furthermore, comparisons with a flexible bellows of the same dimensions confirm that the linear actuator also exhibits high stiffness. Finally, potential applications of the linear actuator in daily life are discussed.
|
|
13:30-13:35, Paper ThBT2.3 | |
Multimodal Task Attention Residual Reinforcement Learning: Advancing Robotic Assembly in Unstructured Environment |
|
Lin, Ze | South China University of Technology |
Wang, Chuang | South China University of Technology |
Wu, Sihan | South China University of Technology |
Xie, Longhan | South China University of Technology |
Keywords: Industrial Robots, Computer Vision for Manufacturing, Compliant Assembly
Abstract: Robotic assembly in dynamic and unstructured environments poses challenges for recent methods, due to background noise and wide-ranging errors. Directly learning from the environment relies on complex models and extensive training iterations to adapt. Representation selection approaches, which depend on expert knowledge, can reduce training costs but suffer from poor robustness and high manual costs, limiting scalability. In response, this paper proposes a system that integrates task attention into residual reinforcement learning to address these challenges. By effectively segmenting task-relevant information from the background to leverage task attention, our approach mitigates the impact of environmental variability. Additionally, compared with existing baselines, our task attention mechanism based on instance segmentation and prompt-guided selection does not require additional offline training or local fine-tuning. Experimental evaluations conducted in both simulated and real environments demonstrate the superiority of our method over various baselines. Specifically, our system achieves high efficiency and effectiveness in learning and executing assembly tasks in dynamic and unstructured environments.
|
|
13:35-13:40, Paper ThBT2.4 | |
Evetac Meets Sparse Probabilistic Spiking Neural Network: Enhancing Snap-Fit Recognition Efficiency and Performance |
|
Fang, Senlin | Shenzhen Institute of Advanced Technology |
Ding, Haoran | City University of Hong Kong |
Liu, Yangjun | University of Macau |
Liu, Jiashu | Shenzhen Institute of Advanced Technology |
Zhang, Yupo | Southern University of Science and Technology |
Li, Yilin | Shenzhen Institute of Advanced Technology |
Kong, Hoiio | City University of Macau |
Yi, Zhengkun | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Keywords: Industrial Robots, Assembly, Deep Learning Methods
Abstract: Snap-fit peg-in-hole assembly is common in industrial robotics, particularly for 3C electronics, where fast and accurate tactile recognition is crucial for protecting fragile components. Event-based optical sensors, such as Evetac, are well-suited for this task due to their high sparsity and sensitivity in detecting small, rapid force changes. However, existing research often converts event data into dense images and processes them with dense methods, leading to higher computational complexity. In this letter, we propose a Sparse Probabilistic Spiking Neural Network (SPSNN) that utilizes sparse convolutions to extract features from the event data, avoiding computations on non-zero cells. We introduce the Forward and Backward Propagation Through Probability (FBPTP) method, which enables simultaneous gradient computation across all time steps, eliminating the need for the step-by-step traversal required by traditional Forward and Backward Propagation Through Time (FBPTT). Additionally, the Temporal Weight Prediction (TWP) method dynamically allocates weights for different time outputs, enhancing recognition performance with minimal impact on model efficiency. We integrate the Evetac sensor compactly into our robotic system and collected two datasets, named Tactile Event Ethernet (TacEve-Eth) and Tactile Event Type-C (TacEve-TC), corresponding to cantilever and annular snap-fit structures. Experiments show that the SPSNN achieves a superior trade-off between recognition performance and efficiency compared to other widely used methods, achieving the highest average recognition performance while reducing inference time by over 90% compared to FBPTT-based dense SNN baselines.
|
|
13:40-13:45, Paper ThBT2.5 | |
Dual-Tool Distance Constraint for Robot Length Parameter Identification in Confined Calibration Space (I)
|
Liu, Fei | Kunming University of Science and Technology |
Na, Jing | Kunming University of Science & Technology |
Gao, Guanbin | Kunming University of Science and Technology |
Hou, Cheng | Soochow Unversity |
Keywords: Industrial Robots, Kinematics
Abstract: In robotic automation, compact layouts and the limited range of measurement devices often confine the robot's calibration space, leading to ill-conditioned Jacobian matrices for length parameters and significant estimation errors. To address this issue, joint configurations in the confined calibration space are modeled as perturbations around those under a fixed end-effector position. By applying matrix perturbation theory, the relationship between the smallest singular value of the identification matrix and the joint configuration perturbations is established. Building on this theoretical basis, a novel dual-tool distance constraint method is introduced to alleviate the ill-conditioning of the identification matrix and enhance parameter identification accuracy. The proposed method was validated using an ABB IRB 4600 industrial robot. Experimental results in the confined calibration space demonstrate that, compared with conventional methods, the proposed approach not only mitigates ill-conditioning but also significantly improves the robot's positioning accuracy.
|
|
13:45-13:50, Paper ThBT2.6 | |
Data Privacy Protection Diagnostic Algorithm for Industrial Robot Joint Harmonic Reducers Based on Swarm Learning (I) |
|
Huang, Haodong | Harbin Institute of Technology (Shenzhen) |
Sun, Shilong | Harbin Institute of Technology Shenzhen |
Wang, Dong | Shanghai Jiaotong University |
Xu, Wenfu | Harbin Institute of Technology, Shenzhen |
Keywords: Industrial Robots, Deep Learning Methods
Abstract: Harmonic reducers play a crucial role in industrial robots. Their high load capacity and low friction performance make them highly favored. However, obtaining a large amount of high-quality data on all factory faults is not easy in actual industrial applications. At the same time, data sharing between factories is limited due to privacy concerns. To address this challenge, this article proposes an innovative solution by integrating convolutional neural networks (CNNs) into a swarm learning (SL) framework. In this framework, multiple factories act as edge computing nodes, sharing data features through the fusion of network parameters without directly sharing the data itself. First, we use CNNs to train each node and select a decision-maker before training to merge the model parameters. Secondly, the decision-maker chosen by SL collects the models from other nodes. Finally, the decision-maker disseminates the integrated model to the other nodes. We validated the proposed method using a harmonic reducer dataset and confirmed its reliability. The experimental results show that the proposed framework can improve computational efficiency without relying on a central server, and the shared model can also improve the fault diagnosis accuracy of each edge node.
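The sketch below illustrates the parameter-merging step a swarm-learning decision-maker performs: edge-node models are fused by averaging their weights, so raw data never leaves each factory. The tiny CNN, the number of nodes, and plain averaging as the fusion rule are assumptions for the example, not the paper's exact scheme.

```python
# Parameter merging across edge nodes without sharing raw data (illustrative sketch).
import copy
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Placeholder 1D-CNN standing in for each factory's locally trained diagnostic model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 8, 5), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 2))
    def forward(self, x):
        return self.net(x)

node_models = [TinyCNN() for _ in range(3)]        # locally trained models from 3 nodes

def merge_state_dicts(models):
    """Average the parameters of all node models, key by key."""
    merged = copy.deepcopy(models[0].state_dict())
    for key in merged:
        merged[key] = torch.stack([m.state_dict()[key].float() for m in models]).mean(0)
    return merged

global_model = TinyCNN()
global_model.load_state_dict(merge_state_dicts(node_models))   # broadcast back to the nodes
```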
|
|
13:50-13:55, Paper ThBT2.7 | |
Parameter Selections and Applications for Soft Bellows Actuators (SBAs) with Various Performance Metrics |
|
Zou, Wenjing | Tsinghua University |
Li, Zhekai | University of Science and Technology Beijing |
Xiao, Ziting | Uestc |
Zheng, KaiLan | University of Electronic Science and Technology of China |
Lin, Chao | Tsinghua University |
Haolin, Chen | Tsinghua University |
Yu, Peifeng | Tsinghua University, |
Niu, Yi | University of Electronic Science and Technology of China |
Jiang, Jing | University of Electronic Science and Technology of China |
Wang, Chao | Tsinghua University |
Keywords: Hydraulic/Pneumatic Actuators, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: Soft bellows actuators (SBAs), a particular type of soft pneumatic actuators (SPAs), are widely used in various applications, such as climbing robots, industrial grippers, and wearable devices. Despite their advantages of uniform motion and high efficiency, the design of SBAs often relies on experiential methods rather than standardized guidelines. This results in unclear optimization pathways and a misalignment between SBA performance and specific application requirements. This study identifies six critical parameters of linear pneumatic SBAs: Shore hardness (SH), number of units (N), thickness (t), mid-diameter (Rm), unit width (x), and unit depth (h). We explore how these parameters influence load capacity, displacement efficiency, and bending resistance. Experimental findings indicate that increasing SH, t, x, and h and decreasing N enhance load capacity. Moreover, increases in N, Rm, x, and h, along with decreases in SH and t, improve displacement efficiency. Furthermore, enhancing SH, t, and Rm and reducing N, x, and h strengthen bending resistance. Based on these insights, we design three types of SBAs tailored to specific tasks, which are implemented in a high-load pneumatic gripper, a high-efficiency displacement table, and a pneumatic worm-inspired climbing robot. This research contributes to the targeted design of SBAs, offering a novel approach for the effective optimization and performance prediction of particular SPAs, thereby facilitating the broader application of soft robots.
|
|
13:55-14:00, Paper ThBT2.8 | |
OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots |
|
Atar, Soofiyan | University of California San Diego |
Li, Yi | University of Washington |
Grotz, Markus | University of Washington (UW) |
Fox, Dieter | University of Washington |
Smith, Joshua R. | University of Washington |
Keywords: Industrial Robots, Inventory Management, Perception for Grasping and Manipulation
Abstract: In warehouse environments, robots require robust picking capabilities to manage a wide variety of objects. Effective deployment demands minimal hardware, strong generalization to new products, and resilience in diverse settings. Current methods often rely on depth sensors for structural information, which suffer from high costs, complex setups, and technical limitations. Inspired by recent advancements in computer vision, we propose an innovative approach that leverages foundation models to enhance suction grasping using only RGB images. Trained solely on a synthetic dataset, our method generalizes its grasp prediction capabilities to real-world robots and a diverse range of novel objects not included in the training set. Our network achieves an 82.3% success rate in real-world applications.
|
|
ThBT3 |
403 |
Physical Human-Robot Interaction 2 |
Regular Session |
Chair: Li, Xiang | Tsinghua University |
|
13:20-13:25, Paper ThBT3.1 | |
A Null Space Compliance Approach for Maintaining Safety and Tracking Performance in Human-Robot Interactions |
|
Yang, Zi-Qi | University of Western Ontario |
Wang, Miaomiao | Huazhong University of Science and Technology |
R. Kermani, Mehrdad | University of Western Ontario |
Keywords: Physical Human-Robot Interaction, Compliance and Impedance Control, Safety in HRI
Abstract: In recent years, the focus on developing robot manipulators has shifted towards prioritizing safety in Human-Robot Interaction (HRI). Impedance control is a typical approach for interaction control in collaboration tasks. However, such a control approach has two main limitations: 1) the end-effector (EE) has limited compliance to adapt to unknown physical interactions, and 2) the robot body cannot compliantly adapt to unknown physical interactions. In this work, we present an approach to address these drawbacks. We introduce a modified Cartesian impedance control method combined with a Dynamical System (DS)-based motion generator, aimed at enhancing the interaction capability of the EE without compromising main task tracking performance. This approach enables human coworkers to interact with the EE on-the-fly, e.g. for tool changeover, after which the robot compliantly resumes its task. Additionally, combining it with a new null space impedance control method enables the robot body to exhibit compliant behaviour in response to interactions, avoiding serious injury from accidental contact while mitigating the impact on main task tracking performance. Finally, we prove the passivity of the system and validate the proposed approach through comprehensive comparative experiments on a 7 Degree-of-Freedom (DOF) KUKA LWR IV+ robot.
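To make the null-space idea concrete, here is the classic kinematic redundancy-resolution sketch (the velocity-level analogue of the torque-level null space impedance scheme above): a secondary motion is projected into the Jacobian null space so it does not disturb the end-effector task. The Jacobian and secondary velocity are random placeholders, not the KUKA model.

```python
# Kinematic null-space redundancy resolution (illustrative sketch):
#   qdot = J^+ @ xdot_task + (I - J^+ @ J) @ qdot_secondary
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((6, 7))                 # 6D task, 7-DOF arm (placeholder Jacobian)
xdot_task = np.array([0.05, 0, 0, 0, 0, 0])     # desired end-effector twist
qdot_secondary = 0.1 * rng.standard_normal(7)   # e.g. compliant response to body contact

J_pinv = np.linalg.pinv(J)
N = np.eye(7) - J_pinv @ J                      # null-space projector
qdot = J_pinv @ xdot_task + N @ qdot_secondary

# The projected secondary motion leaves the task velocity untouched (error ~ 0).
print("task-space error of the projected command:", np.linalg.norm(J @ qdot - xdot_task))
```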
|
|
13:25-13:30, Paper ThBT3.2 | |
Stereo Hand-Object Reconstruction for Human-To-Robot Handover |
|
Pang, Yik Lung | Queen Mary University of London |
Xompero, Alessio | Queen Mary University of London |
Oh, Changjae | Queen Mary University of London |
Cavallaro, Andrea | Queen Mary University of London |
Keywords: Physical Human-Robot Interaction, Deep Learning for Visual Perception, Perception for Grasping and Manipulation
Abstract: Jointly estimating hand and object shape facilitates the grasping task in human-to-robot handovers. However, relying on hand-crafted prior knowledge about the geometric structure of the object fails when generalising to unseen objects, and depth sensors fail to detect transparent objects such as drinking glasses. In this work, we propose a stereo-based method for hand-object reconstruction that combines single-view reconstructions probabilistically to form a coherent stereo reconstruction. We learn 3D shape priors from a large synthetic hand-object dataset to ensure that our method is generalisable, and use RGB inputs to better capture transparent objects. We show that our method reduces the object Chamfer distance compared to existing RGB-based hand-object reconstruction methods in single-view and stereo settings. We process the reconstructed hand-object shape with a projection-based outlier removal step and use the output to guide a human-to-robot handover pipeline with wide-baseline stereo RGB cameras. Our hand-object reconstruction enables a robot to successfully receive a diverse range of household objects from the human.
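For reference, the Chamfer distance used as the evaluation metric above can be computed as the symmetric mean nearest-neighbor distance between two point sets; in this sketch the random clouds stand in for reconstructed and ground-truth surfaces.

```python
# Chamfer distance between two point clouds (illustrative sketch of the metric).
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p, q):
    """Symmetric mean nearest-neighbor distance between point sets p and q (N x 3)."""
    d_pq, _ = cKDTree(q).query(p)   # each point in p -> nearest point in q
    d_qp, _ = cKDTree(p).query(q)   # each point in q -> nearest point in p
    return d_pq.mean() + d_qp.mean()

rng = np.random.default_rng(0)
recon = rng.standard_normal((500, 3))                    # placeholder reconstruction
truth = recon + 0.01 * rng.standard_normal((500, 3))     # slightly perturbed ground truth
print("Chamfer distance:", chamfer_distance(recon, truth))
```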
|
|
13:30-13:35, Paper ThBT3.3 | |
Automated Repositioning from Supine to Lateral with a Humanoid Robot Based on Body Modeling |
|
Matsumura, Misa | The University of Tokyo |
Miyake, Tamon | Waseda University |
Choi, Woohyeok | University of Tokyo |
Sugano, Shigeki | Waseda University |
Nakagawa, Keiichi | The University of Tokyo |
Kobayashi, Etsuko | The University of Tokyo |
Keywords: Safety in HRI, Physical Human-Robot Interaction, Human-Aware Motion Planning
Abstract: The application of humanoid robots is gaining attention as a solution to the caregiver shortage caused by an aging population. In this study, we automated the process of changing a patient's body position from supine to lateral, a key aspect of nursing care. We proposed a method for recognizing a patient's 3D posture at close range by simultaneously using a fisheye camera and an RGB-D camera. For robot motion, we developed a trajectory generation method that adapts to the patient's posture by converting measurement data into a mathematical model. Additionally, we identified the optimal timing for the movement of the robot arms with minimal physical strain by considering human body dynamics. In all experiments using mannequins of different body shapes, the robot successfully reached the target joint and lifted one side of the body by more than 48 degrees. Future work will include detection of joints unaffected by body bulges and application of the method to other repositioning movements.
|
|
13:35-13:40, Paper ThBT3.4 | |
On the Analysis of Stability, Sensitivity and Transparency in Variable Admittance Control for pHRI Enhanced by Virtual Fixtures |
|
Tebaldi, Davide | University of Modena and Reggio Emilia |
Onfiani, Dario | University of Modena and Reggio Emilia |
Biagiotti, Luigi | University of Modena and Reggio Emilia |
Keywords: Physical Human-Robot Interaction, Compliance and Impedance Control, Dynamics
Abstract: The interest in Physical Human-Robot Interaction (pHRI) has significantly increased over the last two decades, thanks to the availability of collaborative robots that ensure user safety during force exchanges. For this reason, stability concerns remain a key issue in the literature whenever new control schemes for pHRI applications are proposed. Due to the nonlinear nature of robots, stability analyses generally rely on passivity concepts and consider ideal models of robot manipulators. Therefore, the first objective of this paper is to conduct a detailed analysis of the sources of instability in proxy-based constrained admittance controllers for pHRI applications, taking into account parasitic effects such as transmission elasticity, motor velocity saturation, and actuation delay. The second objective of this paper is to perform a sensitivity analysis, supported by experimental results, to identify how the control parameters affect the stability of the overall system. Finally, the third objective is to propose an adaptation technique for the proxy parameters, with the goal of maximizing transparency in pHRI. The proposed adaptation method is validated through simulations and experimental tests.
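As background for the analysis above, the sketch below integrates a basic fixed-parameter 1-DOF admittance law, M x_ddot + D x_dot + K (x - x0) = F_ext, with semi-implicit Euler; the parameters and force profile are assumptions for illustration, not the proxy-based variable-admittance controller studied in the paper.

```python
# Discrete-time 1-DOF admittance law (illustrative sketch, fixed parameters).
M, D, K, dt = 2.0, 20.0, 100.0, 0.001
x, xd, x0 = 0.0, 0.0, 0.0
log = []
for step in range(2000):
    F_ext = 10.0 if step < 1000 else 0.0           # human pushes for 1 s, then releases
    xdd = (F_ext - D * xd - K * (x - x0)) / M      # admittance dynamics
    xd += xdd * dt                                  # semi-implicit Euler integration
    x += xd * dt
    log.append(x)
print("max displacement under load:", max(log), "| final:", log[-1])
```

Variable admittance schemes adapt M, D, and K online, which is exactly where the stability and transparency trade-offs analyzed in this paper arise.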
|
|
13:40-13:45, Paper ThBT3.5 | |
Subject-Independent sEMG-Based Prosthetic Control Using MAMBA2 with Domain Adaptation |
|
Kim, Kihyun | Gwangju Institute of Science and Technology |
Kang, Jiyeon | Gwangju Institute of Science and Technology |
Keywords: Physical Human-Robot Interaction, Prosthetics and Exoskeletons, Neurorobotics
Abstract: Integrating functional wrist articulation in prosthetic robot arms is crucial for enhancing natural movement and reducing compensatory upper limb motions. However, two significant challenges remain in surface electromyography (sEMG)-based prosthetic control: (1) real-time processing via efficient model design and (2) cross-subject generalization to address the individual variability in muscle signals. This study employs the MAMBA2 architecture to address the first challenge, leveraging Structured State Space Models (SSM) for efficient long-sequence inference. This enables real-time control with minimal computational overhead, making it well-suited for prosthetic robot arm applications. To tackle the second challenge, we implement a Representation Subspace Distance (RSD)-based Unsupervised Domain Adaptation (UDA), which preserves feature scale while aligning inter-subject variations, mitigating domain shift effects, and improving subject-independent wrist movement estimation. The model is trained on the Ninapro DB2 dataset, utilizing multi-channel sEMG signals and corresponding wrist kinematics. Evaluation results demonstrate that the MAMBA architecture outperforms conventional recurrent neural networks, achieving lower Mean Squared Error (MSE) and higher R² values, with the Attention variant exhibiting the best prediction performance. Furthermore, this study highlights that the proposed UDA approach, combined with RSD-based alignment, significantly enhances cross-subject performance, reducing the need for extensive calibration. By enabling real-time processing through a computationally efficient model structure and effectively addressing cross-subject variability, this study contributes to developing a more reliable and generalizable sEMG-based robotic prosthesis controller, ultimately improving its applicability across diverse individuals.
|
|
13:45-13:50, Paper ThBT3.6 | |
Vision-Based Contact Wrench Estimation in Human-Robot Interaction |
|
Farajtabar, Mohammad | University of Calgary |
Christa, Sabine | University of Erlangen-Nuremberg |
Charbonneau, Marie | University of Calgary |
Keywords: Physical Human-Robot Interaction, Multi-Modal Perception for HRI, Safety in HRI
Abstract: With the rapid integration of robotics across diverse sectors, human interaction with these technologies is becoming inevitable. In physical human-robot interaction (pHRI), where direct contact occurs, ensuring safety is crucial to prevent injuries and maintain effective interaction. Accurate force estimation enables robots to sense contact forces and respond appropriately. This paper presents a vision-based method for estimating interaction wrenches in multi-contact pHRI. Utilizing an RGB-D sensor, it detects 3D hand positions to identify contact points and employs a generalized momentum observer (GMO) to distinguish joint torques from external wrenches. A long short-term memory (LSTM) network compensates for uncertainties arising from unmodelled dynamics. Addressing challenges like wrench null space and Jacobian singularities, the approach identifies computable external wrench components. The method achieves a 0.9 N estimation error in complex, multi-contact interactions, enhancing safety and responsiveness. Key contributions include a novel wrench identification method leveraging robot configuration and contact points, derived from a vision-based system, to enhance real-time estimation.
|
|
13:50-13:55, Paper ThBT3.7 | |
Touch-Sensitive Hand Interactions for Social Robots Using Fiber Bragg Grating Sensors |
|
Gaitán-Padilla, María | Federal University of Espirito Santo |
Garcia A., Daniel E. | Federal University of Espirito Santo |
Sanchez R., Elizabeth | Federal University of Espirito Santo |
Pontes, Maria Jose | Federal University of Espirito Santo |
Segatto, Marcelo | UFES |
Cifuentes, Carlos A. | University of the West of England, Bristol |
Diaz, Camilo | Federal University of Espírito Santo |
Keywords: Physical Human-Robot Interaction, Sensor-based Control, Social HRI
Abstract: Physical human-robot interaction (pHRI) has been demonstrated to be essential in the implementation of social assistive robots (SARs), which require advanced sensing capabilities for accurate and responsive engagement. This study presents the development and validation of a fiber Bragg grating (FBG) sensor network integrated into the hand of the CASTOR robot to classify complex pHRIs. Nine pHRIs were collected and evaluated within the high five, pets, handshakes, hits, and pinches categories. Four machine learning (ML) algorithms were tested, and the Bagged Decision Tree Classifier (BDTC) achieved the best performance. During testing, the model achieved an accuracy of 98%. The results demonstrate that the proposed FBG sensor network can classify complex pHRIs. Future work will explore additional instrumented areas of the robot and expand the physical interaction analysis to enhance social robot adaptability and user experience.
|
|
13:55-14:00, Paper ThBT3.8 | |
Human Demonstrations Are Generalizable Knowledge for Robots |
|
Cui, Te | Beijing Institute of Technology |
Zhou, Tianxing | Beijing Institute of Technology |
Hu, Mengxiao | Beijing Institute of Technology |
Lu, Haoyang | Beijing Institute of Techonology |
Peng, Zicai | Beijing Institute of Technology |
Li, Haizhou | Beijing Institute of Technology |
Chen, Guangyan | Beijing Institute of Technology |
Wang, Meiling | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Physical Human-Robot Interaction, Perception-Action Coupling
Abstract: Learning from human demonstrations is an emerging trend for designing intelligent robotic systems. However, previous methods typically regard videos as instructions, simply dividing videos into action sequences for robotic repetition, an approach that poses obstacles to generalization across diverse tasks or object instances. In this paper, we propose a different perspective, considering human demonstration videos not as mere instructions, but as a source of knowledge for robots. Motivated by this perspective and the remarkable comprehension and generalization capabilities exhibited by large language models (LLMs), we propose DigKnow, a method that DIstills Generalizable KNOWledge with a hierarchical structure. Specifically, DigKnow begins by converting human demonstration video frames into observation knowledge. This knowledge is then subjected to analysis to extract human action knowledge and further distilled into pattern knowledge that comprises task and object instances, resulting in the acquisition of generalizable knowledge with a hierarchical structure. In settings with different tasks or object instances, DigKnow retrieves relevant knowledge for the current task and object instances. Subsequently, the LLM-based planner conducts planning based on the retrieved knowledge, and the policy executes actions in line with the plan to achieve the designated task. Utilizing the retrieved knowledge, we validate and rectify planning and execution outcomes, resulting in a substantial enhancement of the success rate. Experimental results across a range of tasks and scenes demonstrate the effectiveness of this approach in facilitating real-world robots to accomplish tasks with the knowledge derived from human demonstrations.
|
|
ThBT4 |
404 |
AI-Enabled Robotics 2 |
Regular Session |
|
13:20-13:25, Paper ThBT4.1 | |
AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning |
|
Zhang, Kai | College of Control Science and Engineering, Shandong University |
Chen, Xingyu | Johns Hopkins University |
Zhang, Xiaofeng | Shanghai Jiao Tong University |
Keywords: AI-Based Methods, RGB-D Perception, Sensor Fusion
Abstract: Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning, demonstrating remarkable capabilities in 3D scene understanding. However, current 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies: excessive computational overhead and redundant information flows. Unlike 2D VLMs processing single images, 3D LMMs exhibit inherent architectural redundancy due to the heterogeneous mechanisms between spatial tokens and visual tokens. To address this challenge, we propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens through spatial contribution analysis. Our method automatically tailors pruning strategies to different 3D LMM architectures by quantifying token-level information flows via attention pattern mining. Extensive experiments on LLaVA-3D (a 7B-parameter 3D LMM) demonstrate that AdaToken-3D achieves 21% faster inference speed and 63% FLOPs reduction while maintaining original task accuracy. Beyond efficiency gains, this work systematically investigates redundancy patterns in multimodal spatial information flows through quantitative token interaction analysis. Our findings reveal that over 60% of spatial tokens contribute minimally (<5%) to the final predictions, establishing theoretical foundations for efficient 3D multimodal learning.
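As a minimal illustration of the attention-based token pruning described above (the function name, keep ratio, and head-averaging scheme are assumptions, not AdaToken-3D's actual procedure), spatial tokens could be ranked by the attention mass they receive and the low-contribution ones dropped:

import torch

def prune_spatial_tokens(tokens, attn, keep_ratio=0.4):
    """Drop spatial tokens whose summed attention contribution is low.

    tokens: (B, N, D) spatial token embeddings
    attn:   (B, H, Q, N) attention weights from query tokens onto the N
            spatial tokens, averaged over heads and queries below
    keep_ratio: fraction of spatial tokens to keep (hypothetical setting)
    """
    # Average attention mass each spatial token receives across heads/queries.
    contribution = attn.mean(dim=1).mean(dim=1)           # (B, N)
    k = max(1, int(keep_ratio * tokens.shape[1]))
    keep_idx = contribution.topk(k, dim=1).indices        # (B, k)
    keep_idx, _ = keep_idx.sort(dim=1)                     # preserve token order
    batch_idx = torch.arange(tokens.shape[0]).unsqueeze(-1)
    return tokens[batch_idx, keep_idx], keep_idx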
|
|
13:25-13:30, Paper ThBT4.2 | |
Autonomous Subtask Generation for Indoor Search and Rescue Mission Via Large-Language-Model and Behavior-Tree Integration |
|
Shi, Junfeng | National University of Defense Technology |
Huang, Kaihong | National University of Defense Technology |
Pan, Hainan | National University of Defense Technology |
Xu, Junpeng | National University of Defense Technology |
Cheng, Chuang | National University of Defense Technology |
Zhang, Hui | National University of Defense Technology |
Keywords: AI-Enabled Robotics, Task Planning, Vision-Based Navigation
Abstract: The ability of autonomous subtask generation is important for robots to effectively cope with unforeseen situations during indoor search and rescue missions. While prior work mainly focused on improving individual low-level skills of the rescue robot, this paper proposes AutoExpand: a high-level framework that takes advantage of the extensive knowledge and reasoning abilities inherent in large language models (LLM) to understand human instructions and the environmental situation. By tightly coupling an LLM with a behavior tree, our method enables the robot to autonomously generate reactive, context-aware operational subtasks on-site without human intervention or additional training. A series of real-world experiments demonstrate that AutoExpand can effectively generate appropriate tasks for search and rescue missions, leading to a search scope increased by 34.45% when compared with traditional methods. The sample code is available at https://github.com/nubot-nudt/AutoExpand.
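A rough sketch of how LLM-generated subtasks might be grafted into a behavior tree; the node class, prompt format, and skill names below are hypothetical stand-ins, not the AutoExpand implementation:

from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal behavior-tree node: a named action with optional children."""
    name: str
    children: list = field(default_factory=list)

def expand_with_llm(tree: Node, instruction: str, context: str, llm) -> None:
    """Ask the LLM for context-aware subtasks and graft them into the tree.

    `llm` is any callable returning one subtask per line; the prompt and the
    skill vocabulary are illustrative only.
    """
    prompt = (f"Instruction: {instruction}\nEnvironment: {context}\n"
              "List the next subtasks, one per line, using skills "
              "[navigate_to, open_door, scan_room, report_victim]:")
    for line in llm(prompt).strip().splitlines():
        tree.children.append(Node(name=line.strip()))

# Usage with a stubbed LLM:
fake_llm = lambda p: "navigate_to corridor\nscan_room corridor"
root = Node("search_and_rescue")
expand_with_llm(root, "search the east corridor", "door ahead is closed", fake_llm)
print([c.name for c in root.children])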
|
|
13:30-13:35, Paper ThBT4.3 | |
Enhancing Robustness in Language-Driven Robotics: A Modular Approach to Failure Reduction |
|
Garrabé, Émiland | ISIR, Sorbonne Université |
Teixeira, Pierre | ISIR, Sorbonne Université |
Khoramshahi, Mahdi | Sorbonne University |
Doncieux, Stéphane | Sorbonne University |
Keywords: AI-Enabled Robotics, AI-Based Methods, Human-Centered Robotics
Abstract: Recent advances in large language models (LLMs) have led to significant progress in robotics, enabling embodied agents to understand and execute open-ended tasks. However, existing approaches using LLMs face limitations in grounding their outputs within the physical environment and aligning with the capabilities of the robot. While fine-tuning is an attractive approach to addressing these issues, the required data can be expensive to collect. Smaller language models, while more computationally efficient, are less robust in task planning and execution, leading to a difficult trade-off between performance and tractability. In this paper, we present a novel, modular architecture designed to enhance the robustness of locally-executable LLMs in the context of robotics by addressing these grounding and alignment issues. We formalize the task planning problem within a goal-conditioned POMDP framework, identify key failure modes in LLM-driven planning, and propose targeted design principles to mitigate these issues. Our architecture introduces an ``expected outcomes'' module to prevent mischaracterization of subgoals and a feedback mechanism to enable real-time error recovery. Experimental results, both in simulation and on physical robots, demonstrate that our approach leads to significant improvements in success rates for pick-and-place and manipulation tasks, surpassing baselines using larger models. Through hardware experiments, we also demonstrate how our architecture can be run efficiently and locally. This work highlights the potential of smaller, locally-executable LLMs in robotics and provides a scalable, efficient solution for robust task execution and data collection.
|
|
13:35-13:40, Paper ThBT4.4 | |
Towards an Extremely Robust Baby Robot with Rich Interaction Ability for Advanced Machine Learning Algorithms |
|
Alhakami, Mohannad | King Abdullah University of Science and Technology |
Ashley, Dylan Robert | King Abdullah University of Science and Technology |
Dunham, Joel | OptoXense, Inc |
Dai, Yanning | King Abdullah University of Science and Technology (KAUST) |
Faccio, Francesco | King Abdullah University of Science and Technology |
Feron, Eric | King Abdullah University of Science and Technology |
Schmidhuber, Jurgen | Technische Universität München |
Keywords: AI-Enabled Robotics, Soft Robot Materials and Design, Biologically-Inspired Robots
Abstract: Advanced machine learning algorithms require platforms that are extremely robust and equipped with rich sensory feedback to handle extensive trial-and-error learning without relying on overwhelming inductive biases. Traditional robotic designs, while well-suited for their specific use cases, are often fragile when used with these algorithms as they fail to address the intermediate sub-optimal posterior-based behavior these algorithms exhibit. To address this gap---and inspired by the vision of enabling curiosity-driven baby robots---we present a novel robotic limb designed from scratch. Our design features a semi-soft structure, a high degree of redundancy achieved through rich non-contact sensors (exclusively cameras), and strategically designed, easily replaceable failure points. Proof-of-concept experiments using two contemporary reinforcement learning algorithms on a physical prototype demonstrate that our design is able to succeed in a target-finding task even under simulated sensor failures, all with minimal human oversight during extended learning periods. Additional experiments on the robustness of the design show that it is able to withstand relatively large amounts of mechanical stress. We believe this design represents a concrete step toward more tailored robotic designs capable of supporting general-purpose, generally intelligent robots.
|
|
13:40-13:45, Paper ThBT4.5 | |
STEP Planner: Constructing Cross-Hierarchical Subgoal Tree As an Embodied Long-Horizon Task Planner |
|
Zhou, Tianxing | Beijing Institute of Technology |
Wang, Zhirui | Beijing Institute of Technology |
Ao, Haojia | Beijing Institute of Technology |
Chen, Guangyan | Beijing Institute of Technology |
Xing, Boyang | Beijing Institute of Technology |
Cheng, Jingwen | Humanoid Robot (Shanghai) Co., Ltd |
Yang, Yi | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: AI-Enabled Robotics, Planning, Scheduling and Coordination
Abstract: The ability to perform reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators for robots may lead to a low success rate due to the poor reasoning ability of LLMs for long-horizon embodied tasks. In the proposed STEP framework, we construct a subgoal tree through a pair of closed-loop models: the subgoal decomposition model and the leaf node termination model. Within this framework, we develop a hierarchical tree structure ranging from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down the goal into several subgoals, thereby spanning the subgoal tree. The leaf node termination model provides real-time feedback based on the environmental state, which determines when to terminate the spanning of the subgoal tree, ensuring each leaf node can be directly converted into a primitive action. Experiments conducted on the VirtualHome Watch-And-Help (WAH) benchmark and a real robot demonstrate that STEP achieves long-horizon embodied task completion with success rates of up to 34% (WAH) and 25% (real robot), outperforming SOTA methods.
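The coarse-to-fine subgoal tree could be sketched as a simple recursion in which the decomposition and leaf-termination models are abstracted as callables; all names and the depth cap are illustrative, not the STEP implementation:

def build_subgoal_tree(goal, decompose, is_primitive, depth=0, max_depth=4):
    """Recursively span a subgoal tree (a nested dict) from coarse to fine.

    decompose(goal)    -> list of subgoal strings (an LLM call in STEP)
    is_primitive(goal) -> True when the leaf-termination model decides the
                          subgoal maps directly to a primitive action
    Both callables and max_depth are stand-ins for the paper's models.
    """
    if is_primitive(goal) or depth >= max_depth:
        return {"goal": goal, "children": []}
    return {"goal": goal,
            "children": [build_subgoal_tree(sg, decompose, is_primitive,
                                            depth + 1, max_depth)
                         for sg in decompose(goal)]}

def leaf_actions(tree):
    """Flatten the tree into an executable primitive-action sequence."""
    if not tree["children"]:
        return [tree["goal"]]
    return [a for child in tree["children"] for a in leaf_actions(child)]

# Example with stubbed models:
tree = build_subgoal_tree(
    "tidy the table",
    decompose=lambda g: [f"{g}: step {i}" for i in (1, 2)],
    is_primitive=lambda g: g.count("step") >= 2)
print(leaf_actions(tree))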
|
|
13:45-13:50, Paper ThBT4.6 | |
RaGNNarok: A Light-Weight Graph Neural Network for Enhancing Radar Point Clouds on Unmanned Ground Vehicles |
|
Hunt, David | Duke University |
Luo, Shaocheng | Duke University |
Hallyburton, Robert | Duke University |
Li, Yi | Duke University |
Nillongo, Shafii Issa | Duke University |
Chen, Tingjun | Duke University |
Pajic, Miroslav | Duke University |
Keywords: AI-Enabled Robotics, Deep Learning Methods, Range Sensing
Abstract: Low-cost indoor mobile robots have gained popularity with the increasing adoption of automation in homes and commercial spaces. However, existing lidar and camera-based solutions have limitations such as poor performance in visually obscured environments, high computational overhead for data processing, and high costs for lidars. In contrast, mmWave radar sensors offer a cost-effective and lightweight alternative, providing accurate ranging regardless of visibility. However, existing radar-based localization suffers from sparse point cloud generation, noise, and false detections. Thus, in this work, we introduce RaGNNarok, a real-time, lightweight, and generalizable graph neural network (GNN)-based framework to enhance radar point clouds, even in complex and dynamic environments. With an inference time of just 7.3 ms on the low-cost Raspberry Pi 5, RaGNNarok runs efficiently even on such resource-constrained devices, requiring no additional computational resources. We evaluate its performance across key tasks, including localization, SLAM, and autonomous navigation, in three different environments. Our results demonstrate strong reliability and generalizability, making RaGNNarok a robust solution for low-cost indoor mobile robots.
|
|
13:50-13:55, Paper ThBT4.7 | |
Task-Guided and Object-Centric Conditioning for Effective and Adaptive Diffusion Policy |
|
Wang, Wenshuo | National University of Singapore |
Zhao, Ruiteng | National University of Singapore |
Teo, Tat Joo | Singapore Institute of Manufacturing Technology |
Ang Jr, Marcelo H | National University of Singapore |
Zhu, Haiyue | Agency for Science, Technology and Research (A*STAR) |
Keywords: AI-Enabled Robotics, Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Imitation learning has emerged as an effective paradigm for training visuo-motor policies in robotic manipulation. In real-world scenarios, visuo-motor policies are required to be effective, sample-efficient, and capable of adapting to dynamic environments. A key factor influencing these capabilities is the quality of visual representations. Conventional approaches that learn a vision encoder and policy network from scratch often result in suboptimal representations, as the training process tends to prioritize policy optimization over rich semantic feature extraction. Alternatively, while pre-trained large vision models offer strong general-purpose features, they often fail to capture the fine-grained, task-specific information required for effective manipulation. To capture rich and informative visual features, we propose TOC-DP, a novel framework that integrates SlotAttention to facilitate object-centric representation learning. Task-specific segmentation priors are incorporated as an inductive bias to enhance the task-awareness and object-awareness of the learned visual features. The extracted representations are subsequently refined to encode action-aware information during downstream policy learning. Extensive experiments on the Meta-World benchmark and real-world tasks demonstrate that TOC-DP improves success rate by 30% over baseline methods across a variety of task scenarios.
|
|
ThBT5 |
407 |
Formal Method in Robotics and Automation 2 |
Regular Session |
|
13:20-13:25, Paper ThBT5.1 | |
Automated Manipulation of Magnetic Microswarms for Temporal Logic Cargo Delivery Tasks in Complex Environments |
|
Zhang, Naifu | Xiamen University |
Luo, Tao | Xiamen University |
Jiang, Chongjie | Xiamen University |
Yin, Xiang | Shanghai Jiao Tong Univ |
Yu, Xiao | Xiamen University |
Ji, Rongrong | Xiamen University |
Keywords: Formal Methods in Robotics and Automation, Motion and Path Planning, Automation at Micro-Nano Scales
Abstract: Micromanipulation using magnetic microswarms has garnered significant attention in recent years due to their potential in microscale cargo delivery tasks. While existing studies have demonstrated the capabilities of microswarms in basic manipulation tasks, they often lack the autonomy required to handle more complex specifications, particularly temporal logic tasks. In this paper, we propose a novel formal planning strategy for magnetic microswarms that enables cargo delivery in complex environments while satisfying finite linear temporal logic (LTLf) specifications. Our approach consists of two key components. First, we develop a high-level path planner based on a bidirectional temporal logic rapidly-exploring random tree star (BTL-RRT*) algorithm, which facilitates efficient planning while ensuring compliance with the given task specifications. Second, we employ an automaton to manage the manipulation modes of the microswarm, enabling real-time control over the capture and release of cargoes. In addition, we implement the planning strategy on microswarms actuated by a visual feedback magnetic tweezers system. Extensive simulations and experimental results demonstrate the effectiveness of the proposed planning strategy. The results indicate that, using the proposed approach, microswarms can autonomously select and deliver multiple microbeads to designated regions in both static and dynamic environments, adhering to the LTLf specifications.
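One way to picture the automaton component is an RRT*-style extension step that only accepts a new node if the LTLf automaton has a defined transition on the propositions observed there; the data structures and toy specification below are stand-ins, not the BTL-RRT* code:

from dataclasses import dataclass

@dataclass
class Node:
    x: tuple              # workspace position
    q: str                # LTLf automaton state
    parent: "Node" = None

def automaton_step(delta, q, labels):
    """Advance a deterministic automaton on the propositions observed at a
    candidate waypoint; returns None if the transition is undefined."""
    return delta.get((q, frozenset(labels)))

def try_extend(nodes, x_new, labels, delta):
    """RRT*-style extension gated by the automaton (nearest = Euclidean here)."""
    parent = min(nodes, key=lambda n: sum((a - b) ** 2 for a, b in zip(n.x, x_new)))
    q_new = automaton_step(delta, parent.q, labels)
    if q_new is None:
        return None                      # extension would violate the spec
    node = Node(x_new, q_new, parent)
    nodes.append(node)
    return node

# Toy spec: visit region "pickup" before "dropoff".
delta = {("q0", frozenset()): "q0",
         ("q0", frozenset({"pickup"})): "q1",
         ("q1", frozenset()): "q1",
         ("q1", frozenset({"dropoff"})): "q_acc"}
nodes = [Node((0.0, 0.0), "q0")]
try_extend(nodes, (1.0, 0.0), {"pickup"}, delta)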
|
|
13:25-13:30, Paper ThBT5.2 | |
Code Generation and Monitoring for Deliberation Components in Autonomous Robots |
|
Bernagozzi, Stefano | Istituto Italiano Di Tecnologia |
Faraci, Sofia | Italian Institute of Technology |
Ghiorzi, Enrico | Università Degli Studi Di Genova |
Pedemonte, Karim | Università Degli Studi Di Genova |
Natale, Lorenzo | Istituto Italiano Di Tecnologia |
Tacchella, Armando | Università Di Genova |
Keywords: Methods and Tools for Robot System Design, Formal Methods in Robotics and Automation, Software, Middleware and Programming Environments
Abstract: Hand-coded deliberation components are prone to flaws that may not be discovered before deployment and that can be harmful to the robot and its execution environment, including the people within it. To reduce development effort and at the same time increase confidence in the robot's safety, we propose to model deliberation components at a conceptual level, to automatically generate code from such models, and to monitor their execution during robot operation. We present two tools, one which compiles models of deliberation components into executable code, and one which generates runtime monitors from the models. We have tested them in simulation to demonstrate the usefulness of combining model-based development, code generation, and monitoring.
|
|
13:30-13:35, Paper ThBT5.3 | |
Physically-Feasible Reactive Synthesis for Terrain-Adaptive Locomotion Via Trajectory Optimization and Symbolic Repair |
|
Zhou, Ziyi | Georgia Institute of Technology |
Meng, Qian | Cornell University |
Kress-Gazit, Hadas | Cornell University |
Zhao, Ye | Georgia Institute of Technology |
Keywords: Legged Robots, Formal Methods in Robotics and Automation, Task and Motion Planning
Abstract: We propose an integrated planning framework for quadrupedal locomotion over dynamically changing, unforeseen terrains. Existing approaches either rely on heuristics for instantaneous foothold selection--compromising safety and versatility--or solve expensive trajectory optimization problems with complex terrain features and long time horizons. In contrast, our framework leverages reactive synthesis to generate correct-by-construction controllers at the symbolic level, and mixed-integer convex programming (MICP) for dynamic and physically feasible footstep planning for each symbolic transition. We use a high-level manager to reduce the large state space in synthesis by incorporating local environment information, improving synthesis scalability. To handle specifications that cannot be met due to dynamic infeasibility, and to minimize costly MICP solves, we leverage a symbolic repair process to generate only necessary symbolic transitions. During online execution, re-running the MICP with real-world terrain data, along with runtime symbolic repair, bridges the gap between offline synthesis and online execution. We demonstrate, in simulation, our framework's capabilities to discover missing locomotion skills and react promptly in safety-critical environments, such as scattered stepping stones and rebars.
|
|
13:40-13:45, Paper ThBT5.5 | |
Robotic in Situ Measurement of Multiple Intracellular Physical Parameters Based on Three-Micropipettes System |
|
Liu, Mengya | Nankai University |
Qiu, Jinyu | Nankai University |
Fu, Shaojie | Nankai University |
Li, Ruimin | Nankai University |
Liu, Yuzhu | Nankai University |
Chen, Hao | Nankai University |
Zhao, Xin | Nankai University |
Zhao, Qili | Nankai University |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales, Dual Arm Manipulation
Abstract: Intracellular physical parameters, such as mass density, intracellular pressure, and elasticity, have significant effects on cell physiological activities and the outcomes of intracellular operations. However, the distinctly different measurement principles of these parameters make their in situ measurement on the same cell a challenging task, which significantly limits research on the comprehensive mechanisms by which they regulate cell physiological activities and intracellular operation outcomes. This paper presents, for the first time, a robotic in situ measurement system for multiple intracellular physical parameters, based on a self-developed three-micropipettes system. Using this system, the mass density, elasticity, and intracellular pressure of the same cell are measured sequentially through a robotic in situ automated measurement process. Experimental results on sheep oocytes show an average measurement success rate of 83.3% at a speed of 97.75 s/cell. The measured values of the three parameters are close to those reported individually in the literature, while the operation time is significantly shorter than the combined times reported in those references. Our system lays a solid foundation for future research on the comprehensive regulatory mechanisms of these parameters on cell physiological activities and on intracellular operation outcomes.
|
|
13:45-13:50, Paper ThBT5.6 | |
Kinematic Synthesis of a Serial Manipulator Using Gradient-Based Optimization on Lie Groups |
|
Shirafuji, Shouhei | Kansai University |
Shimamura, Keiichiro | Kansai University |
Keywords: Mechanism Design, Kinematics
Abstract: This paper addresses a specialized kinematic synthesis problem: designing a manipulator capable of following a specific trajectory of end-effector positions and orientations with minimal actuators. This requires optimizing the robot's kinematic parameters and solving inverse kinematics to ensure its configuration aligns with the desired trajectory. This paper introduces a method for optimizing robot design by representing joint motions using Lie algebra and applying the Levenberg–Marquardt (LM) algorithm. The proposed approach integrates inverse kinematics into the optimization process, solving both problems simultaneously. To achieve this, the method computes derivatives of the end-effector's positions and orientations with respect to both kinematic parameters and the robot's configuration, leveraging the intrinsic relationship between Lie groups and their corresponding Lie algebra. The use of Lie algebra-based derivatives eliminates the singularities inherent in traditional kinematic parameterizations, enhancing stability and smoothness in the optimization process. Experimental results on a synthetic example demonstrate the method's robustness, showing independence from initial parameter selection and superiority over approaches based on local parameterization.
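A rough Python sketch of the underlying formulation, using product-of-exponentials forward kinematics and scipy's Levenberg-Marquardt solver over both the screw parameters and the joint angles; the analytic Lie-algebra Jacobians of the paper are replaced by scipy's numerical differencing, and the residual is a simplified pose error, so this is illustrative only:

import numpy as np
from scipy.linalg import expm
from scipy.optimize import least_squares

def hat(xi):
    """se(3) twist (6,) -> 4x4 matrix, with the first three entries the rotation part."""
    w, v = xi[:3], xi[3:]
    W = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    T = np.zeros((4, 4)); T[:3, :3] = W; T[:3, 3] = v
    return T

def fk(screws, thetas):
    """Product-of-exponentials forward kinematics (identity home pose)."""
    T = np.eye(4)
    for xi, th in zip(screws, thetas):
        T = T @ expm(hat(xi) * th)
    return T

def residual(params, targets, n_joints):
    """Stack pose errors over the trajectory; screws and all joint angles are
    optimized jointly, mirroring the simultaneous design + IK formulation."""
    screws = params[:6 * n_joints].reshape(n_joints, 6)
    thetas = params[6 * n_joints:].reshape(len(targets), n_joints)
    return np.concatenate([(fk(screws, th) - Tg)[:3, :].ravel()
                           for th, Tg in zip(thetas, targets)])

# Tiny smoke test: recover parameters of a 2-joint chain from two target poses.
true_screws = np.array([[0, 0, 1, 0, 0, 0], [0, 0, 1, 0.3, 0, 0]], float)
targets = [fk(true_screws, th) for th in ([0.4, -0.2], [1.0, 0.5])]
x0 = np.concatenate([true_screws.ravel() + 0.05, np.zeros(4)])
sol = least_squares(residual, x0, method="lm", args=(targets, 2))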
|
|
13:50-13:55, Paper ThBT5.7 | |
ChatBuilder: LLM-Assisted Modular Robot Creation |
|
Chen, Xin | Southeast University |
Gao, Xifeng | Tencent America |
Zhu, Lifeng | Southeast University |
Song, Aiguo | Southeast University |
Pan, Zherong | Tencent America |
Keywords: Methods and Tools for Robot System Design, Cellular and Modular Robots
Abstract: Modular robotic structures simplify robot design and manufacturing by using standardized modules, enhancing flexibility and adaptability. However, the need for manual input in design and assembly limits their potential. Current methods to automate this process still require significant human effort and technical expertise. This paper introduces a novel approach that employs Large Language Models (LLMs) as intelligent agents to automate the creation of modular robotic structures. We decompose the modular robot creation task and develop two LLM-based agents to plan and assemble the modular robots from text prompts. By inputting a textual description, users can generate robot designs that are validated in both simulated and real-world environments. This method reduces the need for manual intervention and lowers the technical barrier to creating complex robotic systems.
|
|
ThBT6 |
301 |
Deep Learning in Grasping and Manipulation 6 |
Regular Session |
|
13:20-13:25, Paper ThBT6.1 | |
Domain-Invariant Feature Learning Via Margin and Structure Priors for Robotic Grasping |
|
Chen, Lu | Shanxi University |
Li, Zhuomao | Shanxi University |
Lu, Zhenyu | South China University of Technology |
Wang, Yuwei | Shanxi University |
Nie, Hong | Shanxi University |
Yang, Chenguang | University of Liverpool |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Deep Learning Methods
Abstract: Existing grasp detection methods usually rely on data-driven strategies to learn grasping features from labelled data, restricting their generalization to new scenes and objects. Preliminary research introduces domain-invariant methods that tend to consider only single visual representations and ignore the widely existing graspable priors shared by objects from multiple domains. To solve this, we propose a grasp detection network guided by novel margin and structure priors to effectively extract domain-invariant grasping features, leading to more accurate grasps. The structure prior aggregates dynamic high-probability grasping features and static multi-scene structure representations using cross-attention. The margin prior employs cosine similarity to encode the visual association between grasp boxes and foreground-background through metric learning. Extensive comparative experiments under public datasets (single domain, cross domains and stricter metrics) and real-world scenarios are thoroughly deployed to prove the superiority of the proposed method, especially for complicated backgrounds and cluttered objects. Under cross-domain scenarios, the average improvements caused by prior information for existing methods are 6.87% and 9.53% on VMRD and GraspNet datasets. Moreover, under stricter metrics, our MSPNet outperforms SOTA methods by 9.0% and 12.9% on Cornell and Jacquard datasets.
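The margin prior could, for instance, take the form of a hinge loss on cosine similarities between grasp-box features and foreground versus background features; the margin value and tensor shapes below are assumptions rather than the paper's exact formulation:

import torch
import torch.nn.functional as F

def margin_prior_loss(grasp_feat, fg_feat, bg_feat, margin=0.3):
    """Hinge-style metric-learning loss: grasp-box features should be closer
    (in cosine similarity) to foreground features than to background ones.

    All inputs are (N, D) feature batches; the margin value is illustrative.
    """
    sim_fg = F.cosine_similarity(grasp_feat, fg_feat, dim=-1)
    sim_bg = F.cosine_similarity(grasp_feat, bg_feat, dim=-1)
    return F.relu(margin + sim_bg - sim_fg).mean()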
|
|
13:25-13:30, Paper ThBT6.2 | |
GR-MG: Leveraging Partially-Annotated Data Via Multi-Modal Goal-Conditioned Policy |
|
Li, Peiyan | Institute of Automation, Chinese Academy of Sciences |
Wu, Hongtao | Bytedance |
Huang, Yan | Institute of Automation, Chinese Academy of Sciences |
Cheang, Chilam | Fudan University |
Wang, Liang | Institute of Automation, Chinese Academy of Sciences |
Kong, Tao | ByteDance |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Learning from Demonstration
Abstract: The robotics community has consistently aimed to achieve generalizable robot manipulation with flexible natural language instructions. One primary challenge is that obtaining robot trajectories fully annotated with both actions and texts is time-consuming and labor-intensive. However, partially-annotated data, such as human activity videos without action labels and robot trajectories without text labels, are much easier to collect. Can we leverage these data to enhance the generalization capabilities of robots? In this paper, we propose GR-MG, a novel method which supports conditioning on a text instruction and a goal image. During training, GR-MG samples goal images from trajectories and conditions on both the text and the goal image or solely on the image when text is not available. During inference, where only the text is provided, GR-MG generates the goal image via a diffusion-based image-editing model and conditions on both the text and the generated image. This approach enables GR-MG to leverage large amounts of partially-annotated data while still using languages to flexibly specify tasks. To generate accurate goal images, we propose a novel progress-guided goal image generation model which injects task progress information into the generation process. In simulation experiments, GR-MG improves the average number of tasks completed in a row of 5 from 3.35 to 4.04. In real-robot experiments, GR-MG is able to perform 58 different tasks and improves the success rate from 68.7% to 78.1% and 44.4% to 60.6% in simple and generalization settings, respectively. It also outperforms the compared baseline methods in few-shot learning of novel skills. Video demos, code, and checkpoints are available on the project page: https://gr-mg.github.io/.
|
|
13:30-13:35, Paper ThBT6.3 | |
GAP-RL: Grasps As Points for RL towards Dynamic Object Grasping |
|
Xie, Pengwei | Tsinghua University |
Chen, Siang | Tsinghua University |
Chen, Qianrun | Vanderbilt University |
Tang, Wei | Tsinghua University |
Hu, Dingchang | Tsinghua University |
Dai, Yixiang | Tsinghua University |
Chen, Rui | Tsinghua University |
Wang, Guijin | Tsinghua University |
Keywords: Deep Learning in Grasping and Manipulation, Reinforcement Learning, Computer Vision for Automation
Abstract: Dynamic grasping of moving objects in complex, continuous motion scenarios remains challenging. Reinforcement Learning (RL) has been applied in various robotic manipulation tasks, benefiting from its closed-loop property. However, existing RL-based methods do not fully explore the potential for enhancing visual representations. In this letter, we propose a novel framework called Grasps As Points for RL (GAP-RL) to effectively and reliably grasp moving objects. By implementing a fast region-based grasp detector, we build a Grasp Encoder by transforming 6D grasp poses into Gaussian points and extracting grasp features as a higher-level abstraction than the original object point features. Additionally, we develop a Graspable Region Explorer for real-world deployment, which searches for consistent graspable regions, enabling smoother grasp generation and stable policy execution. To assess the performance fairly, we construct a simulated dynamic grasping benchmark involving objects with various complex motions. Experiment results demonstrate that our method effectively generalizes to novel objects and unseen dynamic motions compared with other baselines. Real-world experiments further validate the framework's sim-to-real transferability. Our code and benchmark environment are available at https://github.com/THU-VCLab/GAP-RL.
|
|
13:35-13:40, Paper ThBT6.4 | |
A Backbone for Long-Horizon Robot Task Understanding |
|
Chen, Xiaoshuai | Imperial College London |
Chen, Wei | Imperial College London |
Lee, Dongmyoung | Imperial College London |
Ge, Yukun | Imperial College London |
Rojas, Nicolas | RAI Institute |
Kormushev, Petar | Imperial College London |
Keywords: Deep Learning in Grasping and Manipulation, Manipulation Planning, Learning from Demonstration
Abstract: End-to-end robot learning, particularly for long-horizon tasks, often results in unpredictable outcomes and poor generalization. To address these challenges, we propose a novel Therblig-Based Backbone Framework (TBBF) as a fundamental structure to enhance interpretability, data efficiency, and generalization in robotic systems. TBBF utilizes expert demonstrations to enable therblig-level task decomposition, facilitate efficient action-object mapping, and generate adaptive trajectories for new scenarios. The approach consists of two stages: offline training and online testing. During the offline training stage, we developed the Meta-RGate SynerFusion (MGSF) network for accurate therblig segmentation across various tasks. In the online testing stage, after a one-shot demonstration of a new task is collected, our MGSF network extracts high-level knowledge, which is then encoded into the image using Action Registration (ActionREG). Additionally, Large Language Model (LLM)-Alignment Policy for Visual Correction (LAP-VC) is employed to ensure precise action registration, facilitating trajectory transfer in novel robot scenarios. Experimental results validate these methods, achieving 94.37% recall in therblig segmentation and success rates of 94.4% and 80% in real-world online robot testing for simple and complex scenarios, respectively.
|
|
13:40-13:45, Paper ThBT6.5 | |
Sim2Real Learning with Domain Randomization for Autonomous Guidewire Navigation in Robotic-Assisted Endovascular Procedures (I) |
|
Yao, Tianliang | Tongji University |
Wang, Haoyu | Shanghai University of Engineering Science |
Lu, Bo | Soochow University |
Ge, Jiajia | Siemens Healthineers Ltd |
Pei, Zhiqiang | University of Shanghai for Science and Technology |
Kowarschik, Markus | Siemens Healthineers AG |
Sun, Lining | Soochow University |
Seneviratne, Lakmal | L. D. Seneviratne Is with Kings College London, UK, and Robotics |
Qi, Peng | Tongji University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Modeling, Control, and Learning for Soft Robots
Abstract: Over the past decade, significant advancements have been made in the research and industrialization of robotic systems for endovascular procedures, yet their clinical application remains relatively limited. Physicians commonly report that these robots lack certain intelligent assistive capabilities during procedures. There has been increasing interest and attempts to apply learning-centered algorithms to the training and enhancement of surgical robot skills. This paper proposes an autonomous navigation algorithm for interventional guidewires that is initially trained solely in a virtual simulation environment and subsequently deployed to a real-world robot. Experimental results demonstrate the feasibility of this approach for real-world applications. The proposed approach can help physicians reduce the learning curve for guidewire manipulation and elevate the robot to a higher level of autonomous operation, thereby breaking through the current bottleneck in the level of intelligence for clinical applications of interventional robots. It also holds promise for bringing intelligent transformation to future interventional procedures.
|
|
13:45-13:50, Paper ThBT6.6 | |
CMRNext: Camera to LiDAR Matching in the Wild for Localization and Extrinsic Calibration |
|
Cattaneo, Daniele | University of Freiburg |
Valada, Abhinav | University of Freiburg |
Keywords: Deep Learning in Robotics and Automation, Localization, Computer Vision for Transportation, AI-Based Methods
Abstract: LiDARs are widely used for mapping and localization in dynamic environments. However, their high cost limits their widespread adoption. On the other hand, monocular localization in LiDAR maps using inexpensive cameras is a cost-effective alternative for large-scale deployment. Nevertheless, most existing approaches struggle to generalize to new sensor setups and environments, requiring retraining or fine-tuning. In this paper, we present CMRNext, a novel approach for camera-LIDAR matching that is independent of sensor-specific parameters, generalizable, and can be used in the wild for monocular localization in LiDAR maps and camera-LiDAR extrinsic calibration. CMRNext exploits recent advances in deep neural networks for matching cross-modal data and standard geometric techniques for robust pose estimation. We reformulate the point-pixel matching problem as an optical flow estimation problem and solve the Perspective-n-Point problem based on the resulting correspondences to find the relative pose between the camera and the LiDAR point cloud. We extensively evaluate CMRNext on six different robotic platforms.
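The final geometric step described above, recovering the relative pose from point-pixel correspondences, can be illustrated with OpenCV's PnP + RANSAC solver; the wrapper below and its reprojection threshold are assumptions, not CMRNext's actual pipeline:

import cv2
import numpy as np

def pose_from_flow_matches(pts3d, pts2d, K, dist=None):
    """Recover the camera-to-LiDAR pose from point-pixel correspondences
    (e.g. produced by a dense matching / optical-flow stage) via PnP + RANSAC.

    pts3d: (N, 3) LiDAR points, pts2d: (N, 2) matched pixels, K: 3x3 intrinsics.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, dist,
        flags=cv2.SOLVEPNP_ITERATIVE, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("PnP failed; not enough consistent matches")
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec.reshape(3), inliers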
|
|
ThBT7 |
307 |
Human-Aware Motion Planning 2 |
Regular Session |
Co-Chair: Lou, Yunjiang | Harbin Institute of Technology, Shenzhen |
|
13:20-13:25, Paper ThBT7.1 | |
Autonomous Adjustment of Tracking Position in Dynamic Environments for Human-Following Robots Using Deep Reinforcement Learning |
|
Vu, Cong-Thanh | National Cheng Kung University |
Liu, Yen-Chen | National Cheng Kung University |
Keywords: Human-Aware Motion Planning, Human Detection and Tracking, Motion Control
Abstract: Achieving flexible human-following in real-world environments remains a critical yet challenging problem in Human-Robot Interaction (HRI). Traditional approaches typically constrain robots to fixed tracking positions—such as following from behind, ahead, or alongside—thereby limiting their adaptability in dynamic and unstructured environments. This study introduces a reinforcement learning-based framework that allows the robot to dynamically adjust its tracking positions in response to workspace constraints. An interaction space is defined to capture the relationship between the human and the robot while considering the environment. This space serves as the basis for state spaces in Deep Reinforcement Learning (DRL), helping the robot adapt to environmental changes. The selected tracking position is then utilized as input for a human-following controller, ensuring smooth and continuous motion. Experimental evaluations in both indoor and outdoor environments demonstrate that the proposed approach enables robots to follow humans flexibly and adaptively, adjusting their positions autonomously and avoiding obstacles without requiring a predefined tracking position.
|
|
13:25-13:30, Paper ThBT7.2 | |
Safe and Efficient Navigation for Differential-Drive Robots in Dynamic Pedestrian Environments |
|
Liu, Wenhao | School of Mechanical Engineering and Automation in Harbin Instit |
Fu, Letian | Harbin Institute of Technology Shenzhen |
Li, Chen | Harbin Institute of Technology, Shenzhen |
Li, Wanlei | Harbin Institute of Technology(ShenZhen) |
Lou, Yunjiang | Harbin Institute of Technology, Shenzhen |
Keywords: Human-Aware Motion Planning, Nonholonomic Motion Planning, Motion and Path Planning
Abstract: Differential-drive robots are widely used in dynamic pedestrian environments, such as hospitals, for time-sensitive tasks like medication delivery, which require high navigation efficiency to ensure timely arrivals. However, existing methods tend to overemphasize safety, resulting in overly conservative behaviors and prolonged navigation times, which in turn lead to reduced efficiency. To address this issue, this paper proposes a novel navigation framework that integrates a pedestrian risk map, modeled using asymmetric Gaussian distributions, into B-spline trajectory optimization. Rather than strictly avoiding high-risk regions, the method balances collision risk and trajectory length minimization, leading to more effective navigation. Additionally, multiple planning modes enhance adaptability in complex environments, ensuring both safety and efficiency. Furthermore, kinematic constraints specific to differential-drive robots are incorporated to ensure the feasibility of the generated trajectories. Simulations and real-world experiments validate the proposed method's effectiveness in achieving safe and efficient navigation in dynamic pedestrian environments. The video is available at https://youtu.be/S9qJmXyPEzw.
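An asymmetric Gaussian pedestrian risk term of the kind described above might look as follows, with the forward, lateral, and backward spreads as placeholder values rather than the paper's tuned parameters:

import numpy as np

def asymmetric_gaussian_risk(p, ped_pos, ped_vel, sigma_front=1.2,
                             sigma_side=0.6, sigma_back=0.4):
    """Pedestrian risk at query point p: a Gaussian stretched along the walking
    direction so the region ahead of the person costs more than the region behind.
    """
    d = np.asarray(p, dtype=float) - np.asarray(ped_pos, dtype=float)
    heading = np.arctan2(ped_vel[1], ped_vel[0])
    # Rotate the offset into the pedestrian frame (x: forward, y: lateral).
    c, s = np.cos(-heading), np.sin(-heading)
    lon, lat = c * d[0] - s * d[1], s * d[0] + c * d[1]
    sigma_lon = sigma_front if lon >= 0 else sigma_back
    return np.exp(-(lon ** 2) / (2 * sigma_lon ** 2)
                  - (lat ** 2) / (2 * sigma_side ** 2))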
|
|
13:30-13:35, Paper ThBT7.3 | |
Demonstration-Enhanced Adaptable Multi-Objective Robot Navigation |
|
de Heuvel, Jorge | University of Bonn |
Sethuraman, Tharun | Hochschule Bonn-Rhein-Sieg |
Bennewitz, Maren | University of Bonn |
Keywords: Human-Aware Motion Planning
Abstract: Preference-aligned robot navigation in human environments is typically achieved through learning-based approaches, utilizing user feedback or demonstrations for personalization. However, personal preferences are subject to change and might even be context-dependent. Yet traditional reinforcement learning (RL) approaches with static reward functions often fall short in adapting to evolving user preferences, inevitably reflecting demonstrations once training is completed. This paper introduces a structured framework that combines demonstration-based learning with multi-objective reinforcement learning (MORL). To ensure real-world applicability, our approach allows for dynamic adaptation of the robot navigation policy to changing user preferences without retraining. It fluently modulates the amount of demonstration data reflection and other preference-related objectives. Through rigorous evaluations, including a baseline comparison and sim-to-real transfer on two robots, we demonstrate our framework's capability to adapt to user preferences accurately while achieving high navigational performance in terms of collision avoidance and goal pursuance.
|
|
13:35-13:40, Paper ThBT7.4 | |
STC-TEB: Spatial-Temporally Complete Trajectory Generation Based on Incremental Optimization |
|
Zhu, Zeqing | Nankai University |
Zhang, Qianyi | Nankai University; Shenzhen Institute of Advanced Technology, Ch |
Song, Yinuo | NanKai University |
Yang, Yifan | Nankai University |
Liu, Jingtai | Nankai University |
Keywords: Human-Aware Motion Planning, Motion and Path Planning, Nonholonomic Motion Planning
Abstract: In indoor crowd navigation environments for mobile robots, generating spatial-temporally complete trajectories provides the robot with a wider range of choices, thereby improving navigation robustness and safety. Around this topic, this letter proposes an incremental optimization framework that first searches for complete trajectories in the spatial dimension and then iteratively optimizes them in the spatial-temporal dimension. It benefits from a discount-factor design that gradually increases obstacle velocities, an adaptive determination of the discount-factor step to ensure optimization robustness, and a dragging-force mechanism that adjusts trajectories according to the obstacle velocities at each step. Compared with the baseline algorithms EGO-TEB and Graphic-TEB, simulations and experiments in halls and corridors with pedestrians and static obstacles show that the proposed STC-TEB algorithm achieves the highest scenario success rate, the highest trajectory completion rate, nearly the shortest time to goal, and competitive optimization time.
|
|
13:40-13:45, Paper ThBT7.5 | |
Adapting to Frequent Human Direction Changes in Autonomous Frontal Following Robots |
|
Leisiazar, Sahar | Simon Fraser University |
Razavi Rohani, Seyed Roozbeh | Sharif University of Technology |
Park, Edward J. | Simon Fraser University |
Lim, Angelica | Simon Fraser University |
Chen, Mo | Simon Fraser University |
Keywords: Human-Aware Motion Planning, Motion and Path Planning, Robot Companions
Abstract: This paper addresses the challenge of robot follow-ahead applications where the human behavior is highly variable. We propose a novel approach that does not rely on single human trajectory prediction but instead considers multiple potential future positions of the human, along with their associated probabilities, in the robot’s decision-making process. We trained an LSTM-based model to generate a probability distribution over the human’s future actions. These probabilities, along with different potential actions and future positions, are integrated into the tree expansion of Monte Carlo Tree Search (MCTS). Additionally, a trained Reinforcement Learning (RL) model is used to evaluate the nodes within the tree. By incorporating the likelihood of each possible human action and using the RL model to assess the value of the different trajectories, our approach enables the robot to effectively balance between focusing on the most probable future trajectory and considering all potential trajectories. This methodology enhances the robot's ability to adapt to frequent and unpredictable changes in human direction, improving its ability to navigate in front of the person. The codes and supplementary videos of the experiments are available on the project page: https://saharleisiazar.github.io/follow-ahead-adoption/
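One common way to fold such predicted action probabilities into MCTS is a PUCT-style selection rule; the paper's node evaluation differs (it uses a trained RL value model), so the snippet below is only an illustrative stand-in:

import math

def puct_select(children, prior, c_puct=1.5):
    """Pick a child action balancing its value estimate against the predicted
    probability of the human taking that action.

    children: {action: {"N": visit_count, "Q": mean_value}}
    prior:    {action: probability} from the trained sequence model
    """
    total = sum(c["N"] for c in children.values()) + 1
    def score(a):
        c = children[a]
        return c["Q"] + c_puct * prior[a] * math.sqrt(total) / (1 + c["N"])
    return max(children, key=score)

# Toy usage: the human is most likely to keep walking straight.
children = {"straight": {"N": 10, "Q": 0.4},
            "turn_left": {"N": 2, "Q": 0.1},
            "turn_right": {"N": 1, "Q": 0.3}}
prior = {"straight": 0.7, "turn_left": 0.2, "turn_right": 0.1}
print(puct_select(children, prior))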
|
|
13:45-13:50, Paper ThBT7.6 | |
PMM-Net: Single-Stage Multi-Agent Trajectory Prediction with Patching-Based Embedding and Explicit Modal Modulation |
|
Liu, Huajian | Harbin Institute of Technology |
Dong, Wei | Harbin Institute of Technology |
Fan, Kunpeng | Harbin Institute of Technology |
Wang, Chao | Harbin Institute of Technology |
Gao, Yongzhuo | Harbin Institute of Technology |
Keywords: Human-Aware Motion Planning, Autonomous Agents, Computer Vision for Transportation
Abstract: Analyzing and forecasting trajectories of agents like pedestrians plays a pivotal role for embodied intelligent applications. The inherent indeterminacy of human behavior and complex social interaction among a rich variety of agents make this task more challenging than common time-series forecasting. In this paper, we aim to explore a distinct formulation for multi-agent trajectory prediction framework. Specifically, we proposed a patching-based temporal feature extraction module and a graph-based social feature extraction module, enabling effective feature extraction and cross-scenario generalization. Moreover, we reassess the role of social interaction and present a novel method based on explicit modality modulation to integrate temporal and social features, thereby constructing an efficient single-stage inference pipeline. Results on public benchmark datasets demonstrate the superior performance of our model compared with the state-of-the-art methods. The code is available at: github.com/TIB-K330/pmm-net.
|
|
ThBT8 |
308 |
Human-Robot Collaboration and Teaming 2 |
Regular Session |
|
13:20-13:25, Paper ThBT8.1 | |
From Attraction to Engagement: A Robot-Clerk Collaboration Strategy for Retail Success |
|
Song, Sichao | CyberAgent Inc |
Iwamoto, Takuya | CyberAgent |
Okafuji, Yuki | CyberAgent, Inc |
Baba, Jun | CyberAgent, Inc |
Nakanishi, Junya | Osaka Univ |
Yoshikawa, Yuichiro | Osaka University |
Ishiguro, Hiroshi | Osaka University |
Keywords: Human-Robot Teaming, Design and Human Factors
Abstract: Service robots have demonstrated significant potential in retail environments, with research efforts increasingly focused on designing effective robot-customer interactions. However, the collaboration between service robots and human clerks is crucial for optimizing customer engagement in real-world retail settings. While robots excel at attracting passersby, human clerks bring essential emotional intelligence and deep acting skills necessary for building trust and delivering personalized service. In this paper, we propose a novel robot-clerk collaboration model that leverages the complementary strengths of robots and clerks. Specifically, robots are tasked with attracting passersby and persuading them to interact with the products. Once a customer engages with a product, the robot subtly notifies a human clerk to take over, ensuring a seamless transition into the store for more in-depth product exploration and engagement. To develop this approach, we carried out a pilot study to identify the most effective and practical interaction configurations, followed by a 12-day experiment to evaluate the proposed strategy. The experimental results demonstrate the effectiveness of the robot-clerk collaboration approach. Based on these findings, we present detailed design implications and discuss key challenges to guide future research in this area.
|
|
13:25-13:30, Paper ThBT8.2 | |
Enhancing Context-Aware Human Motion Prediction for Efficient Robot Handovers |
|
Gomez Izquierdo, Gerard | Institut De Robotica Y Informatica Industrial (CSIC-UPC) and Uni |
Laplaza, Javier | Universitat Politècnica De Catalunya |
Sanfeliu, Alberto | Universitat Politècnica De Catalunya |
Garrell, Anais | UPC-CSIC |
Keywords: Human-Robot Collaboration, Human-Robot Teaming
Abstract: Accurate human motion prediction (HMP) is critical for seamless human-robot collaboration, particularly in handover tasks that require real-time adaptability. Despite the high accuracy of state-of-the-art models, their computational complexity limits practical deployment in real-world robotic applications. In this work, we enhance human motion forecasting for handover tasks by leveraging siMLPe [1], a lightweight yet powerful architecture, and introducing key improvements. Our approach, named IntentMotion, incorporates intention-aware conditioning, task-specific loss functions, and a novel intention classifier, significantly improving motion prediction accuracy while maintaining efficiency. Experimental results demonstrate that our method reduces body loss error by over 50%, achieves 200× faster inference, and requires only 3% of the parameters compared to existing state-of-the-art HMP models. These advancements establish our framework as a highly efficient and scalable solution for real-time human-robot interaction.
|
|
13:30-13:35, Paper ThBT8.3 | |
SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-And-Language Navigation |
|
Shi, Xiangyu | The University of Adelaide |
Li, Zerui | Adelaide University |
Lyu, Wenqi | The University of Adelaide |
Xia, Jiatong | The University of Adelaide |
Dayoub, Feras | The University of Adelaide |
Qiao, Yanyuan | The University of Adelaide |
Wu, Qi | University of Adelaide |
Keywords: Human-Robot Collaboration, Vision-Based Navigation
Abstract: Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor to generate waypoints and a navigator to execute movements. However, current waypoint predictors struggle with spatial awareness, while navigators lack historical reasoning and backtracking capabilities, limiting adaptability. We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator. Our predictor employs a stronger vision encoder, masked cross-attention fusion, and an occupancy-aware loss for better waypoint quality. The navigator incorporates history-aware reasoning and adaptive path planning with backtracking, improving robustness. Experiments on R2R-CE and MP3D benchmarks show our method achieves state-of-the-art (SOTA) performance in zero-shot settings, demonstrating competitive results compared to fully supervised methods. Real-world validation on Turtlebot 4 further highlights its adaptability.
|
|
13:35-13:40, Paper ThBT8.4 | |
Evaluating Human-Robot Collaboration through Online Video: Perspective Matters |
|
Tian, Leimin | CSIRO |
Xu, Shiyu | Monash University |
He, Kerry | Monash University |
Love, Rachel | Monash University |
Cosgun, Akansel | Monash University |
Kulic, Dana | Monash University |
Keywords: Human-Robot Collaboration, Gesture, Posture and Facial Expressions, Methods and Tools for Robot System Design
Abstract: Online evaluation is increasingly adopted in robotics research, providing an efficient approach to collect data from large and diverse populations. However, there have been ongoing debates about online studies as a proxy for in-person studies, especially where a participant passively observes video of robot behaviours or interaction. We conduct an online video comparison study (N=178) evaluating three robot handover policies in a collaborative assembly task, namely an adaptive autonomous policy, a non-adaptive scripted policy, and teleoperation. Participants watched three sets of videos in third-person view, each consisting of 9 sequential handovers executing one of the policies. Compared to in-person participants in two previous studies who evaluated handovers as users, online participants were observant of different robot behaviours and human-robot collaboration contexts, with 76.4% and 71.9% recognising the adaptive handovers exhibited by the teleoperated and autonomous robot, respectively. However, as observers, online participants showed more critical subjective perceptions compared to the in-person participants with a user's perspective. They valued efficiency over adaptation with twice more autonomous handovers rated as being too late compared to scripted handovers. Our work highlights the need to consider user contexts when evaluating human-robot collaboration.
|
|
13:40-13:45, Paper ThBT8.5 | |
Action Recognition for Underwater Gesture Communication in Human Diver and Robot Teaming |
|
Zhang, Zi-Hao | University of Florida |
Herrin, Baker | University of Florida |
Guo, Jia | Cornell University |
Penumarti, Aditya | University of Florida |
He, Zilong | Cornell University |
Pulido, Andres Pulido | University of Florida |
Shin, Jaejeong | University of Florida |
Keywords: Human-Robot Teaming, Marine Robotics, Deep Learning for Visual Perception
Abstract: This paper presents a Spatio-Temporal Transformer-based algorithm for underwater diver hand gesture recognition, forming a key component of diver-robot teaming. Existing computer vision-based approaches primarily rely on frame-wise gesture detection, which often fails to capture motion continuity and suffers under degraded underwater visibility. The presented method integrates temporal modeling to (i) improve recognition accuracy by capturing spatio-temporal patterns in hand motion, and (ii) increase robustness in challenging underwater environments by leveraging sequential image data, thereby mitigating the impact of intermittent misclassifications. The system is evaluated using real-world underwater footage, demonstrating high recognition accuracy and robustness to lighting fluctuations and partial occlusions. The results highlight the effectiveness and practicality of the presented method for real-world diver-robot collaboration, establishing a foundation for more reliable and intelligent underwater human-robot collaboration.
|
|
13:45-13:50, Paper ThBT8.6 | |
Using High-Level Patterns to Estimate How Humans Predict a Robot Will Behave |
|
Parekh, Sagar | Virginia Tech |
Bramblett, Lauren | University of Virginia |
Bezzo, Nicola | University of Virginia |
Losey, Dylan | Virginia Tech |
Keywords: Human-Robot Teaming, Human-Robot Collaboration, Representation Learning
Abstract: Humans interacting with robots often form predictions of what the robot will do next. For instance, based on the recent behavior of an autonomous car, a nearby human driver might predict that the car is going to remain in the same lane. It is important for the robot to understand the human's prediction for safe and seamless interaction: e.g., if the autonomous car knows the human thinks it is not merging --- but the autonomous car actually intends to merge --- then the car can adjust its behavior to prevent an accident. Prior works typically assume that humans make precise predictions of robot behavior. However, recent research on human-human prediction suggests the opposite: humans tend to approximate other agents by predicting their high-level behaviors. We apply this finding to develop a second-order theory of mind approach that enables robots to estimate how humans predict they will behave. To extract these high-level predictions directly from data, we embed the recent human and robot trajectories into a discrete latent space. Each element of this latent space captures a different type of behavior (e.g., merging in front of the human, remaining in the same lane) and decodes into a vector field across the state space that is consistent with the underlying behavior type. We hypothesize that our resulting high-level and coarse predictions of robot behavior will correspond to actual human predictions. We provide initial evidence in support of this hypothesis through proof-of-concept simulations, testing our method's predictions against those of real users, and experiments on a real-world interactive driving dataset.
|
|
13:50-13:55, Paper ThBT8.7 | |
Adaptive Safety-Critical Control with Uncertainty Estimation for Human-Robot Collaboration (I) |
|
Zhang, Dianhao | Queen's University Belfast |
Keywords: Physical Human-Robot Interaction, Human-Centered Robotics, Kinematics
Abstract: In advanced manufacturing, strict safety guarantees are required to allow humans and robots to work together in a shared workspace. One of the challenges in this application field is the variety and unpredictability of human behavior, leading to potential dangers for human coworkers. This paper presents a novel control framework by adopting safety-critical control and uncertainty estimation for human-robot collaboration. Additionally, to select the shortest path during collaboration, a novel quadratic penalty method is presented. The innovation of the proposed approach is that the proposed controller will prevent the robot from violating any safety constraints even in cases where humans move accidentally in a collaboration task. This is implemented by the combination of a time-varying integral barrier Lyapunov function (TVIBLF) and an adaptive exponential control barrier function (AECBF) to achieve a flexible mode switch between path tracking and collision avoidance with guaranteed closed-loop system stability. The performance of our approach is demonstrated in simulation studies on a 7-DOF robot manipulator. Additionally, a comparison between the tasks involving static and dynamic targets is provided.
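The barrier-function ingredient can be illustrated with a minimal control-filter sketch: for control-affine dynamics and a single exponential-type barrier constraint, the quadratic program that minimally modifies the nominal command has a closed-form solution. The gain alpha and the dynamics terms below are placeholders, not the paper's TVIBLF/AECBF design:

import numpy as np

def safe_input(u_nom, h, grad_h, f, g, alpha=2.0):
    """Closed-form CBF-QP filter for one exponential barrier constraint.

    Enforces  Lf_h + Lg_h u + alpha * h >= 0  for dynamics x_dot = f(x) + g(x) u,
    changing the nominal command u_nom as little as possible.
    Shapes: u_nom (m,), grad_h (n,), f (n,), g (n, m); assumes Lg_h != 0.
    """
    a = grad_h @ g                   # Lg_h, shape (m,)
    b = grad_h @ f + alpha * h       # Lf_h + alpha * h, scalar
    if a @ u_nom + b >= 0:           # nominal command already satisfies the barrier
        return u_nom
    # Otherwise project u_nom onto the half-space {u : a.u + b >= 0}.
    return u_nom - ((a @ u_nom + b) / (a @ a)) * a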
|
|
ThBT9 |
309 |
Transportation Vision |
Regular Session |
Chair: Zhao, Hao | Tsinghua University |
|
13:20-13:25, Paper ThBT9.1 | |
Delving into Mapping Uncertainty for Mapless Trajectory Prediction |
|
Zhang, Zongzheng | Tsinghua University |
Qiu, Xuchong | Bosch |
Zhang, Boran | Harbin Engineering University |
Zheng, Guantian | Huazhong University of Science and Technology |
Gu, Xunjiang | University of Toronto |
Chi, Guoxuan | Tsinghua University |
Gao, Huan-ang | Tsinghua University |
Wang, LeiChen | Robert Bosch CN |
Liu, Ziming | Bosch Research |
Li, Xinrun | Newcastle University |
Gilitschenski, Igor | University of Toronto |
Li, Hongyang | The University of Hong Kong |
Zhao, Hang | Tsinghua University |
Zhao, Hao | Tsinghua University |
Keywords: Computer Vision for Transportation, Semantic Scene Understanding, Vision-Based Navigation
Abstract: Recent advances in autonomous driving are moving towards mapless approaches, where High-Definition (HD) maps are generated online directly from sensor data, reducing the need for expensive labeling and maintenance. However, the reliability of these online-generated maps remains uncertain. While incorporating map uncertainty into downstream trajectory prediction tasks has shown potential for performance improvements, current strategies provide limited insights into the specific scenarios where this uncertainty is beneficial. In this work, we first analyze the driving scenarios in which mapping uncertainty has the greatest positive impact on trajectory prediction and identify a critical, previously overlooked factor: the agent’s kinematic state. Building on these insights, we propose a novel Proprioceptive Scenario Gating that adaptively integrates map uncertainty into trajectory prediction based on forecasts of the ego vehicle’s future kinematics. This lightweight, self-supervised approach enhances the synergy between online mapping and trajectory prediction, providing interpretability around where uncertainty is advantageous and outperforming previous integration methods. Additionally, we introduce a Covariance-based Map Uncertainty approach that better aligns with map geometry, further improving trajectory prediction. Extensive ablation studies confirm the effectiveness of our approach, achieving up to 23.6% improvement in mapless trajectory prediction performance over the state-of-the-art method using the real-world nuScenes driving dataset. Our code, data, and models are publicly available at https://github.com/Ethan-Zheng136/Map-Uncertainty-for-Trajectory-Prediction.
|
|
13:25-13:30, Paper ThBT9.2 | |
Reusing Attention for One-Stage Lane Topology Understanding |
|
Li, Yang | ETH Zurich |
Zhang, Zongzheng | Tsinghua University |
Qiu, Xuchong | Bosch |
Li, Xinrun | Newcastle University |
Liu, Ziming | Bosch Research |
Wang, LeiChen | Robert Bosch CN |
Li, Ruikai | Beihang University |
Zhu, Zhenxin | Beihang University |
Gao, Huan-ang | Tsinghua University |
Lin, Xiaojian | Tsinghua University |
Cui, Zhiyong | Beihang University |
Zhao, Hang | Tsinghua University |
Zhao, Hao | Tsinghua University |
Keywords: Computer Vision for Transportation, Semantic Scene Understanding, RGB-D Perception
Abstract: Understanding lane topology relationships accurately is critical for safe autonomous driving. However, existing two-stage methods suffer from inefficiencies due to error propagation and increased computational overhead. To address these challenges, we propose a one-stage architecture that simultaneously predicts traffic elements, lane centerlines, and topology relationships, improving both the accuracy and inference speed of lane topology understanding for autonomous driving. Our key innovation lies in reusing intermediate attention resources within distinct transformer decoders. This approach effectively leverages the inherent relational knowledge within the element detection module to enable the modeling of topology relationships among traffic elements and lanes without requiring additional computationally expensive graph networks. Furthermore, we are the first to demonstrate that knowledge can be distilled from models that utilize standard definition (SD) maps to those that operate without SD maps, enabling superior performance even in the absence of SD maps. Extensive experiments on the OpenLane-V2 dataset show that our approach outperforms baseline methods in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning. Our code is available at https://github.com/Yang-Li-2000/one-stage.git.
|
|
13:30-13:35, Paper ThBT9.3 | |
MambaPlace: Text-To-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms |
|
Shang, Tianyi | Fuzhou University |
Zhenyu, Li | Qilu University of Technology (Shandong Academy of Sciences) |
Xu, Pengjie | Shanghai Jiaotong University |
Qiao, Jinwei | Qilu University of Technology |
Keywords: Computer Vision for Transportation, Computer Vision for Automation, Data Sets for Robotic Vision
Abstract: Vision-Language Place Recognition (VLPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLPR directs robot place matching, overcoming the constraint of solely depending on vision. However, general multimodal information integration methods are not well equipped to capture the dynamics of cross-modal interactions, especially in the presence of complex intra-modal and inter-modal correlations. To this end, this paper proposes a novel coarse-to-fine and end-to-end connected cross-modal place recognition framework, called MambaPlace. In the coarse-localization stage, the text description and 3D point cloud are encoded by the pre-trained T5 and instance encoder, respectively. They are then processed using Text-Attention Mamba (TAM) and Point Cloud Multi-Strategy Scanning Mamba (MSSM), with the latter mimicking the eye's focusing mechanism, for data enhancement and alignment. In the subsequent fine-localization stage, the features of the text description and 3D point cloud are cross-modally fused and further enhanced through Cascaded Cross-Attention Mamba (CCAM). Finally, we predict the positional offset from the fused text-point cloud features, achieving the most accurate localization. Extensive experiments show that MambaPlace achieves improved localization accuracy on the KITTI360Pose dataset compared to the state-of-the-art methods. Specifically, as shown in Fig. 1, when ϵ<5, MambaPlace achieves 5% higher test accuracy compared to the existing state-of-the-art. Our code is available at https://github.com/nuozimiaowu/MambaPlace/tree/main.
|
|
13:35-13:40, Paper ThBT9.4 | |
CoMamba: Real-Time Cooperative Perception Unlocked with State Space Models |
|
Li, Jinlong | Cleveland State University |
Liu, Xinyu | Cleveland State University |
Li, Baolu | Cleveland State University |
Xu, Runsheng | UCLA |
Li, Jiachen | University of California, Riverside |
Yu, Hongkai | Cleveland State University |
Tu, Zhengzhong | Texas A&M University |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems, Deep Learning for Visual Perception
Abstract: Cooperative perception systems play a vital role in enhancing the safety and efficiency of vehicular autonomy. Although recent studies have highlighted the efficacy of vehicle-to-everything (V2X) communication techniques in autonomous driving, a significant challenge persists: how to efficiently integrate multiple high-bandwidth features across an expanding network of connected agents such as vehicles and infrastructure. In this paper, we introduce CoMamba, a novel cooperative 3D detection framework designed to leverage state-space models for real-time onboard vehicle perception. Compared to prior state-of-the-art transformer-based models, CoMamba is a more scalable 3D model that uses bidirectional state space models, bypassing the quadratic-complexity pain point of attention mechanisms. Through extensive experimentation on V2X/V2V datasets, CoMamba achieves superior performance compared to existing methods while maintaining real-time processing capabilities. The proposed framework not only enhances object detection accuracy but also significantly reduces processing time, making it a promising solution for next-generation cooperative perception systems in intelligent transportation networks.
|
|
13:40-13:45, Paper ThBT9.5 | |
RALAD: Bridging the Real-To-Sim Domain Gap in Autonomous Driving with Retrieval-Augmented Learning |
|
Zuo, Jiacheng | Soochow University |
Hu, Haibo | City University of Hong Kong |
Zhou, Zikang | City University of Hong Kong |
Cui, Yufei | McGill University |
Liu, Ziquan | Queen Mary University of London |
Wang, Jianping | City University of Hong Kong |
Guan, Nan | City University of Hong Kong |
Wang, Jin | Soochow University |
Xue, Chun Jason | MBZUAI |
Keywords: Computer Vision for Transportation, Continual Learning, Visual Learning
Abstract: As end-to-end autonomous driving continues to advance towards real-world deployment, ensuring the safety of autonomous vehicles (AVs) has become a critical requirement for their commercial viability. It is now essential for AVs to undergo rigorous testing in both real-world and simulated environments before deployment. However, models trained on real-world datasets often struggle to generalize effectively in simulation environments, posing a significant challenge to autonomous driving testing. To address this issue, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap in a cost-effective manner. RALAD consists of three key components: (1) domain adaptation via an enhanced Optimal Transport (OT) method, which retrieves the most similar scenarios between real and simulated environments; (2) feature fusion across similar scenarios, enabling the construction of a feature mapping between real-world and simulated domains; and (3) feature extraction freezing with fine-tuning on the fused features, allowing the model to learn simulation-specific characteristics through feature mapping. We evaluate RALAD on three monocular 3D object detection models from different development stages in autonomous driving. The results demonstrate that our approach significantly improves model accuracy in simulation while maintaining stable performance in real-world scenarios. Additionally, we provide a visualization of scenario similarities between real-world and simulation environments to further illustrate the effectiveness of our method. Our code is available at https://github.com/JiachengZuo/RALAD.git.
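As a rough illustration of the retrieval idea behind component (1), the sketch below matches real and simulated scenario features with plain entropy-regularized optimal transport (Sinkhorn iterations) and picks, for each real scenario, the simulated scenario receiving the most transport mass. The feature dimensions, regularization strength, and retrieval rule are assumptions; the paper's enhanced OT method is not reproduced here.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport (Sinkhorn iterations) between two
    uniform discrete measures, given a pairwise cost matrix (n_real x n_sim)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

def retrieve_similar(real_feats, sim_feats):
    """For each real-scenario feature vector, return the index of the simulated
    scenario that receives the most transport mass (illustrative retrieval rule)."""
    cost = np.linalg.norm(real_feats[:, None, :] - sim_feats[None, :, :], axis=-1)
    plan = sinkhorn_plan(cost)
    return plan.argmax(axis=1)

real = np.random.rand(8, 16)   # hypothetical real-scenario features
sim = np.random.rand(20, 16)   # hypothetical simulated-scenario features
print(retrieve_similar(real, sim))
```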
|
|
13:45-13:50, Paper ThBT9.6 | |
EFFOcc: Learning Efficient Occupancy Networks from Minimal Labels for Autonomous Driving |
|
Shi, Yining | Tsinghua University |
Jiang, Kun | Tsinghua University |
Miao, Jinyu | Tsinghua University |
Wang, Ke | Kargobot.AI |
Qian, Kangan | Tsinghua University |
Wang, Yunlong | Tsinghua University |
Li, Jiusi | Tsinghua University |
Wen, Tuopu | Tsinghua University |
Yang, Mengmeng | Tsinghua University |
Xu, Yiliang | Autel Robotics |
Yang, Diange | Tsinghua University |
Keywords: Computer Vision for Transportation, Visual Learning
Abstract: 3D occupancy prediction (3DOcc) is a rapidly rising and challenging perception task in the field of autonomous driving. Existing 3D occupancy networks (OccNets) are both computationally heavy and label-hungry. In terms of model complexity, OccNets are commonly composed of heavy Conv3D modules or transformers at the voxel level. Moreover, OccNets are supervised with expensive large-scale dense voxel labels. Model and label inefficiencies, caused by excessive network parameters and label annotation requirements, severely hinder the onboard deployment of OccNets. This paper proposes an EFFicient Occupancy learning framework, EFFOcc, that targets minimal network complexity and label requirements while achieving state-of-the-art accuracy. We first propose an efficient fusion-based OccNet that only uses simple 2D operators and improves accuracy to the state-of-the-art on three large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, the fusion-based model with ResNet-18 as the image backbone has 21.35M parameters and achieves 51.49 in terms of mean Intersection over Union (mIoU). Furthermore, we propose a multi-stage occupancy-oriented distillation to efficiently transfer knowledge to vision-only OccNet. Extensive experiments on occupancy benchmarks show state-of-the-art precision for both fusion-based and vision-based OccNets. For the demonstration of learning with limited labels, we achieve 94.38% of the performance (mIoU = 28.38) of a 100% labeled vision OccNet (mIoU = 30.07) using the same OccNet trained with only 40% labeled sequences and distillation from the fusion-based OccNet. Code is available at https://github.com/synsin0/EFFOcc.
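Since the abstract reports results in mean Intersection over Union (mIoU), a minimal sketch of how this metric is typically computed for voxel labels is given below; the class count and ignore index are assumptions, not values from the paper.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean Intersection-over-Union over semantic classes for voxel labels.
    `pred` and `gt` are integer arrays of the same shape."""
    mask = gt != ignore_index
    pred, gt = pred[mask], gt[mask]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 5, size=(32, 32, 8))
gt = np.random.randint(0, 5, size=(32, 32, 8))
print(mean_iou(pred, gt, num_classes=5))
```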
|
|
ThBT10 |
310 |
Visual Servoing and Tracking |
Regular Session |
Co-Chair: Wan, Weiwei | Osaka University |
|
13:20-13:25, Paper ThBT10.1 | |
VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis |
|
Wang, Sen | Technische Universität München |
Cheng, Qing | Technical University of Munich |
Gasperini, Stefano | Technical University of Munich |
Zhang, Wei | University of Stuttgart |
Wu, Shun-Cheng | Technical University of Munich |
Zeller, Niclas | Karlsruhe University of Applied Sciences |
Cremers, Daniel | Technical University of Munich |
Navab, Nassir | TU Munich |
Keywords: Visual Learning, RGB-D Perception, Vision-Based Navigation
Abstract: The generation of high-fidelity view synthesis is essential for robotic navigation and interaction but remains challenging, particularly in indoor environments and real-time scenarios. Existing techniques often require significant computational resources for both training and rendering, and they frequently result in suboptimal 3D representations due to insufficient geometric structuring. To address these limitations, we introduce VoxNeRF, a novel approach that utilizes easy-to-obtain geometry priors to enhance both the quality and efficiency of neural indoor reconstruction and novel view synthesis. We propose an efficient voxel-guided sampling technique that allocates computational resources selectively to the most relevant segments of rays based on a voxel-encoded geometry prior, significantly reducing training and rendering time. Additionally, we incorporate a robust depth loss to improve reconstruction and rendering quality in sparse view settings. Our approach is validated with extensive experiments on ScanNet and ScanNet++ where VoxNeRF outperforms existing state-of-the-art methods and establishes a new benchmark for indoor immersive interpolation and extrapolation settings.
|
|
13:25-13:30, Paper ThBT10.2 | |
Development of a Multifunction Piezoelectric Microscopic Observation System Based on Visual Feedback (I) |
|
Li, Jianxing | Harbin Institute of Technology |
Zhang, Shijing | Harbin Institute of Technology |
Deng, Jie | Harbin Institute of Technology |
Liu, Yingxiang | Harbin Institute of Technology |
Keywords: Visual Servoing, Actuation and Joint Mechanisms, Automation at Micro-Nano Scales
Abstract: The microscopic observation system (MOS) is a basic instrument in the fields of biomedical and micro-nano manipulation. Here, a multifunction piezoelectric microscopic observation system (MPMOS) based on visual feedback is proposed, which provides automatic focusing, cell counting, cell localization, and wide-range scanning. A focusing mechanism combining a PZT stack and a rhombus amplifying mechanism (RAM) is integrated into the MPMOS, with a motion stroke of 359.6 μm and a displacement resolution of 30 nm. The MPMOS adopts an image sharpness evaluation function based on the Laplace operator and an in-focus position searching algorithm that combines coarse-tuning and fine-tuning processes. The focusing time is less than 3.36 s and the focusing success rate is better than 94%. Applications are demonstrated, including automatic focusing and microscopic observation of both transparent and opaque objects, red blood cell counting and localization, wide-range scanning over 350 μm, and observation of objects at different depths. This work provides a new approach for the design of an MOS.
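A minimal sketch of the kind of Laplacian-based sharpness scoring and coarse-to-fine in-focus search described in the abstract is shown below; the `capture(z)` callback, step sizes, and travel range are hypothetical and not taken from the paper.

```python
import cv2
import numpy as np

def sharpness(img_gray):
    """Laplacian-based sharpness score: variance of the Laplacian response."""
    return cv2.Laplacian(img_gray, cv2.CV_64F).var()

def autofocus(capture, z_min, z_max, coarse_step=20.0, fine_step=2.0):
    """Coarse-to-fine in-focus search over a stage position z (micrometres).
    `capture(z)` is a hypothetical callback returning a grayscale frame taken
    with the focusing mechanism at position z."""
    # Coarse pass: scan the full travel range with a large step.
    zs = np.arange(z_min, z_max + coarse_step, coarse_step)
    best_z = max(zs, key=lambda z: sharpness(capture(z)))
    # Fine pass: refine around the best coarse position with a small step.
    zs = np.arange(best_z - coarse_step, best_z + coarse_step + fine_step, fine_step)
    return max(zs, key=lambda z: sharpness(capture(z)))
```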
|
|
13:30-13:35, Paper ThBT10.3 | |
Image-Based Visual-Admittance Control with Prescribed Performance of Manipulators in Feature Space (I) |
|
Wang, Dongrui | Southwest Jiaotong University |
Lin, Jianfei | Southwest Jiaotong University |
Ma, Lei | Southwest Jiaotong University |
Huang, Deqing | Southwest Jiaotong University |
Wu, Yue | Southwest Jiaotong University |
Keywords: Visual Servoing, Compliance and Impedance Control
Abstract: Visual servoing systems provide higher safety and autonomy when dealing with unstructured environments. Visual and force sensing are essential for accurate and reliable robotic manipulation. In this paper, a novel image-based visual-admittance control with prescribed performance for manipulators is presented. The method couples visual and force information in feature space, which avoids control limitations due to inconsistent driver layers and improves system flexibility. Thanks to the prescribed performance and the tan-type barrier Lyapunov function, the image feature error remains within the prescribed bounds under the finite field-of-view constraint, which improves the transient and steady-state performance of the visual servoing system. The proposed image-based visual-admittance control with prescribed performance is experimentally validated. The results show that the control framework improves the transient and steady-state response and the task success rate of the system.
|
|
13:35-13:40, Paper ThBT10.4 | |
Integrating a Pipette into a Robot Manipulator with Uncalibrated Vision and TCP for Liquid Handling (I) |
|
Zhang, Junbo | Osaka University |
Wan, Weiwei | Osaka University |
Tanaka, Nobuyuki | RIKEN |
Fujita, Miki | RIKEN |
Takahashi, Koichi | RIKEN |
Harada, Kensuke | Osaka University |
Keywords: Visual Servoing, Manipulation Planning
Abstract: This paper presents a system integration approach for a 6-DoF (Degree of Freedom) collaborative robot to operate a pipette for liquid dispensing. Its technical development is three-fold. First, we designed an end-effector for holding and triggering manual pipettes. Second, we took advantage of direct teaching to specify global labware poses and planned robotic motion based on them. Third, we leveraged hand-mounted cameras and visual classifiers to predict and correct positioning errors, which allowed precisely attaching pipettes and tips without calibration. Through experiments and analysis, we confirmed that the developed system, especially the planning and visual recognition methods, could help secure high-precision and flexible liquid dispensing. The developed system is suitable for low-frequency, high-repetition biochemical liquid dispensing tasks. We expect it to promote the deployment of collaborative robots for laboratory automation and thus improve the experimental efficiency without significantly customizing a laboratory environment.
|
|
13:40-13:45, Paper ThBT10.5 | |
Shape Visual Servoing of a Cable Suspended between Two Drones |
|
Smolentsev, Lev | Centre Inria De l'Université De Rennes |
Krupa, Alexandre | Inria Centre at Rennes University |
Chaumette, Francois | Inria Center at University of Rennes |
Keywords: Visual Servoing, Aerial Systems: Perception and Autonomy
Abstract: In this paper, we propose a shape visual servoing approach for manipulating a suspended cable attached between two quadrotor drones. A leader-follower control strategy is presented, where a human operator controls the rigid motion of the cable by teleoperating one drone (the leader), while the second drone (the follower) performs a shape visual servoing task to autonomously apply a desired deformation to the cable. The proposed shape visual servoing approach uses an RGB-D camera embedded on the follower drone and has the advantage of relying on a simple geometrical model of the cable that only requires knowledge of its length. At the same time, our control strategy maintains the best visibility of the cable in the camera field of view. A robust image processing pipeline allows detecting and tracking the cable shape in real time from the data provided by the onboard RGB-D camera. Experimental results demonstrate the effectiveness of the proposed visual control approach to shape a flexible cable into a desired shape. In addition, we demonstrate experimentally that such a system can be used to perform an aerial transport task by grasping an object fitted with a hook using the cable, then moving and releasing it at another location.
|
|
13:45-13:50, Paper ThBT10.6 | |
A Multi-Modal Fusion-Based 3D Multi-Object Tracking Framework with Joint Detection |
|
Wang, Xiyang | Chongqing University |
Fu, Chunyun | Chongqing University |
He, Jiawei | Chongqing University |
Huang, Mingguang | Chongqing University |
Meng, Ting | Chongqing University |
Zhang, Siyu | Mach |
Zhou, Hangning | Mach-Drive |
Xu, Ziyao | Moonshot AI |
Zhang, Chi | Mach |
Keywords: Visual Tracking, Computer Vision for Transportation, Sensor Fusion
Abstract: In the classical tracking-by-detection (TBD) paradigm, detection and tracking are conducted separately and sequentially, and data association must be properly performed to achieve satisfactory tracking performance. In this paper, a new end-to-end multi-object tracking framework is proposed, which integrates object detection and multi-object tracking into a single model. The proposed tracking framework eliminates the complex data association process of the classical TBD paradigm and requires no additional training. Secondly, the regression confidence of historical trajectories is investigated, and the possible states of a trajectory (weak object or strong object) in the current frame are predicted. Then, a confidence fusion module is designed to guide non-maximum suppression for trajectories and detections to achieve ordered and robust tracking. Thirdly, by integrating historical trajectory features, the regression performance of the detector is enhanced, which better reflects the occlusion and disappearance patterns of objects in the real world. Lastly, extensive experiments are conducted on the commonly used KITTI and Waymo datasets. The results show that the proposed framework can achieve robust tracking by using only a 2D detector and a 3D detector, and it proves more accurate than many state-of-the-art TBD-based multi-modal tracking methods. The source code of the proposed method is available at https://github.com/wangxiyang2022/YONTD-MOT.
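For context, the sketch below shows plain IoU-based non-maximum suppression, the generic building block that the paper's confidence fusion module guides; the threshold and box format are assumptions, and the trajectory/detection confidence mixing itself is not reproduced.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Plain axis-aligned 2D non-maximum suppression.
    `boxes` is an (N, 4) array of [x1, y1, x2, y2]; returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_thresh]   # drop boxes overlapping the kept one
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # the second box is suppressed by the first
```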
|
|
13:50-13:55, Paper ThBT10.7 | |
Lost & Found: Tracking Changes from Egocentric Observations in 3D Dynamic Scene Graphs |
|
Behrens, Tjark | ETH Zürich |
Zurbrügg, René | ETH Zürich |
Pollefeys, Marc | ETH Zurich |
Bauer, Zuria | ETH Zürich |
Blum, Hermann | Uni Bonn | Lamarr Institute |
Keywords: Visual Tracking, Mapping, Semantic Scene Understanding
Abstract: Recent approaches have successfully focused on the segmentation of static reconstructions, thereby equipping downstream applications with semantic 3D understanding. However, the world in which we live is dynamic, characterized by numerous interactions between the environment and humans or robotic agents. Static semantic maps are unable to capture this information, and the naive solution of rescanning the environment after every change is both costly and ineffective in tracking e.g. objects being stored away in drawers. With Lost & Found we present an approach that addresses this limitation. Based solely on egocentric recordings with corresponding hand position and camera pose estimates, we are able to track the 6DoF poses of the moving object within the detected interaction interval. These changes are applied online to a transformable scene graph that captures object-level relations. Compared to state-of-the-art object pose trackers, our approach is more reliable in handling the challenging egocentric viewpoint and the lack of depth information. It outperforms the second-best approach by 34% and 56% for translational and orientational error, respectively, and produces visibly smoother 6DoF object trajectories. In addition, we illustrate how the acquired interaction information in the dynamic scene graph can be employed in the context of robotic applications that would otherwise be unfeasible: We show how our method allows to command a mobile manipulator through teach & repeat, and how information about prior interaction allows a mobile manipulator to retrieve an object hidden in a drawer. Code, videos and corresponding data are accessible at https://behretj.github.io/LostAndFound/
|
|
13:55-14:00, Paper ThBT10.8 | |
Long-Term Active Object Detection for Service Robots Using Generative Adversarial Imitation Learning with Contextualized Memory Graph (I) |
|
Yang, Ning | Shandong University |
Lu, Fei | Shandong University, China |
Tian, Guohui | Shandong University |
Liu, Jun | The University of Hong Kong |
Keywords: Service Robotics, Imitation Learning, Computer Vision for Automation
Abstract: Active object detection (AOD) is a crucial task in Embodied AI within robotics. Previous works mainly address this challenge through deep reinforcement learning (DRL), which is characterized by prolonged training cycles and model convergence difficulties. Moreover, they often emphasize whether a single AOD task can be completed, overlooking the reality that robots perform long-term AOD tasks. To this end, this paper introduces a new AOD solution utilizing a graph based on Generative Adversarial Imitation Learning (GAIL). A new expert strategy is devised using the active vision dataset benchmark (AVDB), generating high-quality expert trajectories. Meanwhile, a new AOD model based on GAIL is proposed to predict the robot's execution actions. Moreover, a contextualized memory graph (CMG) is constructed, providing partial state information for the GAIL model and enabling the robot to make decisions directly based on a human-like memory function. Experimental validation against existing methods on AVDB demonstrates superior results, achieving an 88.8% action prediction accuracy, reducing the average path length to 12.182 steps, and shortening the single-step action prediction time to 0.133 s. The proposed method is further evaluated in a real-world home scene, affirming its efficacy and generalization capabilities.
|
|
ThBT11 |
311A |
Reinforcement Learning 10 |
Regular Session |
Co-Chair: Su, Housheng | Huazhong University of Science and Technology |
|
13:20-13:25, Paper ThBT11.1 | |
Learning Natural and Robust Hexapod Locomotion Over Complex Terrains Via Motion Priors Based on Deep Reinforcement Learning |
|
Liu, Xin | Shanghai Jiao Tong University |
Wu, Jinze | Shanghai Jiao Tong University |
Li, Yinghui | Shanghai Jiao Tong University |
Qi, Chenkun | Shanghai Jiao Tong University |
Xue, Yufei | Shanghai Jiao Tong University |
Gao, Feng | Shanghai Jiao Tong University |
Keywords: Reinforcement Learning, Legged Robots, Motion Control
Abstract: Multi-legged robots offer enhanced stability to navigate complex terrains with their multiple legs interacting with the environment. However, how to effectively coordinate the multiple legs in a larger action exploration space to generate natural and robust movements is a key issue. In this paper, we introduce a motion prior-based approach, successfully applying deep reinforcement learning algorithms to a real hexapod robot. We generate a dataset of optimized motion priors and train an adversarial discriminator based on the priors to guide the hexapod robot to learn natural gaits. The learned policy is then successfully transferred to a real hexapod robot and demonstrates natural gait patterns and remarkable robustness without visual information in complex terrains. This is the first time that a reinforcement learning controller has been used to achieve complex terrain walking on a real hexapod robot.
|
|
13:25-13:30, Paper ThBT11.2 | |
Impact of Static Friction on Sim2Real in Robotic Reinforcement Learning |
|
Hu, Xiaoyi | Lenovo Research Shanghai |
Sun, Qiao | Shanghai Jiao Tong University |
He, Bailin | Lenovo |
Liu, Haojie | Lenovo Information Technology |
Zhang, Xueyi | Lenovo |
Lu, Chunpeng | LENOVO |
Zhong, Jiangwei | Lenovo Research Shanghai |
Keywords: Reinforcement Learning, Legged Robots, AI-Based Methods
Abstract: In robotic reinforcement learning, the Sim2Real gap remains a critical challenge. However, the impact of static friction on Sim2Real has been underexplored. Conventional domain randomization methods typically exclude static friction from their parameter space. In our robotic reinforcement learning task, such conventional domain randomization approaches resulted in significantly underperforming real-world models. To address this Sim2Real challenge, we employed Actuator Net as an alternative to conventional domain randomization. While this method enabled successful transfer to flat-ground locomotion, it failed on complex terrains such as stairs. To further investigate the physical parameters affecting Sim2Real in robotic joints, we developed a control-theoretic joint model and performed systematic parameter identification. Our analysis revealed unexpectedly high friction-torque ratios in our robotic joints. To mitigate the impact of static friction, we implemented static friction-aware domain randomization for Sim2Real. Recognizing the increased training difficulty introduced by friction modeling, we proposed a simple and novel solution to reduce learning complexity. To validate this approach, we conducted comprehensive Sim2Sim and Sim2Real experiments comparing three methods: conventional domain randomization (without static friction), Actuator Net, and our static friction-aware domain randomization. All experiments utilized the Rapid Motor Adaptation (RMA) algorithm. Results demonstrated that our method achieved superior adaptive capabilities and overall performance.
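A minimal sketch of the sort of joint friction model the abstract alludes to, with static (breakaway), Coulomb, and viscous terms, is given below; all parameter values are hypothetical and the paper's identified control-theoretic joint model is not reproduced.

```python
import numpy as np

def friction_torque(qd, tau_applied, tau_static=0.8, tau_coulomb=0.5,
                    viscous=0.05, qd_eps=1e-3):
    """Illustrative joint friction model (all parameters hypothetical).
    Below a small velocity threshold the joint sticks unless the applied torque
    exceeds the static (breakaway) friction; otherwise Coulomb plus viscous
    friction opposes the motion. Returns the friction torque acting on the joint."""
    if abs(qd) < qd_eps:                                  # stiction regime
        return -np.clip(tau_applied, -tau_static, tau_static)
    return -(np.sign(qd) * tau_coulomb + viscous * qd)    # sliding regime

print(friction_torque(qd=0.0, tau_applied=0.3))   # sticks: friction cancels the command
print(friction_torque(qd=1.0, tau_applied=0.3))   # sliding: Coulomb + viscous drag
```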
|
|
13:30-13:35, Paper ThBT11.3 | |
Multi-UAV-UGV Collision-Free Tracking Control Via Control Barrier Function-Based Reinforcement Learning |
|
Haojie, Xia | Sichuan University |
Qi, Qihan | Sichuan University |
Yang, Xinsong | Sichuan University |
Ju, Xingxing | Sichuan University |
Su, Housheng | Huazhong University of Science and Technology |
Keywords: Reinforcement Learning, Multi-Robot Systems
Abstract: This paper introduces a novel hierarchical control approach for feature matching, real-time tracking, and inter-UAV collision avoidance in multiple unmanned aerial vehicle-unmanned ground vehicle (multi-UAV-UGV) collaborative tracking. Our approach is divided into three layers: optimal feature matching, tracking control by reinforcement learning (RL), and collision avoidance using control barrier functions (CBFs). First, a distance cost matrix is cleverly constructed based on the feature matching capabilities of UAVs and UGVs to determine the optimal matching configuration. It allows UAVs to perform the tracking task while minimizing travel distance. Second, an RL-based tracker is developed to achieve precise real-time tracking without depending on UAV dynamic models. The tracker is trained in a single UAV-UGV environment, which reduces policy convergence difficulty by simplifying the state space and interactions compared with training in complex multi-UAV-UGV scenarios. Third, a collision avoidance mechanism based on CBFs is introduced to transform RL commands into collision-free actions by solving a quadratic programming (QP) problem. Extensive simulations and real-world experiments demonstrate the effectiveness of the proposed approach.
|
|
13:35-13:40, Paper ThBT11.4 | |
RMG: Real-Time Expressive Motion Generation with Self-Collision Avoidance for 6-DOF Companion Robotic Arms |
|
Li, Jiansheng | Hong Kong University of Science and Technology (Guangzhou) |
Song, Haotian | The Hong Kong University of Science and Technology (Guangzhou) |
Li, Haoang | Hong Kong University of Science and Technology (Guangzhou) |
Zhou, Jinni | Hong Kong University of Science and Technology (Guangzhou) |
Nie, Qiang | Hong Kong University of Science and Technology (Guangzhou) |
Cai, Yi | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Robot Companions, Art and Entertainment Robotics, Motion and Path Planning
Abstract: The six-degree-of-freedom (6-DOF) robotic arm has gained widespread application in human-coexisting environments. While previous research has predominantly focused on functional motion generation, the critical aspect of expressive motion in human-robot interaction remains largely unexplored. This paper presents a novel real-time motion generation planner that enhances interactivity by creating expressive robotic motions between arbitrary start and end states within predefined time constraints. Our approach involves three key contributions: first, we develop a mapping algorithm to construct an expressive motion dataset derived from human dance movements; second, we train motion generation models in both Cartesian and joint spaces using this dataset; third, we introduce an optimization algorithm that guarantees smooth, collision-free motion while maintaining the intended expressive style. Experimental results demonstrate the effectiveness of our method, which can generate expressive and generalized motions in under 0.5 seconds while satisfying all specified constraints.
|
|
13:40-13:45, Paper ThBT11.5 | |
Efficient Navigation among Movable Obstacles Using a Mobile Manipulator Via Hierarchical Policy Learning |
|
Yang, Taegeun | Korea Advanced Institute of Science and Technology |
Hwang, Jiwoo | Korea Advanced Institute of Science and Technology |
Jeong, Jeil | Korea Advanced Institute of Science and Technology |
Yoon, Minsung | Korea Advanced Institute of Science and Technology (KAIST) |
Yoon, Sung-eui | KAIST |
Keywords: Reinforcement Learning, Mobile Manipulation
Abstract: We propose a hierarchical reinforcement learning (HRL) framework for efficient Navigation Among Movable Obstacles (NAMO) using a mobile manipulator. Our approach combines interaction-based obstacle property estimation with structured pushing strategies, facilitating dynamic manipulation of unforeseen obstacles while adhering to a pre-planned global path. The high-level policy generates pushing commands that consider environmental constraints and path-tracking objectives, while the low-level policy precisely and stably executes these commands through coordinated whole-body movements. Comprehensive simulation-based experiments demonstrate improvements in performing NAMO tasks, including higher success rates, shortened traversed path length, and reduced goal-reaching times, compared to baselines. Additionally, ablation studies assess the efficacy of each component, while a qualitative analysis further validates the accuracy and reliability of the real-time obstacle property estimation.
|
|
13:45-13:50, Paper ThBT11.6 | |
Emergent Cooperative Strategies for Pursuit-Evasion in Cluttered Environments: A Knowledge-Enhanced Multi-Agent Deep Reinforcement Learning Approach |
|
Sun, Yihao | National University of Defense Technology |
Yan, Chao | Nanjing University of Aeronautics and Astronautics |
Zhou, Han | National University of Defense Technology |
Xiang, Xiaojia | National University of Defense Technology |
Jiang, Jie | National University of Defense Technology, College of Intelligen |
Keywords: Reinforcement Learning, Multi-Robot Systems
Abstract: Deep reinforcement learning (DRL) has recently emerged as a promising tool for tackling pursuit-evasion tasks. However, most existing DRL-based pursuit approaches still rely on individual rewards and struggle with complex scenarios. To address these challenges, we propose a knowledge-enhanced DRL approach for multi-agent pursuit-evasion in complex environments. Specifically, the cooperative pursuit problem is modeled as a decentralized partially observable Markov decision process from each pursuer’s perspective, where the team reward function is elaborately designed to encourage collaborative behavior and enhance team coordination. Then, a novel knowledge enhanced multi-agent twin delayed deep deterministic policy gradient (KE-MATD3) algorithm is presented to efficiently learn the cooperative pursuit policy. By integrating a knowledge enhancement mechanism that extracts effective information from an improved artificial potential field method, the cooperative pursuit policy achieves more robust convergence, mitigating the local optima that typically arise from individual reward-based learning. Finally, extensive numerical simulations and real-world experiments validate the efficiency and superiority of the proposed approach, demonstrating emergent cooperative behaviors among the pursuers.
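For illustration, a classic artificial-potential-field direction computation of the kind the knowledge-enhancement mechanism draws on is sketched below; the gains and influence radius are placeholders, and the paper's improved APF formulation is not reproduced.

```python
import numpy as np

def apf_action(p_pursuer, p_evader, p_obstacles, k_att=1.0, k_rep=0.5, d0=1.5):
    """Classic attractive/repulsive artificial-potential-field direction for a
    pursuer: attraction toward the evader, repulsion from nearby obstacles.
    Gains k_att, k_rep and influence radius d0 are illustrative values."""
    force = k_att * (p_evader - p_pursuer)                      # attractive term
    for p_obs in p_obstacles:
        diff = p_pursuer - p_obs
        d = np.linalg.norm(diff) + 1e-9
        if d < d0:                                              # within influence radius
            force += k_rep * (1.0 / d - 1.0 / d0) / d ** 2 * (diff / d)
    norm = np.linalg.norm(force)
    return force / norm if norm > 1e-9 else force               # unit direction

direction = apf_action(np.array([0.0, 0.0]),
                       np.array([3.0, 0.0]),
                       [np.array([1.0, 0.2])])
print(direction)
```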
|
|
13:50-13:55, Paper ThBT11.7 | |
MarineGym: A High-Performance Reinforcement Learning Platform for Underwater Robotics |
|
Chu, Shuguang | Zhejiang University |
Huang, Zebin | Edinburgh Centre for Robotics |
Li, Yutong | The Hong Kong University of Science and Technology |
Lin, Mingwei | Zhejiang University |
Li, Dejun | Zhejiang University |
Carlucho, Ignacio | Heriot-Watt University |
Petillot, Yvan R. | Heriot-Watt University |
Yang, Canjun | Zhejiang University |
Keywords: Marine Robotics, Reinforcement Learning
Abstract: This work presents MarineGym, a high-performance reinforcement learning (RL) platform specifically designed for underwater robotics. It aims to address the limitations of existing underwater simulation environments in terms of RL compatibility, training efficiency, and standardized benchmarking. MarineGym integrates a proposed GPU-accelerated hydrodynamic plugin based on Isaac Sim, achieving a rollout speed of 250,000 frames per second on a single NVIDIA RTX 3060 GPU. It also provides five models of unmanned underwater vehicles (UUVs), multiple propulsion systems, and a set of predefined tasks covering core underwater control challenges. Additionally, a domain randomization (DR) toolkit allows flexible adjustment of simulation and task parameters during training to improve Sim2Real transfer. Further benchmark experiments demonstrate that MarineGym improves training efficiency over existing platforms and supports robust policy adaptation under various perturbations. We expect this platform to drive further advancements in RL research for underwater robotics. For more details about MarineGym and its applications, please visit our project page: https://marine-gym.com/.
|
|
ThBT12 |
311B |
Vision-Based Navigation 2 |
Regular Session |
Co-Chair: Yu, Shumei | Soochow University |
|
13:20-13:25, Paper ThBT12.1 | |
FalconGym: A Photorealistic Simulation Framework for Zero-Shot Sim-To-Real Vision-Based Quadrotor Navigation |
|
Miao, Yan | University of Illinois at Urbana-Champaign |
Shen, Will | University of Illinois Urbana-Champaign |
Mitra, Sayan | University of Illinois Urbana-Champaign |
Keywords: Vision-Based Navigation, Aerial Systems: Perception and Autonomy
Abstract: We present a novel framework demonstrating zero-shot sim-to-real transfer of visual control policies learned in a Neural Radiance Field (NeRF) environment for quadrotors to fly through racing gates. Robust transfer from simulation to real flight poses a major challenge, as standard simulators often lack sufficient visual fidelity. To address this, we construct a photorealistic simulation environment of quadrotor racing tracks, called FalconGym, which provides effectively unlimited synthetic images for training. Within FalconGym, we develop a pipelined approach for crossing gates that combines (i) a Neural Pose Estimator (NPE) coupled with a Kalman filter to reliably infer quadrotor poses from single-frame RGB images and IMU data, and (ii) a self-attention-based multi-modal controller that adaptively integrates visual features and pose estimation. This multi-modal design compensates for perception noise and intermittent gate visibility. We train this controller purely in FalconGym with imitation learning and deploy the resulting policy to real hardware with no additional fine-tuning. Simulation experiments on three distinct tracks (circle, U-turn and figure-8) demonstrate that our controller outperforms a vision-only state-of-the-art baseline in both success rate and gate-crossing accuracy. In 30 live hardware flights spanning three tracks and 120 gates, our controller achieves a 95.8% success rate and an average error of just 10cm when flying through 38cm-radius gates.
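As a rough illustration of the pose filtering stage, the sketch below implements a generic constant-velocity Kalman filter over 3D position measurements; the state layout, time step, and noise levels are assumptions rather than FalconGym's actual filter.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for smoothing noisy 3D position
    estimates (a generic sketch, not the paper's exact filter)."""

    def __init__(self, dt=0.02, q=1e-3, r=1e-2):
        self.x = np.zeros(6)                            # state: [position, velocity]
        self.P = np.eye(6)                              # state covariance
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)                 # constant-velocity transition
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # we only measure position
        self.Q = q * np.eye(6)                          # process noise
        self.R = r * np.eye(3)                          # measurement noise

    def step(self, z):
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with a position measurement z (e.g., from a pose estimator).
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]                               # filtered position

kf = ConstantVelocityKF()
for t in range(5):
    noisy = np.array([0.1 * t, 0.0, 1.0]) + 0.01 * np.random.randn(3)
    print(kf.step(noisy))
```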
|
|
13:25-13:30, Paper ThBT12.2 | |
NOLO: Navigate Only Look Once |
|
Zhou, Bohan | Peking University |
Zhang, Zhongbin | Tsinghua University |
Wang, Jiangxing | Peking University |
Lu, Zongqing | Peking University |
Keywords: Vision-Based Navigation, Learning from Demonstration, Reinforcement Learning
Abstract: The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on a novel video navigation setting, where an in-context navigation policy needs to be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses the in-context ability and adapts to new scenes by taking corresponding context videos as input, without finetuning or re-training. To enable learning from videos, we first propose a pseudo action labeling procedure that uses optical flow to recover action labels from egocentric videos. Then, offline reinforcement learning is applied to learn the navigation policy. Through extensive experiments on different scenes in both simulation and the real world, we show that our algorithm outperforms baselines by a large margin, demonstrating the effectiveness of the learned policy in in-context learning.
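A minimal sketch of how optical flow can be turned into coarse pseudo action labels, in the spirit of (but not identical to) the labeling procedure described in the abstract, is given below; the action set, threshold, and sign convention are assumptions.

```python
import cv2
import numpy as np

def pseudo_action(frame_prev, frame_next, turn_thresh=0.5):
    """Derive a coarse pseudo action label from dense optical flow between two
    egocentric frames: dominant horizontal flow suggests a rotation, otherwise
    the agent is assumed to move forward. The label names and the sign
    convention (scene flowing right => leftward rotation) are assumptions."""
    g0 = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mean_dx = float(flow[..., 0].mean())      # mean horizontal displacement (pixels)
    if mean_dx > turn_thresh:
        return "turn_left"
    if mean_dx < -turn_thresh:
        return "turn_right"
    return "move_forward"

f0 = np.random.randint(0, 255, (120, 160, 3), dtype=np.uint8)
f1 = np.roll(f0, 4, axis=1)                   # synthetic frame: scene shifted right
print(pseudo_action(f0, f1))
```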
|
|
13:30-13:35, Paper ThBT12.3 | |
LGNav: Zero-Shot Object Navigation Driven by Language and Pointing Gesture Using Large Vision-Language Models |
|
Zhu, WeiYi | Southeast University |
Liu, Juan | Samsung Electronics(China)R&D Center |
Li, Xinde | Southeast University |
Lv, Zhiwei | Southeast University |
Yang, Zhehan | Southeast University |
Keywords: Vision-Based Navigation, Human Factors and Human-in-the-Loop, AI-Enabled Robotics
Abstract: In human communication, referring to a specific object within an environment often involves the combination of a pointing gesture to indicate the object's direction and linguistic descriptions specifying its name and attributes, thereby enabling precise object identification. Inspired by this natural multimodal interaction, we formalize the zero-shot object navigation driven by language and pointing gesture (LG-ZSON) task, which aims to more closely approximate real-world human-agent communication scenarios. To address this task, we propose LGNav, an open-set, training-free navigation framework. LGNav estimates the pointing gesture direction by extracting human body landmarks and integrates this directional information with depth images to initialize a versatile candidate position map (VCPM). The framework further employs open-vocabulary object detection to identify all potential candidate objects in the environment, projecting them onto the VCPM. Guided by a motion policy derived from the VCPM, LGNav continuously explores the unknown environment, sequentially visits candidate objects, and utilizes a large vision-language model (LVLM) to verify whether each candidate object satisfies the given navigation instruction. Extensive experimental results validate the effectiveness of LGNav, demonstrating its strong performance in the LG-ZSON task. Furthermore, even in the absence of pointing gestures, LGNav achieves competitive results on standard object navigation benchmarks, including the Gibson and HM3D datasets, outperforming a range of strong baseline methods.
|
|
13:35-13:40, Paper ThBT12.4 | |
Assistive Guidance System Based on Online Path Structure Recognition for the Visually Impaired |
|
Lee, Jae-Yeong | ETRI |
Jeong, Yongseop | Electronics and Telecommunications Research Institute |
Choi, Seungmin | ETRI AI Lab, KAIST Future Vehicle Department |
Seo, BeomSu | ETRI |
Keywords: Vision-Based Navigation, Human-Aware Motion Planning, Legged Robots
Abstract: Independent mobility is essential for visually impaired individuals, yet existing mobility aids such as white canes and guide dogs have limitations in accessibility and effectiveness. To address this, we propose an assistive robotic guidance system that replicates key functionalities of guide dogs. Our system operates without pre-built maps and dynamically recognizes road structures using LiDAR, allowing the robot to navigate based on the user’s orientation commands. The system consists of a modular architecture with separate guidance and mobility modules, ensuring adaptability across various environments and robotic platforms. Real-world experiments using both wheeled and quadrupedal robots demonstrate high accuracy in path recognition and effective navigation based on user-provided directional inputs. The results validate the feasibility of our approach in providing accessible and reliable mobility support for the visually impaired.
|
|
13:40-13:45, Paper ThBT12.5 | |
Ray Visual Odometry |
|
Xu, Fanqi | University of Oxford |
Almalioglu, Yasin | The University of Oxford |
Trigoni, Niki | University of Oxford |
Keywords: Vision-Based Navigation, Computer Vision for Automation, Localization
Abstract: Learning-based Visual Odometry (VO) has seen significant advancements over the past decades. However, all existing methods rely on the six degrees of freedom (6-DoF) representation for pose prediction, which is sparse and less conducive to neural network learning. In this work, we introduce a novel dense and distributed representation by modeling VO as ray bundles, referred to as RayVO. This richly parameterized representation is tightly coupled with corresponding spatial features, making it highly effective for neural learning. Additionally, the ray-based approach enables simultaneous prediction of both intrinsic and extrinsic parameters. To prove its effectiveness against the traditional 6-DoF representation, we propose three specialized loss functions for training the ray representation: a ray-based loss, a 6-DoF-based loss, and a hybrid loss. We extensively evaluate RayVO on both indoor and outdoor benchmark datasets and show that it outperforms state-of-the-art VO methods.
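One common dense ray parameterization, shown below purely as an illustrative assumption (the abstract does not specify RayVO's exact choice), assigns each pixel a Plücker ray whose direction and moment jointly encode the camera intrinsics K and extrinsics (rotation R, camera centre o):

```latex
% Each pixel (u, v) is assigned a 6-D Plücker ray r_{uv} = (d_{uv}, m_{uv}),
% where o is the camera centre, K the intrinsic matrix and R the rotation;
% the bundle of rays therefore jointly encodes intrinsics and extrinsics.
\[
  d_{uv} \;=\; \frac{R\,K^{-1}(u, v, 1)^{\top}}{\lVert R\,K^{-1}(u, v, 1)^{\top}\rVert},
  \qquad
  m_{uv} \;=\; o \times d_{uv},
  \qquad
  r_{uv} \;=\; \bigl(d_{uv},\, m_{uv}\bigr).
\]
```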
|
|
13:45-13:50, Paper ThBT12.6 | |
Social-LLaVA: Enhancing Social Robot Navigation through Human-Language Reasoning |
|
Payandeh, Amirreza | George Mason University |
Song, Daeun | George Mason University |
Nazeri, Mohammad | George Mason University |
Liang, Jing | University of Maryland |
Mukherjee, Praneel | Academies of Loudoun |
Raj, Amir Hossain | George Mason University |
Kong, Yangzhe | George Mason University |
Manocha, Dinesh | University of Maryland |
Xiao, Xuesu | George Mason University |
Keywords: Vision-Based Navigation, Human-Centered Robotics, Gesture, Posture and Facial Expressions
Abstract: As mobile robots become increasingly common in human-centric environments, social navigation, i.e., adhering to unwritten social norms rather than merely avoiding pedestrians, has drawn growing attention. Existing methods, from hand-crafted techniques to learning-based approaches, often overlook the nuanced context and scene understanding that humans naturally exhibit. Inspired by studies indicating the critical role of language in cognition and reasoning, we propose a new approach to bridge robot perception and socially aware actions through human-like language reasoning. We introduce Social robot Navigation via Explainable Interactions (SNEI), a human-annotated vision-language dataset comprising over 40K Visual Question Answering (VQA) pairs across 2K unique social scenarios, drawn from diverse, unstructured public spaces. SNEI contains perception, prediction, chain-of-thought reasoning, action, and explanation, thereby allowing robots to interpret social contexts in human language. We fine-tune a Vision-Language Model, Social-LLaVA, on SNEI to demonstrate the potential of language-guided reasoning for high-level navigation tasks. Quantitative and qualitative experimental evaluations demonstrate that Social-LLaVA can outperform state-of-the-art models.
|
|
13:50-13:55, Paper ThBT12.7 | |
SkyVLN: Vision-And-Language Navigation and NMPC Control for UAVs in Urban Environments |
|
Li, Tianshun | The Hong Kong University of Science and Technology (Guangzhou) |
Huai, Tianyi | Hong Kong University of Science and Technology (Guangzhou) |
Zhen, Li | The Hong Kong University of Science and Technology (Guangzhou) |
Gao, Yichun | The Hong Kong University of Science and Technology (Guangzhou) |
Li, Haoang | Hong Kong University of Science and Technology (Guangzhou) |
Zheng, Xinhu | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Vision-Based Navigation, Collision Avoidance, Motion Control
Abstract: Unmanned Aerial Vehicles (UAVs) have emerged as versatile tools across various sectors, driven by their mobility and adaptability. This paper introduces SkyVLN, a novel framework integrating vision-and-language navigation (VLN) with Nonlinear Model Predictive Control (NMPC) to enhance UAV autonomy in complex urban environments. Unlike traditional navigation methods, SkyVLN leverages Large Language Models (LLMs) to interpret natural language instructions and visual observations, enabling UAVs to navigate through dynamic 3D spaces with improved accuracy and robustness. We present a multimodal navigation agent equipped with a fine-grained spatial verbalizer and a history path memory mechanism. These components allow the UAV to disambiguate spatial contexts, handle ambiguous instructions, and backtrack when necessary. The framework also incorporates an NMPC module for dynamic obstacle avoidance, ensuring precise trajectory tracking and collision prevention. To validate our approach, we developed a high-fidelity 3D urban simulation environment using AirSim, featuring realistic imagery and dynamic urban elements. Extensive experiments demonstrate that SkyVLN significantly improves navigation success rates and efficiency, particularly in new and unseen environments.
|
|
13:55-14:00, Paper ThBT12.8 | |
VecNav: Vector Goal Robot Navigation from In-The-Wild Videos |
|
Cao, Ruixiang | Kyoto University |
Yagi, Satoshi | Kyoto University |
Yamamori, Satoshi | Kyoto University |
Morimoto, Jun | Kyoto University |
Keywords: Vision-Based Navigation, Autonomous Vehicle Navigation, Deep Learning Methods
Abstract: We propose VecNav, a novel approach that trains a monocular navigation model through self-supervision using uncalibrated, human-captured videos. These videos, characterized by unknown camera intrinsics and extrinsics, are readily available from video-sharing platforms (e.g., YouTube) and are referred to as "in-the-wild" videos due to their unregulated capture conditions. Our approach involves estimating ground truth trajectories from these videos using monocular visual odometry. We then train a transformer-based diffusion policy that takes a goal specified by a vector and RGB images as input and generates action predictions. Our method leverages a significantly larger and more diverse dataset compared to existing monocular visual navigation approaches. This diversity holds the potential to develop a generalist navigation model capable of guiding various types of robots in unfamiliar environments. We evaluated our method on a differential drive robot, demonstrating its capability to effectively navigate using solely "in-the-wild" videos for training. Our experiments demonstrate that VecNav successfully learned to act based on visual affordances, relying solely on uncalibrated "in-the-wild" data.
|
|
ThBT13 |
311C |
Deep Learning for Visual Perception 10 |
Regular Session |
|
13:20-13:25, Paper ThBT13.1 | |
VLM See, Robot Do: Human Demo Video to Robot Action Plan Via Vision Language Model |
|
Wang, Beichen | New York University |
Zhang, Juexiao | New York University |
Dong, Shuwen | New York University |
Fang, Irving | New York University |
Feng, Chen | New York University |
Keywords: Deep Learning for Visual Perception, AI-Enabled Robotics, Learning from Demonstration
Abstract: Large Vision Language Models (VLMs) have been adopted in robotics for their strong common sense understanding and generalization capabilities. Existing works leverage VLMs for task and motion planning based on language instructions and robot observations. In this work, we explore using VLMs to interpret long-horizon human demonstration videos and generate a sequence of robot task plans in natural language. To achieve this, we propose SeeDo, an agent that integrates a keyframe selection module, a visual prompting module, and a VLM interpreter into a pipeline that enables the VLM to "see" human demonstrations and generate step-by-step plans for robots to "do" them. To evaluate, we curate a benchmark of long-horizon human demonstration videos of pick-and-place tasks in three diverse categories and design comprehensive evaluation metrics. The experiments demonstrate SeeDo's superior performance in generating natural-language subtask plans from long-horizon human demo videos, outperforming state-of-the-art video VLMs. By further integrating SeeDo with low-level action primitive functions and language model programs, we validate SeeDo in both simulated and real-world deployments.
|
|
13:25-13:30, Paper ThBT13.2 | |
EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images |
|
Menon, Rohit | University of Bonn |
Dengler, Nils | University of Bonn |
Pan, Sicong | University of Bonn |
Chenchani, Gokul Krishna Gandhi | Hochschule Bonn-Rhein-Sieg |
Bennewitz, Maren | University of Bonn |
Keywords: Mapping, RGB-D Perception, Deep Learning for Visual Perception
Abstract: For scene understanding in unstructured environments, an accurate and uncertainty-aware metric-semantic mapping is required to enable informed action selection by autonomous systems. Existing mapping methods often suffer from overconfident semantic predictions, and sparse and noisy depth sensing, leading to inconsistent map representations. In this paper, we therefore introduce EvidMTL, a multi-task learning framework that uses evidential heads for depth estimation and semantic segmentation, enabling uncertainty-aware inference from monocular RGB images. To enable uncertainty-calibrated evidential multi-task learning, we propose a novel evidential depth loss function that jointly optimizes the belief strength of the depth prediction in conjunction with evidential segmentation loss. Building on this, we present EvidKimera, an uncertainty-aware semantic surface mapping framework, which uses evidential depth and semantics prediction for improved 3D metric-semantic consistency. We train and evaluate EvidMTL on the NYUDepthV2 and assess its zero-shot performance on ScanNetV2, demonstrating superior uncertainty estimation compared to conventional approaches while maintaining comparable depth estimation and semantic segmentation. In zero-shot mapping tests on ScanNetV2, EvidKimera outperforms Kimera in semantic surface mapping accuracy and consistency by 30%, highlighting the benefits of uncertainty-aware mapping and underscoring its potential for real-world robotic applications.
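For background, evidential depth heads are often trained with the Normal-Inverse-Gamma negative log-likelihood from deep evidential regression, sketched below; this is standard background material and not EvidMTL's novel joint evidential depth loss, whose belief-strength coupling with the segmentation loss is not reproduced here.

```python
import math

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of a Normal-Inverse-Gamma evidential regression
    head (the standard form from deep evidential regression). `gamma` is the
    predicted mean depth; `nu`, `alpha`, `beta` are the evidence parameters."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

# Example with hypothetical predictions for a single pixel.
print(nig_nll(y=2.0, gamma=1.8, nu=1.0, alpha=2.0, beta=0.5))
```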
|
|
13:30-13:35, Paper ThBT13.3 | |
FUSE: Label-Free Image-Event Joint Monocular Depth Estimation Via Frequency-Decoupled Alignment and Degradation-Robust Fusion |
|
Sun, Pihai | Harbin Institute of Technology |
Jiang, Junjun | Harbin Institute of Technology |
Yao, Yuanqi | Harbin Institute of Technology |
Chen, Youyu | Harbin Institute of Technology |
Zhao, Wenbo | Harbin Institute of Technology |
Jiang, Kui | Harbin Institute of Technology |
Liu, Xianming | Harbin Institute of Technology |
Keywords: RGB-D Perception, Deep Learning for Visual Perception
Abstract: Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets lead to poor model generalization, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns. To address this dual challenge, we propose the Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: the Parameter-efficient Self-supervised Transfer framework (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency conflicts through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14.9% and 24.9% improvements in Abs.Rel on the MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE.
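As a toy analogue of the frequency decoupling idea, the sketch below splits an image (or single-channel feature map) into low-frequency structure and high-frequency edges with a Gaussian low-pass filter; the sigma value is arbitrary and the learned FreDFuse module is not reproduced.

```python
import numpy as np
import cv2

def frequency_decouple(img, sigma=3.0):
    """Split an image into a low-frequency (structural) component and a
    high-frequency (edge) residual using a Gaussian low-pass filter.
    This is a hand-crafted stand-in for a learned frequency-decoupled module."""
    low = cv2.GaussianBlur(img, (0, 0), sigma)           # low-pass: coarse structure
    high = img.astype(np.float32) - low.astype(np.float32)  # high-pass: edges/detail
    return low, high

img = np.random.rand(64, 64).astype(np.float32)
low, high = frequency_decouple(img)
print(low.shape, high.shape)
```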
|
|
13:35-13:40, Paper ThBT13.4 | |
BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic and Spatial-Aware 3D Object Detection |
|
Song, Yang | The Hong Kong University of Science and Technology (Guangzhou) |
Wang, Lin | Nanyang Technological University (NTU) |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, Object Detection, Segmentation and Categorization
Abstract: 3D object detection is an important task that has been widely applied in autonomous driving. To perform this task, a new trend is to fuse multi-modal inputs, i.e., LiDAR and camera. Under such a trend, recent methods fuse these two modalities by unifying them in the same 3D space. However, during direct fusion in a unified space, the drawbacks of both modalities (LiDAR features struggle with detailed semantic information and the camera lacks accurate 3D spatial information) are also preserved, diluting semantic and spatial awareness of the final unified representation. To address the issue, this letter proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion that can achieve robust semantic- and spatial-aware 3D object detection. The key insight is to fuse LiDAR and camera features in a bidirectional complementary way to enhance the semantic awareness of the LiDAR and the 3D spatial awareness of the camera. The enhanced features from both modalities are then adaptively fused to build a semantic- and spatial-aware unified representation. Specifically, we introduce Pre-Fusion consisting of a Voxel Enhancement Module (VEM) to enhance the semantic awareness of voxel features from 2D camera features and Image Enhancement Module (IEM) to enhance the 3D spatial awareness of camera features from 3D voxel features. We then introduce Unified Fusion (U-Fusion) to adaptively fuse the enhanced features from the last stage to build a unified representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion against the prior arts. Project page: https://t-ys.github.io/BiCo-Fusion/.
|
|
13:40-13:45, Paper ThBT13.5 | |
VirInteraction: Enhancing Virtual-LiDAR Points Interaction by Using Image Semantics and Density Estimation for 3D Object Detection
|
Zhu, Huming | Xidian University |
Xue, Yiyu | Xidian University |
Dong, Ximiao | Xidian University |
Cheng, Xinyue | Xidian University |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, Computer Vision for Transportation
Abstract: Distant object detection is a difficult problem in LiDAR-based 3D object detection. In recent years, 3D detection of distant objects has improved greatly thanks to fusion methods that combine LiDAR points with virtual points generated by depth completion. However, the inaccuracy of depth completion introduces substantial noise, which significantly reduces detection accuracy. To reduce noise and improve the detection accuracy of distant objects, we propose VirInteraction, a semantic-guided Virtual-LiDAR fusion method that enhances the interaction between virtual points and LiDAR points. Specifically, VirInteraction comprises three new designs: 1) FgVD (Foreground-based adaptive Voxel Denoising), 2) Se-Sampling (Semantic neighboring Sampling), and 3) MDC-Attention (Multi-scale Density-aware Cross Attention). FgVD uses kernel density estimation (KDE) to adaptively denoise the foreground and background voxels. Se-Sampling completes the shape cues of distant objects using bidirectional sampling based on a self-attention mechanism. We build on these two designs and VirConvNet to develop a more robust VirInterNet as our virtual-point-based backbone. Finally, MDC-Attention aggregates image and point features at the feature level according to the density distribution. Extensive experiments on KITTI and nuScenes demonstrate the effectiveness of VirInteraction.
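As a hedged sketch of the KDE-style denoising idea (a generic illustration, not the paper's FgVD module), the snippet below scores virtual points by their local density with a Gaussian kernel density estimate and drops low-density outliers; the keep-quantile threshold is an assumed parameter.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_denoise(points, keep_quantile=0.2):
    """Drop low-density virtual points, which are likely depth-completion noise.
    points: (N, 3) array of xyz coordinates."""
    kde = gaussian_kde(points.T)      # fit a density estimate on all points
    density = kde(points.T)           # evaluate density at each point
    threshold = np.quantile(density, keep_quantile)
    return points[density >= threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cluster = rng.normal(loc=0.0, scale=0.2, size=(200, 3))   # object surface
    noise = rng.uniform(low=-3.0, high=3.0, size=(20, 3))     # stray points
    cleaned = kde_denoise(np.vstack([cluster, noise]))
    print(len(cleaned), "points kept of", 220)
```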
|
|
13:45-13:50, Paper ThBT13.6 | |
A Modern Take on Visual Relationship Reasoning for Grasp Planning |
|
Rabino, Paolo | Politecnico Di Torino |
Tommasi, Tatiana | Politecnico Di Torino |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation, AI-Based Methods
Abstract: Interacting with real-world cluttered scenes poses several challenges to robotic agents that need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically manage simplified scenarios and focus on predicting pairwise object relationships following an initial object detection phase, but often overlook the global context or struggle with handling redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at https://paolotron.github.io/d3g.github.io.
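To make the dependency-graph output concrete, here is a minimal sketch (an assumed interface, not the D3G model itself) that turns a predicted adjacency matrix, where entry (i, j) = 1 means object i must be removed before object j, into a feasible pick sequence via topological sorting.

```python
from collections import deque

def pick_order(adj):
    """adj[i][j] == 1 means object i must be picked before object j
    (e.g. i lies on top of j). Returns a feasible pick sequence."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    ready = deque(j for j in range(n) if indegree[j] == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for j in range(n):
            if adj[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    if len(order) != n:
        raise ValueError("cyclic (inconsistent) relationship predictions")
    return order

# objects: 0 on top of 1, 1 on top of 2 -> pick 0, then 1, then 2
print(pick_order([[0, 1, 0], [0, 0, 1], [0, 0, 0]]))  # [0, 1, 2]
```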
|
|
13:50-13:55, Paper ThBT13.7 | |
Uncertainty-Aware Real-Time Visual Anomaly Detection with Conformal Prediction in Dynamic Indoor Environments |
|
Saboury, Arya | Eastern Mediterranean University |
Uyguroglu, Mustafa Kemal | Eastern Mediterranean University |
Keywords: Deep Learning for Visual Perception, Probability and Statistical Methods
Abstract: This paper presents an efficient visual anomaly detection framework designed for safe autonomous navigation in dynamic indoor environments, such as university hallways. The approach employs an unsupervised deep autoencoder to model regular environmental patterns and detects anomalies as deviations in the embedding space. To enhance reliability and safety, the system integrates conformal prediction, a statistical framework that provides uncertainty quantification with probabilistic guarantees. The proposed solution has been deployed on a real-time robotic platform, demonstrating efficient performance under resource-constrained conditions. Extensive hyperparameter optimization keeps the model adaptable to changes, while rigorous evaluations confirm its effectiveness in anomaly detection. By addressing challenges related to real-time processing and hardware limitations, this work advances the state of the art in autonomous anomaly detection. The probabilistic insights offered by this framework strengthen operational safety and pave the way for future developments, such as richer sensor fusion and advanced learning paradigms. This research highlights the potential of uncertainty-aware deep learning to enhance safety monitoring frameworks, thereby enabling more reliable and intelligent autonomous systems for real-world applications.
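As a minimal sketch of how split conformal prediction can wrap an anomaly score (here assuming the autoencoder's reconstruction or embedding error as the nonconformity score; the paper's exact calibration procedure is not specified in the abstract), a threshold with a 1 - alpha coverage guarantee can be computed from a held-out calibration set:

```python
import numpy as np

def conformal_threshold(calib_scores, alpha=0.05):
    """Split conformal quantile: with probability >= 1 - alpha, a new nominal
    sample's score falls below this threshold (exchangeability assumed)."""
    n = len(calib_scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    return np.sort(calib_scores)[min(k, n) - 1]

def is_anomaly(score, threshold):
    return score > threshold

# calibration scores from nominal (anomaly-free) frames
calib = np.random.default_rng(0).gamma(shape=2.0, scale=1.0, size=500)
tau = conformal_threshold(calib, alpha=0.05)
print(is_anomaly(9.5, tau), is_anomaly(1.2, tau))
```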
|
|
13:55-14:00, Paper ThBT13.8 | |
CUBE360: Learning Cubic Field Representation for Monocular Panoramic Depth Estimation |
|
Chang, Wenjie | University of Science and Technology of China |
Ai, Hao | Hong Kong University of Science and Technology (Guangzhou) |
Zhang, Tianzhu | USTC |
Wang, Lin | Nanyang Technological University (NTU) |
Keywords: Deep Learning for Visual Perception, Range Sensing
Abstract: Panoramic depth estimation presents significant challenges due to the severe distortion caused by equirectangular projection (ERP) and the limited availability of panoramic RGB-D datasets. Inspired by the recent success of neural rendering, we propose a self-supervised method, named CUBE360, that learns a cubic field composed of multiple Multi-Plane Images (MPIs) from a single panoramic image for continuous depth estimation in any view direction. Our CUBE360 employs cubemap projection to transform an ERP image into six faces and extract the MPIs for each, thereby reducing the memory consumption required for MPI processing of high-resolution data. An attention-based blending module is then employed to learn correlations among the MPIs of the cubic faces, constructing a cubic field representation with color and density information at various depth levels. Furthermore, a dual-sampling strategy is introduced to render novel views from the cubic field at both cubic and planar scales. The entire pipeline is trained using photometric loss calculated from rendered views within a self-supervised learning (SSL) approach, enabling training without depth annotations. Experiments on synthetic and real-world datasets demonstrate the superior performance of CUBE360 compared to previous SSL methods.
|
|
ThBT14 |
311D |
Medical Robots and Systems 6 |
Regular Session |
|
13:20-13:25, Paper ThBT14.1 | |
Vine4Spine: A Steerable Tip-Growing Robot with Contact Force Estimation for Navigation in the Spinal Subarachnoid Space |
|
Wu, Zicong | King's College London |
Sadati, S.M.Hadi | King's College London |
Vartholomeos, Panagiotis | University of Thessaly |
Abdelaziz, Mohamed Essam Mohamed Kassem | Imperial College London |
Temelkuran, Burak | Imperial College London |
Petrou, Georgios | Artificial Limbs |
Booth, Thomas | King's College London |
Shapey, Jonathan | King's College London |
Ahmed, Aminul | King's College London |
Bergeles, Christos | King's College London |
Keywords: Medical Robots and Systems, Soft Robot Applications, Force and Tactile Sensing
Abstract: Therapies targeting neurodegenerative diseases via brain ventricles and spinal parenchyma face delivery challenges. Systemic administration is ineffective due to the blood-brain barrier, while direct surgical access, especially for multi-site delivery, is highly invasive. The spinal subarachnoid space offers potential for microcatheter-based delivery, but existing robotic catheter technologies are unsuitable due to spinal anatomy constraints. This paper presents a miniaturised and sensorised steerable eversion-growing robot tailored to navigation of the subarachnoid space of the spine. The property of eversion reduces interaction forces with the anatomy, rendering our approach safer than microcatheters that need to be pushed. Our system is capable of real-time tip force estimation with three degrees of freedom (DoF) using fibre Bragg gratings (FBG). Additionally, it incorporates a microendoscope and a steerable tip, all within a tiny 2 mm outer diameter. The system's navigation, sensing, and imaging capabilities were evaluated using a realistic up-scaled phantom of the subarachnoid space covering the cervical spine, demonstrating interaction forces within the safe range of 2-5 N during phantom navigation. A comparison study of instrument-tissue interactions further supports its clinical relevance, showing a 73.78% decrease in mean absolute forces compared with traditional insertion without the sheath in global measurements.
|
|
13:25-13:30, Paper ThBT14.2 | |
Fast-Adaptive Permanent Magnetic Positioning-Based Navigation Framework for Continuum Robots in Colonoscopic Biopsy (I) |
|
Yao, Shilong | City University of Hong Kong/Southern University of Science And |
Luo, Peiyu | Southern University of Science and Technology |
Liu, Li | Great Bay University |
Yan, Hong | City University of Hong Kong |
Meng, Max Q.-H. | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Soft Robot Applications, Sensor-based Control
Abstract: The potential of continuum robots in medical applications is considerable, due to their flexibility and capacity to navigate complex anatomical environments. This article introduces a novel framework based on the Fast-Adaptive Permanent Magnetic Tracking method, which has been designed with the objective of enhancing the accuracy and autonomy of colonoscopic biopsies. The system incorporates a permanent magnet positioning methodology, enabling the robot to maintain a trajectory tracking root mean square error of less than 4 mm, with magnet speeds up to 150 mm/s and positioning errors under 2 mm. Furthermore, the framework includes an adaptive obstacle avoidance strategy, allowing the robot to navigate around obstacles and adjust its posture in response to dynamic movement. Extensive experimental validations in both simulation and real-world environments demonstrate the system's effectiveness in delivering precise, responsive, and continuous operation. This work represents a significant advancement in autonomous navigation based on permanent magnetic localization techniques, with the potential to enhance the efficacy and safety of robotic-assisted surgeries.
|
|
13:30-13:35, Paper ThBT14.3 | |
A Soft Robot Attachment with Variable Stiffness Effector for Advanced Endoscopic Surgical Tasks |
|
Zhou, Zhangxi | Imperial College London |
Yang, Jianlin | Nanjing University of Aeronautics and Astronautics |
Luo, Mingrui | Institute of Automation, Chinese Academy of Sciences |
Lou, Hanqi | Imperial College London |
Runciman, Mark | Imperial College London |
Mylonas, George | Imperial College London |
Keywords: Medical Robots and Systems, Soft Robot Applications, Tendon/Wire Mechanism
Abstract: This paper presents a soft Cable-Driven Parallel Robot for gastrointestinal surgery. The robot consists of an inflatable scaffold and a hydraulic variable stiffness end-effector and features six degrees of freedom. Experiments involving passage through a colon model, knot tying, and retraction have demonstrated its flexibility and the concept of navigating through the colon in a soft configuration, then increasing rigidity at the lesion site to collaborate with the endoscope in performing surgery. Meanwhile, the robot can sense contact force through hydraulic pressure variations within the end-effector shaft, providing haptic feedback, which reduces the effects of Coulomb friction. When using the robot to calculate pressing forces, it achieved accuracy with a mean error of 0.051 N and a standard deviation (STD) of 0.066 N. For lifting forces, it achieved a mean error of 0.066 N and a STD of 0.083 N. These results demonstrate the potential of the robot for tissue palpation applications.
|
|
13:35-13:40, Paper ThBT14.4 | |
Synchronous Inflation of a Valvuloplasty Balloon Catheter with Heart Rate: In-Vitro Evaluation in Terms of Dilatation Performance |
|
Yao, Junke | King's College London |
Pi, Xinyi | University College London |
Bosi, Giorgia Maria | University College London |
Burriesci, Gaetano | University College London |
Wurdemann, Helge Arne | University College London |
Keywords: Medical Robots and Systems, Soft Sensors and Actuators, Hydraulic/Pneumatic Actuators
Abstract: Balloon aortic valvuloplasty (BAV), a minimally invasive procedure to alleviate aortic valve stenosis, commonly employs rapid ventricular pacing (RVP) for balloon stabilization. However, the repeated and extended operation time associated with this technique poses potential complications. This paper introduces a novel approach to mitigate these concerns by employing a dilatation mechanism that is synchronized with the cardiac frequency, wherein the balloon catheter is fully inflated and deflated to a safe, low volume during the decrement of the ventricular pressure. The synchronized pacing was tested at a heart rate of 60 bpm. To experimentally validate the performance of this new approach, mock aortic roots reproducing different calcification patterns were used to compare the leaflets' mobility after the dilatation test with traditional BAV. Results confirm successful balloon pacing, maintaining low volume before the ventricular pressure increases. The dilatation performance assessment underscores that the proposed methodology resulted in a higher improvement in terms of the transvalvular pressure gradient and opening area. Optimal performance occurs at 60 bpm, yielding a 30.28% gradient decrease and a 21.35% opening area increase. This research represents a notable step forward toward the development of BAV devices capable of autonomous stabilization, eliminating the need for RVP and its related complications. Furthermore, the use of calcified aortic root (AR) phantoms contributes to an enhanced understanding of hemodynamic implications during BAV procedures.
|
|
13:40-13:45, Paper ThBT14.5 | |
Automatic Tissue Traction Using Miniature Force-Sensing Forceps for Minimally Invasive Surgery |
|
Liu, Tangyou | The University of New South Wales |
Wang, Xiaoyi | University of New South Wales |
Katupitiya, Jayantha | The University of New South Wales |
Wang, Jiaole | Harbin Institute of Technology, Shenzhen |
Wu, Liao | University of New South Wales |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Automation in Life Sciences: Biotechnology, Pharmaceutical and Health Care, Minimally Invasive Surgery
Abstract: A common limitation of autonomous tissue manipulation in robotic minimally invasive surgery (MIS) is the absence of force sensing and control at the tool level. Recently, our team has developed miniature force-sensing forceps that can simultaneously measure the grasping and pulling forces during tissue manipulation. Based on this design, here we further present a method to automate tissue traction that comprises grasping and pulling stages. During this process, the grasping and pulling forces can be controlled either separately or simultaneously through force decoupling. The force controller is built upon a static model of tissue manipulation, considering the interaction between the force-sensing forceps and soft tissue. The efficacy of this force control approach is validated through a series of experiments comparing targeted, estimated, and actual reference forces. To verify the feasibility of the proposed method in surgical applications, various tissue resections are conducted on ex vivo tissues employing a dual-arm robotic setup. Finally, we discuss the benefits of multi-force control in tissue traction, evidenced through comparative analyses of various ex vivo tissue experiments.
|
|
13:45-13:50, Paper ThBT14.6 | |
Wireless Powered Capsule Robots with a Wide Locomotion Range and Random Orientation Via Planar Transmitting Coils |
|
Zheng, Tianxiang | Shanghai Jiao Tong University |
Kang, Ning | Nanyang Technological University |
Lee, Christopher H. T. | Nanyang Technological University |
Shao, Lei | Shanghai Jiao Tong University |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Embedded Systems for Robotic and Automation
Abstract: Capsule endoscopy and drug delivery hold great promise but are constrained by power supply limitations. This study introduces a battery-less capsule robot powered by wireless power transfer (WPT), utilizing a phase-controlled 2D planar array operating at 6.78 MHz. This setup provides a stable energy supply for a micro capsule robot in a dynamic 3D space. The robot's receiving coils and on-board circuits are optimized to consistently acquire approximately 1 W of power across various positions and orientations. This enhancement significantly boosts the robot's capabilities, including high-resolution imaging and extended wireless communication. We demonstrate that the capsule can capture and transmit high-resolution images via Wi-Fi and that it operated successfully in an ex-vivo digestive system, supporting its potential for biomedical applications within the gastrointestinal tract. This research also advances WPT technology, paving the way for its use in other miniature biomedical devices and expanding their practical applications.
|
|
13:50-13:55, Paper ThBT14.7 | |
A New Concept for Reconstruction of Volumetric Muscle Loss Injuries Using Spatial Robotic Embedded Bioprinting: A Feasibility Study |
|
Rezayof, Omid | University of Texas at Austin |
Rafiee Javazm, Mohammad | University of Texas at Austin |
Kulkarni, Yash | The University of Texas at Austin |
Kamaraj, Meenakshi | Terasaki Institute for Biomedical Innovation, Los Angeles, Calif |
Tilton, Maryam | University of Texas at Austin |
John, Johnson V. | Terasaki Institute for Biomedical Innovation |
Alambeigi, Farshid | University of Texas at Austin |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Additive Manufacturing
Abstract: In this study, we introduce a new concept for reconstruction of Volumetric Muscle Loss (VML) injuries and propose the spatial robotic embedded bioprinting technique. As opposed to the traditional layer-by-layer printing, we leverage the support-free nature of embedded bioprinting to print spatial and complex structures of fascicles in a fusiform muscle. To demonstrate feasibility of this concept, we first propose our robotic bioprinting framework including a robotic arm integrated with a custom-designed bioprinting injector. Complementary motion planning algorithms uniquely designed for this printing task are further proposed. Moreover, the effect of embedded bioprinting parameters, as well as the supporting bath and injecting materials compatibility on the uniformity and quality of the printed constructs has been analyzed. Finally, we perform a case study by printing a fusiform muscle-shape construct using the proposed concept and algorithms, and evaluate the quality of the printed structure.
|
|
13:55-14:00, Paper ThBT14.8 | |
Real-Time 3D Guidewire Reconstruction from Intraoperative DSA Images for Robot-Assisted Endovascular Interventions |
|
Yao, Tianliang | Tongji University |
Bingrui, Li | University of Birmingham |
Lu, Bo | Soochow University |
Pei, Zhiqiang | University of Shanghai for Science and Technology |
Yuan, Yixuan | Chinese University of Hong Kong |
Qi, Peng | Tongji University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Computer Vision for Medical Robotics
Abstract: Accurate three-dimensional (3D) reconstruction of guidewire shapes is crucial for precise navigation in robot-assisted endovascular interventions. Conventional 2D Digital Subtraction Angiography (DSA) is limited by the absence of depth information, leading to spatial ambiguities that hinder reliable guidewire shape sensing. This paper introduces a novel multimodal framework for real-time 3D guidewire reconstruction, combining preoperative 3D Computed Tomography Angiography (CTA) with intraoperative 2D DSA images. The method utilizes robust feature extraction to address noise and distortion in 2D DSA data, followed by deformable image registration to align the 2D projections with the 3D CTA model. Subsequently, the inverse projection algorithm reconstructs the 3D guidewire shape, providing real-time, accurate spatial information. This framework significantly enhances spatial awareness for robotic-assisted endovascular procedures, effectively bridging the gap between preoperative planning and intraoperative execution. The system demonstrates notable improvements in real-time processing speed, reconstruction accuracy, and computational efficiency. The proposed method achieves a projection error of 1.76±0.08 pixels and a length deviation of 2.93±0.15%, with a frame rate of 39.3±1.5 frames per second (FPS). These advancements have the potential to optimize robotic performance and increase the precision of complex endovascular interventions, ultimately contributing to better clinical outcomes.
|
|
ThBT15 |
206 |
Telerobotics and Teleoperation 2 |
Regular Session |
|
13:20-13:25, Paper ThBT15.1 | |
Improved Free Motion Performance for TDPA-Passivated Position-Force Measured Teleoperation Architectures |
|
Celli, Camilla | Scuola Superiore Sant'Anna |
Porcini, Francesco | PERCRO Laboratory, TeCIP Institute, Sant’Anna School of Advanced |
Bini, Andrea | Scuola Superiore Sant'Anna |
Novelli, Valerio | Scuola Superiore Sant'Anna |
Filippeschi, Alessandro | Scuola Superiore Sant'Anna |
Frisoli, Antonio | TeCIP Institute, Scuola Superiore Sant'Anna |
Keywords: Telerobotics and Teleoperation, Physical Human-Robot Interaction, Haptics and Haptic Interfaces
Abstract: Passivity-based methods are widely used in teleoperation to guarantee stability, especially for the common Position-Force measured (PFm) architecture. Among them, the Time Domain Passivity Approach (TDPA) achieves stability through passivation, generating a dissipation action that degrades the reference signals and thus the performance of the system. Whereas passivation is necessary to ensure stability during contact, in free motion it acts merely as a disturbance, without having a real impact on stability. In fact, during free motion the force feedback is zero, i.e., the teleoperation loop is not closed, and thus a stabilization action is not needed. Therefore, this paper provides a formal demonstration that passivation is not needed during free motion. Accordingly, the paper introduces a new formulation of the TDPA that takes the free-motion condition into account. A one-degree-of-freedom case study is then proposed to provide a simple example that instantiates the formalism and shows the advantages of the proposed method, which achieves an almost total cancellation of the drift during free motion. Finally, the paper discusses the limitations of the method in real-case scenarios, in particular how the inertia of tools mounted after the force sensor can affect the measurements and the perceived behaviour of the system.
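A minimal sketch of a TDPA-style passivity observer/controller with the free-motion bypass suggested above (a generic single-port formulation, not the paper's exact one; the force threshold used to declare free motion is an assumed parameter):

```python
def tdpa_step(force_fb, velocity, energy, dt, free_motion_eps=1e-3):
    """One step of a time-domain passivity observer/controller (PO/PC).
    Returns (damping force to inject, updated energy estimate).
    When the force feedback is ~0 (free motion), no damping is injected,
    so the operator's reference motion is not degraded."""
    energy += force_fb * velocity * dt          # passivity observer
    if abs(force_fb) < free_motion_eps:         # free motion: loop effectively open
        return 0.0, energy
    if energy < 0.0 and abs(velocity) > 1e-6:   # active behaviour detected
        alpha = -energy / (dt * velocity ** 2)  # adaptive damping gain
        energy += alpha * velocity ** 2 * dt    # controller dissipates the deficit
        return -alpha * velocity, energy
    return 0.0, energy
```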
|
|
13:25-13:30, Paper ThBT15.2 | |
MobiExo: GPS-SLAM Fusion for Seamless Indoor-Outdoor Mobile Manipulation with Hand-Foot Coordination |
|
Wang, JianPeng | Chongqing University of Posts and Telecommunications |
Tian, Zhen | Guangdong Laboratory of Artificial Intelligence and Digital Econ |
Chen, Wenlong | Guangdong Laboratory of Artificial Intelligence and Digital Econ |
Yuan, Dian | Chongqing University of Posts and Telecommunications |
Zhou, Zhou | Guangdong Laboratory of Artificial Intelligence and Digital Econ |
Cen, Ming | Chongqing University of Posts and Telecommunications |
Hua, Xia | Shanghai University |
Yu, Fei | Guangming Lab |
Keywords: Telerobotics and Teleoperation, Prosthetics and Exoskeletons, Physical Human-Robot Interaction
Abstract: Teleoperation systems for mobile robots face significant challenges in achieving seamless coordination across dynamic environments. We present MobiExo, a teleoperation system that unlocks seamless indoor-outdoor mobile manipulation. Our approach tackles two fundamental challenges: robust cross-environment localization and intuitive full-body control. A novel self-adaptive federated filter unifies GPS and SLAM, delivering continuous centimeter-level positioning (4.5±0.8 cm indoor, 6.8±1.2 cm outdoor) and eliminating transition errors. Simultaneously, an integrated hand-foot coordination framework translates the operator's natural gait and gestures into fluid robot actions, maintaining remarkable millimeter-level end-effector precision (3.5±0.4 mm) during navigation. Extensive field trials validate our design, demonstrating high task success (96.7% indoor, 94.3% outdoor) and a 5.9x efficiency improvement in multi-location tasks over stationary setups. Code is available at: https://github.com/wangjianpeng200/MobiExo.git
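The federated-filter details are not given in the abstract; as a loose illustration of fusing GPS and SLAM position estimates, the sketch below performs per-axis inverse-variance weighting, with the variances treated as assumed, self-adapted confidence values:

```python
import numpy as np

def fuse_position(p_gps, var_gps, p_slam, var_slam):
    """Inverse-variance fusion of two position estimates (per axis).
    Indoors var_gps grows (weak satellite fix), so SLAM dominates;
    outdoors var_slam grows (sparse features), so GPS dominates."""
    w_gps = 1.0 / var_gps
    w_slam = 1.0 / var_slam
    fused = (w_gps * p_gps + w_slam * p_slam) / (w_gps + w_slam)
    fused_var = 1.0 / (w_gps + w_slam)
    return fused, fused_var

p, v = fuse_position(np.array([10.0, 5.0, 0.0]), 1.0,
                     np.array([10.2, 4.9, 0.1]), 0.04)
print(p, v)
```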
|
|
13:30-13:35, Paper ThBT15.3 | |
New Network Protocol for Supermedia-Enhanced Telerobotics |
|
Liu, Xinyu | The University of Hong Kong |
Song, Zekun | The University of Hong Kong |
Xue, Yuxuan | University of Hong Kong |
Huang, Hongli | Hong Kong Metropolitan University |
Wang, Yichen | The University of Hong Kong |
Roy, Vellaisamy Arul lenus | Hong Kong Metropolitan University |
Xi, Ning | The University of Hong Kong |
Keywords: Telerobotics and Teleoperation, Networked Robots
Abstract: The growing complexity of robotic teleoperation systems necessitates the integration of multiple feedback modalities, including video, audio, force, tactile, and temperature feedback. The concept of supermedia is utilized to describe the aggregation of these feedback streams. By integrating multiple media forms, supermedia can offer a more comprehensive interactive experience for robot teleoperation systems. However, existing transmission protocols struggle to maintain synchronization among these diverse feedback streams, particularly in demanding network environments. In this paper, we present the Tele-Robotic Control Protocol (TRCP), a novel network transmission protocol specifically designed for supermedia-enhanced robotic teleoperation systems. TRCP incorporates an event reference mechanism that coordinates multiple feedback streams based on robot state rather than traditional time-based sampling. It also employs multi-queue management for the independent handling of different feedback types and integrates an adaptive adjustment mechanism that optimizes transmission parameters in response to real-time network conditions. The effectiveness of TRCP is demonstrated through a cross-continental teleoperation experiment between the University of Glasgow and the University of Hong Kong. TRCP achieves superior feedback synchronization and real-time responsiveness, significantly enhancing both task success rates and operator performance.
|
|
13:35-13:40, Paper ThBT15.4 | |
Geometric Retargeting: A Principled, Ultrafast Neural Hand Retargeting Algorithm |
|
Yin, Zhao-Heng | University of California, Berkeley |
Wang, Changhao | University of California, Berkeley |
Pineda, Luis | Meta AI |
Bodduluri, Chaithanya krishna | Meta Platforms |
Wu, Tingfan | Meta AI |
Abbeel, Pieter | UC Berkeley |
Mukadam, Mustafa | Meta |
Keywords: Telerobotics and Teleoperation, Multifingered Hands, Machine Learning for Robot Control
Abstract: We introduce Geometric Retargeting (GeoRT), an ultrafast and principled neural hand retargeting algorithm for teleoperation, developed as part of our recent Dexterity Gen (DexGen) system. GeoRT converts human finger keypoints to robot hand keypoints at 1 kHz, achieving state-of-the-art speed and accuracy with significantly fewer hyperparameters. This high-speed capability enables flexible postprocessing, such as leveraging a foundational controller like DexGen for action correction. GeoRT is trained in an unsupervised manner, eliminating the need for manual annotation of hand pairs. The core of GeoRT lies in novel geometric objective functions that capture the essence of retargeting: preserving motion fidelity, ensuring configuration space (C-space) coverage, maintaining uniform response through high flatness, and preventing self-collisions. This approach is free from intensive test-time optimization, offering a more scalable and practical solution for real-time hand retargeting.
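The exact geometric objectives of GeoRT are not spelled out in the abstract; as a hedged sketch, the loss below combines a motion-fidelity term (robot keypoint displacements should follow scaled human displacements) with a simple self-collision penalty. The scale factor, margin, and weighting are assumed hyperparameters, not values from the paper.

```python
import numpy as np

def retarget_loss(human_kp, robot_kp, scale=1.2, min_dist=0.015):
    """human_kp, robot_kp: (T, K, 3) keypoint trajectories.
    Motion fidelity: frame-to-frame robot motion tracks scaled human motion.
    Self-collision: penalize robot keypoint pairs closer than min_dist."""
    d_h = np.diff(human_kp, axis=0)               # human keypoint motion
    d_r = np.diff(robot_kp, axis=0)               # robot keypoint motion
    fidelity = np.mean(np.sum((d_r - scale * d_h) ** 2, axis=-1))

    # pairwise keypoint distances within each frame
    diff = robot_kp[:, :, None, :] - robot_kp[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    K = robot_kp.shape[1]
    mask = ~np.eye(K, dtype=bool)                 # ignore self pairs
    collision = np.mean(np.clip(min_dist - dist[:, mask], 0.0, None) ** 2)
    return fidelity + 10.0 * collision
```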
|
|
13:40-13:45, Paper ThBT15.5 | |
Diegetic Graphical User Interfaces for Robot Control Via Eye-Gaze |
|
Nunez Sardinha, Emanuel | Bristol Robotics Lab, University of the West of England |
Munera, Marcela | University of the West of England |
Zook, Nancy | University of the West of England |
Western, David | University of Bristol |
Ruiz Garate, Virginia | University of Mondragon |
Keywords: Telerobotics and Teleoperation, Physically Assistive Devices, Virtual Reality and Interfaces
Abstract: Eye-gaze stands out as an intuitive interface for hands-free control of robotic devices due to its brief training time, fast calibration, low invasiveness, and reduced complexity and cost. However, current approaches are limited by available screen space, excessive wait times, frequent context switching, inconsistent gaze tracker accuracy, and the trade-off between feature-richness and usability. This article presents Diegetic Graphical User Interfaces, a novel, intuitive, and computationally inexpensive approach for gaze-controlled interfaces applied to a robotic arm for precision pick-and-place tasks. By using customizable symbols paired with fiducial markers, interactive buttons are defined and embedded into the robot, which users can trigger via gaze. Twenty-one participants completed the Yale-CMU-Berkeley (YCB) Block Pick and Place Protocol, reporting good usability and user experience, while achieving a workload comparable to similar systems. The resulting system is fast to learn, does not restrain the user's head, and mitigates context switching, while demonstrating intuitive, continuous Cartesian control of a robot arm in precision tasks.
|
|
13:45-13:50, Paper ThBT15.6 | |
Haptic Shared Control of a Pair of Microrobots for Telemanipulation Using Constrained Optimization |
|
Raphalen, Léon | Université De Rennes, CNRS |
Ferro, Marco | CNRS |
Misra, Sarthak | University of Twente |
Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Pacchierotti, Claudio | Centre National De La Recherche Scientifique (CNRS) |
Keywords: Telerobotics and Teleoperation, Optimization and Optimal Control, Automation at Micro-Nano Scales
Abstract: Microrobotics implies actuation-related constraints that make safe telemanipulation particularly challenging. We present a haptic shared control system for electromagnetic-based telemanipulation of a pair of microrobots using a constrained optimization framework. Our contributions include: (1) a Quadratic Programming formulation with Control Lyapunov Functions and Control Barrier Functions, for safe and stable navigation in cluttered environments; (2) a shared control architecture, combining a haptic interface and simulation environment, to teleoperate the microrobots and enable micromanipulation capabilities; and (3) haptic shared control strategies offering visuo-haptic cues for task execution. The approach is validated through a user study, highlighting better navigation accuracy, control stability and task efficiency.
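As a minimal, generic illustration of the CBF-constrained quadratic program behind such safety filters (not the paper's exact QP, which also includes Control Lyapunov Function terms and multiple constraints), the single-constraint case admits a closed-form projection of the operator's desired input:

```python
import numpy as np

def cbf_filter(u_des, grad_h, h, alpha=5.0):
    """Closed-form solution of
        min_u ||u - u_des||^2   s.t.   grad_h . u + alpha * h >= 0
    for a single-integrator microrobot with barrier h(x) >= 0 (safe set).
    grad_h: gradient of h at the current state."""
    margin = grad_h @ u_des + alpha * h
    if margin >= 0.0:                   # desired input already safe
        return u_des
    # project onto the constraint boundary
    return u_des - (margin / (grad_h @ grad_h)) * grad_h

# keep a microrobot at least 1 mm from an obstacle at the origin
x = np.array([1.5e-3, 0.0])
h = np.linalg.norm(x) - 1e-3            # barrier: distance minus radius
grad_h = x / np.linalg.norm(x)
print(cbf_filter(np.array([-5e-3, 0.0]), grad_h, h))  # velocity gets clipped
```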
|
|
13:50-13:55, Paper ThBT15.7 | |
Model-Mediated Teleoperation with 3D Dynamic Environment Tracking (MMT-DET): A Comparative Study of Task Performance with Time-Domain Passivity Control |
|
Fernandez Prado, Diego | Technical University of Munich / School of Computation, Informat |
Chen, Xiao | Technical University of Munich |
Elsner, Jean | Technical University of Munich |
Sadeghian, Hamid | Technical University of Munich |
Rajaei, Nader | Technical University of Munich |
Naceri, Abdeldjallil | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Steinbach, Eckehard | Technical University of Munich |
Keywords: Telerobotics and Teleoperation, Multi-Modal Perception for HRI, Simulation and Animation
Abstract: Teleoperation with haptic feedback allows users to interact with remote environments while retaining a sense of touch. However, the stability and transparency of these systems are compromised under communication network delay. This paper presents an augmented Model-Mediated Teleoperation with 3D object and dynamic environment tracking (MMT-DET) by a vision-based algorithm, enabling users to receive haptic feedback in structured dynamic environments while maintaining robustness against network delays. A user study comparing the proposed method with teleoperation using the Time Domain Passivity Approach (TDPA) was conducted. The results demonstrate that our MMT-DET exhibits robustness to varying delays in task performance and outperforms TDPA at higher delay levels.
|
|
ThBT16 |
207 |
Task and Motion Planning 2 |
Regular Session |
|
13:20-13:25, Paper ThBT16.1 | |
Aerobatic Maneuver Planning for Tilt-Rotor UAVs Based on Multi-Modal Consistent Dynamic Model |
|
Wang, Hongpeng | Nankai University |
Zhang, Qinghao | Nankai University |
Zong, Jianping | Nankai University |
Duan, Zhiwen | Nankai University |
Deng, Kun | Nankai University |
Sun, Chuanyu | Nankai University |
Han, Jianda | Nankai University |
Keywords: Task and Motion Planning, Dynamics
Abstract: The unique tilt-servo mechanism of the tilt-rotor unmanned aerial vehicle (UAV) facilitates seamless transitions between multi-rotor and fixed-wing modes, enhancing both flexibility and maneuverability. However, traditional modeling methods, which treat each flight mode independently, fail to provide a unified dynamic representation, limiting the accurate description of aerobatic maneuvers during mode transitions. This paper introduces a novel modeling approach based on transient Computational Fluid Dynamics (CFD) to capture the aerodynamics of the transition mode, resulting in a multi-modal, consistent dynamics model. This model simplifies the mathematical representation for specific tilt angles, ensuring compatibility with both multi-rotor and fixed-wing dynamics, and accurately describes aerobatic maneuvers. An autonomous feedback motion planning method, utilizing third-order Bézier curves for angular velocity planning, is applied, along with a modal switching strategy to address the limitations of traditional fixed-wing UAVs. The feasibility of this method was validated through numerical simulations, hardware-in-the-loop simulations, and outdoor flight experiments of a tilt-rotor UAV performing the Cobra maneuver in transition mode.
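The abstract mentions angular-velocity planning with third-order Bézier curves; as a small, generic sketch (the control points and timing below are illustrative, not from the paper), a cubic Bézier profile can be evaluated as:

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a third-order Bezier curve at t in [0, 1].
    Here the control points are angular-velocity samples (rad/s)."""
    t = np.asarray(t)[..., None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# smooth pitch-rate profile for a pull-up segment of an aerobatic maneuver
p0, p1, p2, p3 = (np.array([0.0]), np.array([2.0]),
                  np.array([2.5]), np.array([0.0]))
ts = np.linspace(0.0, 1.0, 5)
print(cubic_bezier(p0, p1, p2, p3, ts).ravel())
```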
|
|
13:25-13:30, Paper ThBT16.2 | |
Reactive Temporal Logic Planning for Safe Human-Robot Interaction |
|
Liu, Xiangcheng | University of Science and Technology of China |
Chen, Ziyang | University of Science and Technology of China |
Tian, Yinxiao | University of Science and Technology of China |
Kan, Zhen | University of Science and Technology of China |
Keywords: Task and Motion Planning, Formal Methods in Robotics and Automation
Abstract: Human-robot interaction plays a critical role in scientific experiments by ensuring efficient and reliable execution of experimental tasks. To achieve successful task completion, robots must adapt in real time to unexpected task variations, external disturbances, and safety constraints. In this work, we propose a reactive task and motion planning framework designed to address these challenges. By formulating interaction tasks using Linear Temporal Logic (LTL), our approach introduces a Planning Decision Tree and an Augmented Planning Decision Tree to dynamically adjust task sequences in response to environmental changes. At the execution layer, we employ a Model Predictive Path Integral controller, which ensures both efficient and safe control. Additionally, the planning interface effectively coordinates the planning and execution layers, ensuring strict adherence to experimental task specifications. The effectiveness of the proposed reactive planning framework is demonstrated through physical experiments using a 7-DoF robot. Project website: https://sites.google.com/view/rtlp-iros/
|
|
13:30-13:35, Paper ThBT16.3 | |
Lifelong Morphology Learning for Deformable Embodied Agents |
|
Wang, Yinsong | North China Electric Power University |
Zhao, Jing | North China Electric Power University |
Yang, Yubo | Shanghai Institute of Technology |
Zhang, Shuyuan | Beijing University of Posts and Telecommunications |
Liu, Huaping | Tsinghua University |
Keywords: Transfer Learning, Evolutionary Robotics
Abstract: A deformable agent can continuously adjust its morphology during training, allowing it to discover more suitable structures and outperform fixed-morphology counterparts in terrain-specific tasks. This adaptability is achieved through a joint optimization process consisting of two stages: the Skeleton Transform stage which modifies the agent's morphology and the Execution stage which optimizes the control policy. However, enabling a deformable agent to continuously learn new policies for different terrains without forgetting previous tasks remains a major challenge. Continuous terrain changes can easily disrupt previously learned strategies, making it difficult to adapt to new tasks while maintaining performance on earlier ones. In this work, we focus on lifelong morphology learning for deformable agents that must adaptively traverse a sequence of diverse terrains. We propose Ske-Ex, a lifelong learning framework where both the Skeleton Transform and Execution stages are designed for lifelong adaptation. Unlike existing methods that optimize only control policies under fixed morphologies, Ske-Ex supports joint adaptation of structure and control, making it better suited for deformable agents. We adopt a regularization-based approach as our lifelong learning strategy, as it avoids the need to store large amounts of prior task data. Experimental results show that Ske-Ex exhibits strong resistance to forgetting and superior generalization, and that the joint optimization of both modules outperforms using either stage alone. Additionally, we introduce a flexible MuJoCo terrain benchmark to facilitate future research on lifelong learning for deformable agents. Our demonstration videos are available at https://johncenavsbatista.github.io/Ske-Ex/
|
|
13:35-13:40, Paper ThBT16.4 | |
Modeling Human-Like Driving Behavior Based on Maximum Entropy Deep Inverse Reinforcement Learning |
|
Shi, Jiamin | Xi'an Jiaotong University |
Zhang, Tangyike | Xi'an Jiaotong University |
Chen, Shitao | Xi'an Jiaotong University |
Zheng, Nanning | Xi'an Jiaotong University |
Xin, Jingmin | Xi'an Jiaotong University |
Keywords: Task and Motion Planning, Imitation Learning, Reinforcement Learning
Abstract: Modeling expert driving behavior is crucial for the successful implementation of human-like autonomous driving. In this paper, we propose a new sampling-based Maximum Entropy Deep Inverse Reinforcement Learning (MEDIRL) framework. It leverages naturalistic human driving data to train the reward model and thus evaluates driving behaviors from the rewards of sampled candidate trajectories. The proposed framework utilizes deep neural networks to learn the feature-reward mapping, which offers superior fitting capabilities compared to traditional linear reward functions. A polynomial trajectory sampler for long-term decision making and a dynamic window trajectory sampler for short-term planning are adopted to simplify the calculation of the partition function in the MEDIRL algorithm. In addition, the proposed framework offers a solution to the probability estimation of driving behaviors by calculating the likelihood of sampled candidate trajectories based on their reward values. Comparative experiments are conducted on the NGSIM US-101 Highway dataset, and the experimental results demonstrate the superiority of the proposed model in personalizing reward functions, as well as the applicability of the proposed method in modeling driving behaviors across various time horizons.
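In sampling-based MaxEnt IRL, the partition function is approximated over the candidate trajectory set, so each candidate's likelihood is a softmax of its learned reward; a minimal numpy sketch follows (the reward network itself is omitted and its outputs are assumed given):

```python
import numpy as np

def trajectory_probs(rewards):
    """Maximum-entropy likelihood of each sampled candidate trajectory,
    with the partition function approximated by the sample set itself.
    rewards: (N,) rewards predicted by the learned reward network."""
    r = np.asarray(rewards, dtype=float)
    r = r - r.max()              # numerical stability
    expr = np.exp(r)
    return expr / expr.sum()

# e.g. five sampled candidate trajectories scored by the reward model
print(trajectory_probs([1.2, 0.3, 0.9, -0.5, 1.1]))
```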
|
|
13:40-13:45, Paper ThBT16.5 | |
Heteroscedastic Bayesian Optimization-Based Dynamic PID Tuning for Accurate and Robust UAV Trajectory Tracking |
|
Gu, Fuqiang | Chongqing University |
Ai, Jiangshan | Chongqing University |
Lu, Xu | Chongqing University |
Long, Xianlei | Chongqing University |
Li, Yan | Macquarie University |
Jiang, Tao | Chongqing University |
Chen, Chao | Chongqing University |
Huidong, Liu | The College of Computer Science, Chongqing University, |
Keywords: Underactuated Robots, Motion Control, Motion and Path Planning
Abstract: Unmanned Aerial Vehicles (UAVs) play an important role in various applications, where precise trajectory tracking is crucial. However, conventional control algorithms for trajectory tracking often exhibit limited performance due to the underactuated, nonlinear, and highly coupled dynamics of quadrotor systems. To address these challenges, we propose HBO-PID, a novel control algorithm that integrates the Heteroscedastic Bayesian Optimization (HBO) framework with the classical PID controller to achieve accurate and robust trajectory tracking. By explicitly modeling input-dependent noise variance, the proposed method can better adapt to dynamic and complex environments, and therefore improve the accuracy and robustness of trajectory tracking. To accelerate the convergence of optimization, we adopt a two-stage optimization strategy that allows us to find the optimal controller parameters more efficiently. Through experiments in both simulation and real-world scenarios, we demonstrate that the proposed method significantly outperforms state-of-the-art (SOTA) methods. Compared to SOTA methods, it improves the position accuracy by 24.7% to 42.9%, and the angular accuracy by 40.9% to 78.4%.
|
|
13:45-13:50, Paper ThBT16.6 | |
Regrasp Maps for Sequential Manipulation Planning |
|
Levit, Svetlana | TU Berlin |
Toussaint, Marc | TU Berlin |
Keywords: Task and Motion Planning, Manipulation Planning
Abstract: We consider manipulation problems in constrained and cluttered settings, which require several regrasps at unknown locations. We propose to inform an optimization-based task and motion planning (TAMP) solver with possible regrasp areas and grasp sequences to speed up the search. Our main idea is to use a state space abstraction, a regrasp map, capturing the combinations of available grasps in different parts of the configuration space, and allowing us to provide the solver with guesses for the mode switches and additional constraints for the object placements. By interleaving the creation of regrasp maps, their adaptation based on failed refinements, and solving TAMP (sub)problems, we are able to provide a robust search method for challenging regrasp manipulation problems.
|
|
13:50-13:55, Paper ThBT16.7 | |
Hitchhiker: A Quadrotor Aggressively Perching on a Moving Inclined Surface Using Compliant Suction Cup Gripper (I) |
|
Liu, Sensen | ShanghaiJiaotong University |
Wang, Zhaoying | Shanghai Jiao Tong University |
Sheng, Xinjun | Shanghai Jiao Tong University |
Dong, Wei | Shanghai Jiao Tong University |
Keywords: Aerial Systems: Mechanics and Control, Task and Motion Planning, Contact Modeling
Abstract: Perching on the surface of moving objects, like vehicles, could extend the flight time and range of quadrotors. Suction cups are usually adopted for surface attachment due to their durability and large adhesive force. To seal on a surface, suction cups must be aligned with it and possess a proper relative tangential velocity. However, a quadrotor's attitude and relative velocity errors become significant when the object surface is moving and inclined. To address this problem, we propose a real-time trajectory planning algorithm. The time-optimal aggressive trajectory is efficiently generated through multimodal search in a dynamic time domain, alleviating velocity errors relative to the moving surface. To further adapt to the residual errors, we design a compliant gripper using self-sealing cups. Multiple cups in different directions are integrated into a wheel-like mechanism to increase the tolerance to attitude errors. The wheel mechanism also eliminates the requirement of matching the attitude and tangential velocity. Extensive tests are conducted to perch on static and moving surfaces at various inclinations. Results demonstrate that our proposed system enables a quadrotor to reliably perch on moving inclined surfaces (up to 1.07 m/s and 90°) with a success rate of 70% or higher. The efficacy of the trajectory planner is also validated. Our gripper has larger adaptability to attitude errors and tangential velocities than conventional suction cup grippers. The success rate increases by 45% in dynamic perches.
|
|
13:55-14:00, Paper ThBT16.8 | |
Successor Features for Transfer in Alternating Markov Games |
|
Amatya, Sunny | ARIZONA State University |
Ren, Yi | Arizona State University |
Xu, Zhe | Arizona State University |
Zhang, Wenlong | Arizona State University |
Keywords: Transfer Learning, Reinforcement Learning, AI-Based Methods
Abstract: This paper explores successor features for knowledge transfer in zero-sum, complete-information, turn-based games. Prior research in single-agent systems has shown that successor features can provide a "jump start" for agents when facing new tasks with varying reward structures. However, knowledge transfer in games typically relies on value and equilibrium transfers, which depend heavily on the similarity between tasks. This reliance can lead to failures when the tasks differ significantly. To address this issue, this paper presents an application of successor features to games and introduces a novel algorithm called Game Generalized Policy Improvement (GGPI), designed to address Markov games in multi-agent reinforcement learning. The proposed algorithm enables the transfer of learned values and policies across games. An upper bound on the transfer errors is derived as a function of the similarity between tasks. Through experiments with a turn-based pursuer-evader game, we demonstrate that the GGPI algorithm can generate high-reward interactions and one-shot policy transfer. When further tested over a wider set of initial conditions, the GGPI algorithm achieves higher success rates with improved path efficiency compared to the baseline algorithms.
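As a compact sketch of the successor-feature transfer mechanism described above (generic generalized policy improvement, not the GGPI algorithm itself), each source policy's Q-values under a new reward weight vector w are obtained from its successor features, and the agent acts greedily over the best of them:

```python
import numpy as np

def gpi_action(psi_list, w):
    """Generalized policy improvement with successor features.
    psi_list: list of (A, d) successor-feature matrices, one per source
              policy, evaluated at the current state (A actions, d features).
    w: (d,) reward weights of the new task, so Q_i(s, a) = psi_i[a] . w."""
    q = np.stack([psi @ w for psi in psi_list])   # (num_policies, A)
    return int(np.argmax(q.max(axis=0)))          # best action across policies

# two source policies, three actions, two reward features
psi_a = np.array([[1.0, 0.0], [0.2, 0.5], [0.0, 1.0]])
psi_b = np.array([[0.0, 1.2], [0.8, 0.1], [0.3, 0.3]])
print(gpi_action([psi_a, psi_b], w=np.array([0.1, 1.0])))
```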
|
|
ThBT17 |
210A |
Field Robots 2 |
Regular Session |
Chair: Gu, Yu | West Virginia University |
|
13:20-13:25, Paper ThBT17.1 | |
Scalable Wing Sailing and Snowboarding Enhance Efficient and Energy-Saving Mobility of Polar Robot (I) |
|
Luo, Yongsheng | Harbin Institute of Technology |
Liu, Gangfeng | Harbin Institute of Technology |
Guo, Lefan | Harbin Institute of Technology |
Zhu, Yanhe | Harbin Institute of Technology |
Zhao, Jie | Harbin Institute of Technology |
Keywords: Field Robots, Kinematics, Mechanism Design
Abstract: The polar regions are representative of the Earth's background environment, and conducting scientific exploration there is of great significance. Although many polar robots have been designed and applied, efficient and pollution-free mobile operation still faces challenges. This article proposes a polar robot that achieves efficient, stable, and pollution-free movement in snowy environments through sails and snowboards. It can use its snowboards for carve- and slip-based steering and plow braking, enhancing its maneuverability. Meanwhile, based on the proposed optimized assistance strategy, it achieves energy-saving control of the sail-assistance mechanism under the constraint of robot movement stability. Finally, a robot prototype was developed, and motion-capability and energy-saving mobility-efficiency tests were carried out. The results show that the steering radius can be reduced by 6% compared to the wheel-structure steering radius and that the efficiency of sail assistance can be increased by 28% compared to operation without sail assistance.
|
|
13:25-13:30, Paper ThBT17.2 | |
T-CBF: Traversability-Based Control Barrier Function to Navigate Vertically Challenging Terrain |
|
Gupta, Manas | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Field Robots, Robot Safety, Wheeled Robots
Abstract: Safety has been of paramount importance in motion planning and control and has been an active area of research in the past few years. Most safety research for mobile robots targets maintaining safety in the sense of collision avoidance. However, safety goes beyond just avoiding collisions, especially when robots have to navigate unstructured, vertically challenging, off-road terrain, where vehicle rollover and immobilization are as critical as collisions. In this work, we introduce a novel Traversability-based Control Barrier Function (T-CBF), in which we use neural Control Barrier Functions (CBFs) to achieve safety beyond collision avoidance on unstructured, vertically challenging terrain by reasoning about new safety aspects in terms of traversability. The neural T-CBF, trained on safe and unsafe observations specific to traversability safety, is then used to generate safe trajectories. Furthermore, we present experimental results in simulation and on a physical Verti-4 Wheeler (V4W) platform, demonstrating that T-CBF can provide traversability safety while reaching the goal position. The T-CBF planner outperforms previously developed planners by 30% in terms of keeping the robot safe and mobile when navigating real-world, vertically challenging terrain.
|
|
13:30-13:35, Paper ThBT17.3 | |
TVFET-VD: Time-Varying Formation Encircling and Tracking Control Based on Visual Detection |
|
Yang, Guang | Tianjin University |
Qi, Juntong | Shanghai University |
Wang, Mingming | Tianjin University |
Huang, Hailong | The Hong Kong Polytechnic University |
Peng, Yan | Shanghai University |
Wu, Chong | EFY Intelligent Control (Tianjin) Technology Co., Ltd |
Ping, Yuan | Tianjin University |
Keywords: Field Robots, Multi-Robot Systems
Abstract: This paper proposes a whole-process method for multiple quadrotors, covering target detection and localization through to encirclement and tracking. The reconnaissance quadrotor performs accurate target detection with a one-stage, convolutional-neural-network-based detector. Then, based on a pinhole camera projection model, the target is localized from 2D pixel coordinates to 3D North-East-Down (NED) world coordinates. Finally, the hunter quadrotors achieve target encirclement and time-varying formation tracking based on consensus theory, and we prove the stability of the time-varying formation tracking control. We built a multi-quadrotor platform composed of one reconnaissance quadrotor and four hunter quadrotors and deployed the method on it to conduct a series of experiments with a minibus as the target. The results indicate that the reconnaissance quadrotor can accurately detect the target with small localization errors in the north and east directions, and that the hunter quadrotors can encircle and track the target in a time-varying formation based on the target information provided by the reconnaissance quadrotor. The experiments demonstrate that the method achieves fast and accurate target encirclement.
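As a simplified, hedged sketch of the pinhole-model localization step (assuming a flat-ground intersection and a known camera pose; the paper's exact geometry may differ), a detected pixel can be back-projected to a ground point in NED coordinates:

```python
import numpy as np

def pixel_to_ned(u, v, K, R_wc, t_wc, ground_d=0.0):
    """Back-project pixel (u, v) onto the ground plane (Down = ground_d).
    K: 3x3 intrinsics; R_wc, t_wc: camera-to-world (NED) rotation/translation."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray in camera frame
    ray_w = R_wc @ ray_cam                                # ray in NED frame
    # intersect with horizontal plane: t_wc[2] + s * ray_w[2] = ground_d
    s = (ground_d - t_wc[2]) / ray_w[2]
    return t_wc + s * ray_w

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R_wc = np.array([[0.0, 0.0, 1.0],     # camera optical axis points North
                 [1.0, 0.0, 0.0],     # camera x -> East
                 [0.0, 1.0, 0.0]])    # camera y -> Down
t_wc = np.array([0.0, 0.0, -20.0])    # quadrotor 20 m above ground (Down = -20)
print(pixel_to_ned(400.0, 300.0, K, R_wc, t_wc))
```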
|
|
13:35-13:40, Paper ThBT17.4 | |
A Novel Effective Loop Gait and Stabilizing Morphology Parameterization in Snake Robots |
|
Tang, Chaoquan | China University of Mining and Technology |
Lu, Jingwen | China University of Mining and Technology |
Sun, Xiaowen | China University of Mining and Technology |
Gao, Erfei | China University of Mining and Technology |
Zhou, Gongbo | China University of Mining and Technology |
Wang, Gang | University of Shanghai for Science and Technology |
Ma, Shugen | Hong Kong University of Science and Technology (Guangzhou) |
Hu, Eryi | Information Institute, Ministry of Emergency Management of the P |
Li, Peng | Harbin Institute of Technology ShenZhen |
Keywords: Field Robots, Robotics in Hazardous Fields, Surveillance Robotic Systems
Abstract: Improving motion speed and efficiency remains a critical challenge in gait control of snake robots. This paper introduces the Loop gait, a novel locomotion gait designed to enhance both the speed and the energy efficiency of snake robots without passive wheels. Compared to the more widely used Crawler gait and S-pedal gait, the Loop gait achieves higher motion speed (1.8 times that of the Crawler gait under the same parameters) and higher motion efficiency (1.6 times that of the Crawler gait under the same parameters) thanks to its more looped body morphology. A static stability model is developed to guide parameter optimization, addressing the potential instability caused by the robot's elevated center of mass. Experiments confirm the Loop gait's exceptional energy efficiency and propulsion, validating the utility of the static stability model in selecting parameters.
|
|
13:40-13:45, Paper ThBT17.5 | |
Design and Development of a GPR-Equipped Robot for Full-Space External Diseases Detection in Drainage Pipelines |
|
Fang, Yuanjin | China University of Mining and Technology-Beijing |
Yang, Feng | China University of Mining and Technology-Beijjing |
Qiao, Xu | China University of Mining and Technology-Beijjing |
Xu, Maoxuan | China University of Mining and Technology-Beijing |
Keywords: Field Robots, Robotics in Hazardous Fields, Software-Hardware Integration for Robot Systems
Abstract: Soil diseases around drainage pipelines are a major factor in road collapse. Robots designed to detect these diseases face multiple challenges, including harsh internal environments, size limitations, difficulties in achieving full external space coverage, and the impact of pose misalignment on disease localization. To address these challenges, this work presents the design and development of a pipeline robot equipped with Ground-Penetrating Radar (GPR), capable of adapting to a pipe diameter range of 500-1000 millimeters and providing comprehensive detection of external space diseases. A radial offset estimation model is introduced, and by integrating multi-sensor data, the robot achieves full-pose perception, overcoming challenges related to angular and positional misalignment during disease localization. Experimental results demonstrate that the robot can achieve a maximum detection speed of up to 0.5 meters per second and is capable of adapting to various field drainage pipeline scenarios, including full water, rough terrain, pose misalignment, and 90-degree bends. Azimuth errors for external disease localization are controlled within 1 degree, and axial displacement errors are controlled within 2 centimeters.
|
|
13:45-13:50, Paper ThBT17.6 | |
Project Yukionna: Fabrication of Ice-Based Robotic Components Via Formative Methods |
|
Wu, Kunlun | ITMO University |
Vlasov, Sergey | ITMO University |
Keywords: Field Robots, Manufacturing, Maintenance and Supply Chains, Wheeled Robots
Abstract: Ice, a naturally abundant resource in polar regions and extraterrestrial environments, has gained significant attention for its potential applications in robotics. However, there is a noticeable gap in existing research concerning the manufacturing processes of ice-based components, particularly those utilizing formative technologies for field deployment. Furthermore, outdoor field validations of robots incorporating ice components remain scarce. To bridge these gaps, this paper introduces "Project Yukionna," which includes developing and validating the Ice Formative Method (IFM). Additionally, the feasibility of three distinct manufacturing approaches: formative manufacturing (FM), subtractive manufacturing (SM), and additive manufacturing (AM), is evaluated using a two-tier Analytic Hierarchy Process (AHP). The paper further pioneers the execution of most production processes and experimental validations in fully outdoor field conditions. Numerous tests were conducted to assess the effectiveness and limitations of the IFM and a 4WD (Four-Wheel Drive) rover equipped with ice-based components. The findings demonstrate the feasibility of ice-based robotic components while highlighting the manufacturing challenges and the inherent constraints of ice as a material.
|
|
13:50-13:55, Paper ThBT17.7 | |
SPADE: Towards Scalable Path Planning Architecture on Actionable Multi-Domain 3D Scene Graphs |
|
Kottayam Viswanathan, Vignesh | Lulea University of Technology |
Patel, Akash | Luleå University of Technology |
Valdes Saucedo, Mario Alberto | Lulea University of Technology |
Satpute, Sumeet | Luleå University of Technology |
Kanellakis, Christoforos | LTU |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Autonomous Agents, Field Robots
Abstract: In this work, we introduce SPADE, a path planning framework designed for autonomous navigation in dynamic environments using 3D scene graphs. SPADE combines hierarchical path planning with local geometric awareness to enable collision-free movement in dynamic scenes. The framework bifurcates the planning problem into two parts: (a) solving the sparse abstract global layer plan and (b) iterative path refinement across denser lower local layers in step with local geometric scene navigation. To ensure efficient extraction of a feasible route in dense multi-domain scene graphs, the framework enforces informed sampling of traversable edges prior to path planning. This removes extraneous information not relevant to path planning and reduces the overall planning complexity over the graph. Existing approaches address the problem of path planning over scene graphs by decoupling hierarchical and geometric path evaluation processes. Specifically, this results in inefficient replanning over the entire scene graph when encountering path obstructions blocking the original route. In contrast, SPADE prioritizes local layer planning coupled with local geometric scene navigation, enabling navigation through dynamic scenes while maintaining efficiency in computing a traversable route. We validate SPADE through extensive simulation experiments and real-world deployment on a quadrupedal robot, demonstrating its efficacy in handling complex and dynamic scenarios.
|
|
13:55-14:00, Paper ThBT17.8 | |
UAV See, UGV Do: Aerial Imagery and Virtual Teach Enabling Zero-Shot Ground Vehicle Repeat |
|
Fisker, Desiree | University of Toronto |
Krawciw, Alec | University of Toronto |
Lilge, Sven | University of Toronto |
Greeff, Melissa | Queen's University |
Barfoot, Timothy | University of Toronto |
Keywords: Localization, Autonomous Vehicle Navigation, Field Robots
Abstract: This paper presents Virtual Teach and Repeat (VirT&R): an extension of the Teach and Repeat (T&R) framework that enables GPS-denied, zero-shot autonomous ground vehicle navigation in untraversed environments. VirT&R leverages aerial imagery captured for a target environment to train a Neural Radiance Field (NeRF) model so that dense point clouds and photo-textured meshes can be extracted. The NeRF mesh is used to create a high-fidelity simulation of the environment for piloting an unmanned ground vehicle (UGV) to virtually define a desired path. The mission can then be executed in the actual target environment by using NeRF-generated point cloud submaps associated along the path and an existing LiDAR Teach and Repeat (LT&R) framework. We benchmark the repeatability of VirT&R on over 12 km of autonomous driving data using physical markings that allow a sim-to-real lateral path-tracking error to be obtained and compared with LT&R. VirT&R achieved measured root mean squared errors (RMSE) of 19.5 cm and 18.4 cm in two different environments, which are slightly less than one tire width (24 cm) on the robot used for testing, and respective maximum errors were 39.4 cm and 47.6 cm. This was done using only the NeRF-derived teach map, demonstrating that VirT&R has similar closed-loop path-tracking performance to LT&R but does not require a human to manually teach the path to the UGV in the actual environment.
|
|
ThBT18 |
210B |
Mapping 2 |
Regular Session |
Co-Chair: Nikolakopoulos, George | Luleå University of Technology |
|
13:20-13:25, Paper ThBT18.1 | |
Real-Time Occupancy Grid Mapping Using RMM on Large-Scale and Unstructured Environments |
|
Li, Xingyu | Northeastern University - China |
Xu, Haoxuan | Northeastern University |
Liu, Xingrui | Northeastern University |
Tan, Zhaotong | Northeastern University |
Keywords: Mapping, Aerial Systems: Perception and Autonomy, Representation Learning
Abstract: Occupancy mapping is crucial for distinguishing between known and unknown regions, which plays a significant role in the autonomous exploration of unmanned aerial vehicles (UAVs). However, the construction of high-quality maps is still a challenge. The challenge stems from two factors: the vast amount of data captured by UAV exploration in large-scale environments creates computing and storage bottlenecks, and sensor noise and obstacle occlusions affect the completeness of the map. To address these issues, this paper applies a lightweight mapping framework based on the Random Mapping Method (RMM) to the challenging task of real-time UAV exploration. This framework employs a linear parametric model, where RMM efficiently maps sensor data into a high-dimensional feature space, enabling rapid learning of occupancy states. We demonstrate that this approach is not only efficient in terms of computation and storage but is also particularly effective at inferring and completing unobserved map regions caused by sensor noise and obstacle occlusions. When the exploration is completed, the global occupancy grid map is stored and implicitly represented with limited parameters. Simulation and real-world experiments are conducted to verify its comprehensive performance compared with typical and state-of-the-art methods.
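For readers unfamiliar with random-mapping occupancy models, the sketch below illustrates the general idea of projecting labelled 3D points into a fixed random feature space and fitting a linear parametric occupancy model; the feature dimension, feature form, and ridge solver are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a random-feature occupancy regressor in the spirit of RMM.
import numpy as np

rng = np.random.default_rng(0)

def random_map(points, W, b):
    """Project 3D points into a high-dimensional feature space with a fixed random basis."""
    return np.cos(points @ W + b)  # (N, D) nonlinear random features

# Fixed random projection: 3D input -> D-dimensional features
D = 256
W = rng.normal(scale=2.0, size=(3, D))
b = rng.uniform(0, 2 * np.pi, size=D)

# Toy training data: labelled points from a sensor sweep (1 = occupied, 0 = free)
pts = rng.uniform(-5, 5, size=(2000, 3))
labels = (np.linalg.norm(pts - np.array([1.0, 1.0, 0.0]), axis=1) < 1.5).astype(float)

# Linear parametric model: closed-form ridge regression on the random features
Phi = random_map(pts, W, b)
lam = 1e-2
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ labels)

# Query the implicit map anywhere, including unobserved cells
query = rng.uniform(-5, 5, size=(5, 3))
occ_prob = np.clip(random_map(query, W, b) @ theta, 0.0, 1.0)
print(occ_prob)
```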
|
|
13:25-13:30, Paper ThBT18.2 | |
Real-Time Spatial-Temporal Traversability Assessment Via Feature-Based Sparse Gaussian Process |
|
Tan, Senming | Huzhou Institute of Zhejiang University |
Hou, Zhenyu | Huzhou Institute of Zhejiang University |
Zhang, Zhihao | Huzhou Institute of Zhejiang University |
Xu, Long | Zhejiang University |
Zhang, Mengke | Zhejiang University |
He, Zhaoqi | Huzhou Institution (HI) of Zhejiang University (ZJU) |
Xu, Chao | Zhejiang University |
Gao, Fei | Zhejiang University |
Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Keywords: Mapping, Collision Avoidance, Motion and Path Planning
Abstract: Terrain analysis is critical for the practical application of ground mobile robots in real-world tasks, especially in outdoor unstructured environments. In this paper, we propose a novel spatial-temporal traversability assessment method, which aims to enable autonomous robots to effectively navigate through complex terrains. Our approach utilizes sparse Gaussian processes (SGP) to extract geometric features (curvature, gradient, elevation, etc.) directly from point cloud scans. These features are then used to construct a high-resolution local traversability map. Then, we design a spatial-temporal Bayesian Gaussian kernel (BGK) inference method to dynamically evaluate traversability scores, integrating historical and real-time data while considering factors such as slope, flatness, gradient, and uncertainty metrics. GPU acceleration is applied in the feature extraction step, and the system achieves real-time performance. Extensive simulation experiments across diverse terrain scenarios demonstrate that our method outperforms SOTA approaches in both accuracy and computational efficiency. Additionally, we develop an autonomous navigation framework integrated with the traversability map and validate it with a differential-drive vehicle in complex outdoor environments. Our code will be open-sourced for further research and development by the community.
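As a rough illustration of kernel-based Bayesian traversability fusion in the spirit of the BGK inference described above, the sketch below accumulates per-cell evidence from scored observations with a compactly supported kernel and a temporal forgetting factor; the kernel form, length scale, and decay value are assumptions rather than the paper's exact formulation.

```python
# Minimal sketch (not the authors' code) of kernel-based Bayesian traversability fusion.
import numpy as np

def sparse_kernel(d, l=1.0):
    """Compactly supported kernel that smoothly decays to zero at distance l."""
    r = np.clip(d / l, 0.0, 1.0)
    return ((2 + np.cos(2 * np.pi * r)) / 3) * (1 - r) + np.sin(2 * np.pi * r) / (2 * np.pi)

def bgk_update(alpha, beta, cell_xy, obs_xy, obs_score, l=1.0, forget=0.95):
    """Update per-cell pseudo-counts (alpha, beta) from scored observations."""
    alpha *= forget; beta *= forget                      # temporal decay of old evidence
    d = np.linalg.norm(cell_xy[:, None, :] - obs_xy[None, :, :], axis=2)
    k = sparse_kernel(d, l)                              # (cells, observations)
    alpha += k @ obs_score                               # evidence for traversable
    beta += k @ (1.0 - obs_score)                        # evidence against
    return alpha, beta

cells = np.stack(np.meshgrid(np.arange(10.), np.arange(10.)), -1).reshape(-1, 2)
alpha = np.ones(len(cells)); beta = np.ones(len(cells))
obs_xy = np.random.uniform(0, 10, size=(200, 2))
obs_score = (obs_xy[:, 0] > 5).astype(float)             # toy: right half is traversable
alpha, beta = bgk_update(alpha, beta, cells, obs_xy, obs_score)
traversability = alpha / (alpha + beta)                  # posterior mean per cell
print(traversability.reshape(10, 10).round(2))
```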
|
|
13:30-13:35, Paper ThBT18.3 | |
CG-3DGS: Complexity-Guided 3D Gaussian Splatting for High-Fidelity Surgical Scene Reconstruction |
|
Yao, Yao | Hefei University of Technology |
Ouyang, Bo | Hefei University of Technology |
Zhao, Cancan | Hefei University of Technology |
Keywords: Mapping, Computer Vision for Medical Robotics, Deep Learning Methods
Abstract: Accurate 3D reconstruction in surgical scenarios is essential for visualizing dynamic tissues with complex anatomical geometries. While 3D Gaussian Splatting (3D-GS) has been explored as an efficient approach to scene modeling, occlusion-induced voids and suboptimal detail optimization have limited its application in surgery. This work introduces a Complexity-Guided 3D Gaussian Splatting (CG-3DGS) framework, in which occlusion regions are globally filled by a state-of-the-art optical flow-based video inpainting method. A frequency–spatial aware refinement (FSAR) mechanism is proposed, allowing spectral signatures and spatial gradients to be jointly analyzed to enhance critical anatomical features (e.g., blood vessels). This mechanism adaptively guides Gaussian densification based on scene-specific anatomical complexity. Experimental results demonstrate that the proposed framework achieves higher reconstruction fidelity while maintaining efficient rendering speeds.
|
|
13:35-13:40, Paper ThBT18.4 | |
Enhancing Lane Segment Perception and Topology Reasoning with Crowdsourcing Trajectory Priors |
|
Jia, Peijin | Tsinghua University |
Luo, Ziang | TsingHua University |
Wen, Tuopu | Tsinghua University |
Yang, Mengmeng | Tsinghua University |
Jiang, Kun | Tsinghua University |
Cui, Le | DiDi Inc |
Yang, Diange | Tsinghua University |
Keywords: Mapping, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: In autonomous driving, recent advances in online mapping provide autonomous vehicles with a comprehensive understanding of driving scenarios. Moreover, incorporating prior information into such perception models is an effective way to ensure robustness and accuracy. However, utilizing diverse sources of prior information still faces three key challenges: acquiring high-quality prior information, aligning priors with online perception, and integrating them efficiently. To address these issues, we investigate prior augmentation from the novel perspective of trajectory priors. In this paper, we first extract crowdsourced trajectory data from the Argoverse2 motion forecasting dataset and encode it into a rasterized heatmap and vectorized instance tokens, and then incorporate this prior information into the online mapping model in different ways. Furthermore, to mitigate the misalignment between priors and online perception, we design a confidence-based fusion module that takes alignment into account during the fusion process. We conduct extensive experiments on the OpenLane-V2 dataset. The results indicate that our method significantly outperforms the current state-of-the-art methods.
|
|
13:40-13:45, Paper ThBT18.5 | |
360Recon: An Accurate Reconstruction Method Based on Depth Fusion from 360 Images |
|
Yan, Zhongmiao | Shanghai Jiao Tong University |
Wu, Qi | Shanghai Jiao Tong University |
Xia, Songpengcheng | Shanghai Jiao Tong University |
Deng, Junyuan | Shanghai Jiao Tong University |
Mu, Xiang | Horizon Robotics |
Jin, Renbiao | Shanghai Jiao Tong University |
Ye, Changchun | PICO |
Pei, Ling | Shanghai Jiao Tong University |
Keywords: Mapping, Deep Learning for Visual Perception, Omnidirectional Vision
Abstract: Accurate 3D reconstruction is crucial for AR and VR applications. Compared with traditional pinhole camera-based methods, 360° image-based reconstruction can achieve higher precision with fewer input images, making it especially effective in low-texture environments. However, the severe distortion resulting from the wide field of view complicates feature extraction and matching, leading to geometric inconsistencies in multi-view reconstruction. To address these challenges, we propose 360Recon, a novel multi-view stereo (MVS) algorithm specifically designed for equirectangular projection (ERP) images. With the proposed spherical feature extraction module mitigating distortion, 360Recon integrates a 3D cost volume with multi-scale ERP features to deliver high-precision scene reconstruction while preserving local geometric consistency. Experimental results demonstrate that 360Recon outperforms existing methods in terms of accuracy, computational efficiency, and generalization capability. The source code will be released at https://github.com/LeonATP/360Recon.
|
|
13:45-13:50, Paper ThBT18.6 | |
SN-LiDAR: Semantic Neural Fields for Novel Space-Time View LiDAR Synthesis |
|
Chen, Yi | Shanghai Jiao Tong University |
Deng, Tianchen | Shanghai Jiao Tong University |
Zhao, Wentao | Shanghai Jiao Tong University |
Xiaoning, Wang | Ruijing Hospital, Shanghai Jiao Tong University School of Medicine |
Wenqian, Xi | Renji Hospital, Shanghai Jiao Tong University School of Medicine |
Chen, Weidong | Shanghai Jiao Tong University |
Wang, Jingchuan | Shanghai Jiao Tong University |
Keywords: Mapping, Deep Learning for Visual Perception, Semantic Scene Understanding
Abstract: Recent research has begun exploring novel view synthesis (NVS) for LiDAR point clouds, aiming to generate realistic LiDAR scans from unseen viewpoints. However, most existing approaches do not reconstruct semantic labels, which are crucial for many downstream applications such as autonomous driving and robotic perception. Unlike images, which benefit from powerful segmentation models, LiDAR point clouds lack such large-scale pre-trained models, making semantic annotation time-consuming and labor-intensive. To address this challenge, we propose SN-LiDAR, a method that jointly performs accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis. Specifically, we employ a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and leverage a CNN-based encoder to extract local semantic features from the current frame point cloud. Extensive experiments on SemanticKITTI and KITTI-360 demonstrate the superiority of SN-LiDAR in both semantic and geometric reconstruction, effectively handling dynamic objects and large-scale scenes. Codes will be available on https://github.com/dtc111111/SN-Lidar.
|
|
13:50-13:55, Paper ThBT18.7 | |
Voxel Map to Occupancy Map Conversion Using Free Space Projection for Efficient Map Representation for Aerial and Ground Robots |
|
Fredriksson, Scott | Luleå University of Technology |
Saradagi, Akshit | Luleå University of Technology, Luleå, Sweden |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Mapping, Field Robots, Motion and Path Planning
Abstract: This article introduces a novel method for converting 3D voxel maps, commonly utilized by robots for localization and navigation, into 2D occupancy maps for both unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs). The generated 2D maps can be used for more efficient global navigation for both UAVs and UGVs, in enabling algorithms developed for 2D maps to be useful in 3D applications, and allowing for faster transfer of maps between multiple agents in bandwidth-limited scenarios. The proposed method uses the free space representation in the voxel mapping solution to generate 2D occupancy maps. During the 3D to 2D map conversion, the method conducts safety checks and eliminates free spaces in the map with dimensions (in the height axis) lower than the robot's safety margins. This ensures that an aerial or ground robot can navigate safely, relying primarily on the 2D map generated by the method. Additionally, the method extracts the height of navigable free space and a local estimate of the slope of the floor from the 3D voxel map. The height data is utilized in converting paths generated using the 2D map into paths in 3D space for both UAVs and UGVs. The slope data identifies areas too steep for a ground robot to traverse, marking them as occupied, thus enabling a more accurate representation of the terrain for ground robots. The proposed method is compared to the existing state-of-the-art fixed projection method in two different environments, over static maps and with progressively expanding maps. The methods proposed in this article have been implemented in the widely-used robotics frameworks ROS and ROS2, and are open-sourced. The code is available at: https://github.com/LTU-RAI/Map-Conversion-3D-Voxel-Map-to-2D-Occupancy-Map.
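The free-space projection idea can be illustrated with a simplified sketch: a 2D cell is marked free only if its voxel column contains a contiguous free span taller than the robot's safety margin, and the bottom of that span gives the navigable floor height. Grid sizes and the clearance value below are illustrative; the linked repository contains the actual ROS/ROS2 implementation.

```python
# Simplified sketch of free-space projection from a 3D voxel map to a 2D occupancy map.
import numpy as np

def project_to_2d(free_voxels, voxel_size=0.1, min_clearance=0.6):
    """free_voxels: (X, Y, Z) boolean array of known-free voxels."""
    X, Y, Z = free_voxels.shape
    occupancy_2d = np.ones((X, Y), dtype=np.uint8)      # 1 = occupied/unknown, 0 = free
    floor_height = np.full((X, Y), np.nan)
    need = int(np.ceil(min_clearance / voxel_size))      # voxels of clearance required
    for x in range(X):
        for y in range(Y):
            col = free_voxels[x, y]
            run, start = 0, None
            for z in range(Z):
                if col[z]:
                    if run == 0:
                        start = z
                    run += 1
                    if run >= need:                      # tall enough for the robot
                        occupancy_2d[x, y] = 0
                        floor_height[x, y] = start * voxel_size
                        break
                else:
                    run = 0
    return occupancy_2d, floor_height

free = np.zeros((20, 20, 15), dtype=bool)
free[:, :, 2:10] = True                                  # open space above a floor layer
free[5:8, 5:8, :] = False                                # a column blocked by an obstacle
occ, heights = project_to_2d(free)
print(occ)
```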
|
|
13:55-14:00, Paper ThBT18.8 | |
A V-Shaped In-Pipe Robot Capable of Drawing Route Maps for Both 3 in and 4 in Diameters Using Only Low-Cost Internal Sensors |
|
Sugizaki, Yuma | Ritsumeikan University |
Kakogawa, Atsushi | Ritsumeikan University |
Keywords: Mapping, Field Robots, Search and Rescue Robots
Abstract: This paper proposes an in-pipe robot capable of creating route maps for narrow pipelines with inner diameters of both 3 in and 4 in. Pipe route drawing often relies on external sensors, such as cameras and light detection and ranging (LiDAR). However, due to lighting, dirt, and spatial constraints within the pipeline, miniaturization has been challenging. Therefore, our method uses only internal sensors, such as a tiny inertial measurement unit (IMU) for robot posture acquisition and an encoder for distance measurement. Furthermore, a sensor-less joint torque control system was implemented using a specially designed printed circuit board with motor current regulation, allowing the robot to traverse the pipeline while suppressing excessive torque generation and slippage. Experiments to verify the performance of our route drawing method were conducted on two types of pipelines with 3 in and 4 in inner diameters. The experiments revealed that the mean absolute error in the length of the straight sections was within 3% for all pipes, that in the rotational angle of the bend pipes was within 2 deg, and that in the direction of the straight pipes was within 2 deg.
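A toy dead-reckoning sketch of the internal-sensor route drawing idea is given below: encoder increments provide travelled distance and the IMU provides heading and pitch, from which a 3D pipe centreline is integrated. The sensor model and the example bend are illustrative assumptions, not the paper's method.

```python
# Toy sketch: integrate encoder distance along IMU-measured directions to draw a pipe route.
import numpy as np

def integrate_route(distances, yaws, pitches):
    """Accumulate small encoder displacements along the IMU-measured direction."""
    pos = np.zeros(3)
    route = [pos.copy()]
    for d, yaw, pitch in zip(distances, yaws, pitches):
        step = d * np.array([np.cos(pitch) * np.cos(yaw),
                             np.cos(pitch) * np.sin(yaw),
                             np.sin(pitch)])
        pos = pos + step
        route.append(pos.copy())
    return np.array(route)

# A straight run followed by a 90-degree bend in the horizontal plane
n = 100
distances = np.full(2 * n, 0.01)                  # 1 cm of travel per encoder sample
yaws = np.concatenate([np.zeros(n), np.linspace(0, np.pi / 2, n)])
pitches = np.zeros(2 * n)
route = integrate_route(distances, yaws, pitches)
print(route[-1])                                  # end point of the reconstructed route
```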
|
|
ThBT19 |
210C |
Aerial Systems: Applications 1 |
Regular Session |
Co-Chair: Shang, Huiliang | Fudan University |
|
13:20-13:25, Paper ThBT19.1 | |
Proximal Control of UAVs with Federated Learning for Human-Robot Collaborative Domains |
|
Nóbrega, Lucas | Universidade Federal Da Paraíba |
Lopes Silva de Oliveira, Ewerton | Politecnico Di Milano |
Saska, Martin | Czech Technical University in Prague |
Nascimento, Tiago | Universidade Federal Da Paraiba |
Keywords: Aerial Systems: Applications, Gesture, Posture and Facial Expressions, Human Detection and Tracking
Abstract: Human-robot interaction (HRI) is a growing area of research. In HRI, complex command (action) classification is still an open problem that often limits the real-world applicability of such techniques. The literature presents some works that use neural networks to detect these actions. However, occlusion is still a major issue in HRI, especially when using uncrewed aerial vehicles (UAVs), since, during the robot's movement, the human operator is often out of the robot's field of view. Furthermore, in multi-robot scenarios, distributed training is also an open problem. In this sense, this work proposes an action recognition and control approach based on a two-layer Long Short-Term Memory (LSTM) deep neural network combined with three densely connected layers and Federated Learning (FL) embedded in multiple drones. FL enables our approach to be trained in a distributed fashion, i.e., with access to data without the need for cloud or other repositories, which facilitates the multi-robot system's learning. Furthermore, our multi-robot approach also mitigates occlusion situations, with experiments on real robots achieving an accuracy greater than 96%.
|
|
13:25-13:30, Paper ThBT19.2 | |
CODE: Complete Coverage UAV Exploration Planner Using Dual-Type Viewpoints for Multi-Layer Complex Environments |
|
Zhu, Huazhang | Fudan University |
Zhao, Xuan | Yiwu Research Institute of Fudan University; Fudan University |
Lan, Tian | Fudan University |
Ma, Shunzheng | Fudan University |
Shang, Huiliang | Fudan University |
Li, Ruijiao | Fudan University |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Search and Rescue Robots
Abstract: We present an autonomous exploration method for unmanned aerial vehicles (UAVs) for three-dimensional (3D) exploration tasks. Our approach, utilizing a cooperation strategy between common viewpoints and frontier viewpoints, fully leverages the agility and flexibility of UAVs, demonstrating faster and more comprehensive exploration than the current state-of-the-art. Common viewpoints, specifically designed for UAV exploration, are evenly distributed throughout the 3D space for 3D exploration tasks. Frontier viewpoints are positioned at the centroids of clusters of frontier points to help the UAV maintain motivation to explore unknown complex 3D environments and navigate through narrow corners and passages. This strategy allows the UAV to access every corner of the 3D environment. Additionally, our method includes a relocation mechanism refined specifically for UAVs. Experimental comparisons show that our method ensures complete exploration coverage in environments with complex terrain. Our method outperforms TARE, DSVP, GBP, and MBP in coverage rate by 64%, 63%, 54%, and 49%, respectively, in the garage environment. In narrow tunnels, ours and DSVP are the only two evaluated methods that achieve complete coverage, with ours outperforming DSVP by 35% in exploration efficiency.
|
|
13:30-13:35, Paper ThBT19.3 | |
Aerial Landing of Micro UAVs on Moving Platforms Considering Aerodynamic Interference |
|
Dong, Xin | Beihang University |
Li, Huadong | Beihang University |
Cui, Yangjie | Beihang University |
Xiang, Jinwu | Beihang University |
Li, Daochun | Beihang University |
Tu, Zhan | Beihang University |
Keywords: Aerial Systems: Applications, Multi-Robot Systems, Aerial Systems: Mechanics and Control
Abstract: Despite numerous studies of landing on moving ground platforms, landing on midair aerial platforms remains a significant challenge due to aerodynamic interference between Unmanned Aerial Vehicles (UAVs). This letter presents a systematic onboard and real-time trajectory optimization solution for the UAV landing on its midair carrier platform. With numerical calculation, the aerodynamic interference between the micro UAV and its carrier UAV is estimated. Based on the estimation result, we construct a no-fly zone around the carrier UAV to minimize the aerodynamic disturbances. A guided landing cone above the carrier is modeled to guide the micro UAV landing on the carrier. Such a design can minimize the carrier UAV position drift induced by the micro UAV’s down-wash flow, ensuring flight safety and response speed constraints simultaneously. The agile and accurate landing trajectory of the micro UAV landing on either hovering or moving carrier UAV is planned by spatial-temporal joint trajectory optimization. The proposed systematic method is validated in both simulation and experimental tests. As a result, the proposed planner can generate optimal trajectories under both hovering and moving conditions of the carrier UAV in about 50 ms with a precision of around 10 cm.
|
|
13:35-13:40, Paper ThBT19.4 | |
CurviTrack: Curvilinear Trajectory Tracking for High-Speed Chase of a USV |
|
Gupta, Parakh M. | Czech Technical University in Prague |
Procházka, Ondřej | Czech Technical University in Prague |
Nascimento, Tiago | Universidade Federal Da Paraiba |
Saska, Martin | Czech Technical University in Prague |
Keywords: Aerial Systems: Applications, Marine Robotics, Optimization and Optimal Control
Abstract: Heterogeneous robot teams used in marine environments incur time-and-energy penalties when the marine vehicle has to halt the mission to allow the autonomous aerial vehicle to land for recharging. In this paper, we present a solution for this problem using a novel drag-aware model formulation which is coupled with Model Predictive Control (MPC), and therefore, enables tracking and landing during high-speed curvilinear trajectories of an Unmanned Surface Vehicle (USV) without any communication. Compared to the state-of-the-art, our approach yields a 40% decrease in prediction errors, and provides a 3-fold increase in certainty of predictions. Consequently, this leads to a 30% improvement in tracking performance and 40% higher success in landing on a moving USV even during aggressive turns that are unfeasible for conventional marine missions. We test our approach in two different real-world scenarios with marine vessels of two different sizes and further solidify our results through statistical analysis in simulation to demonstrate the robustness of our method.
|
|
13:40-13:45, Paper ThBT19.5 | |
SkateDuct: Utilizing Vector Thrust of Ducted Fan UAVs for Terrestrial-Aerial Locomotion |
|
Yin, Zhong | South China University of Technology |
Pei, Hai-Long | South China University of Technology |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control
Abstract: Ducted fan UAVs (DFUAVs), characterized by vector thrust, vertical takeoff and landing (VTOL) capabilities, and high safety, have found widespread applications in both military and civilian scenarios. However, their limited endurance remains a significant constraint on their broader applications. To address this challenge, in this letter we explore a novel approach that exploits the vector thrust capabilities of DFUAVs to enable terrestrial-aerial locomotion through simple modifications without the need for additional actuators. The design of a DFUAV employing passive wheels for continuous ground and aerial operation is presented. This configuration allows for unchanged attitude and static stability during ground movement, with only a 10.3% increase in weight. Fluid simulations were conducted to analyze the variation in control vane aerodynamic efficiency under ground effect, leading to the development of a ground-effect-adjusted aerodynamic model based on experimental data. Furthermore, the dynamics of ground movement are analyzed, and a corresponding controller is developed, establishing a complete framework for seamless transition between terrestrial and aerial modes. Extensive real-world flight experiments validate the proposed structural design and control methods. By utilizing terrestrial locomotion, the UAV's energy consumption is reduced to just 33.9% of that during flight, effectively extending its operational duration by more than ten times.
|
|
13:45-13:50, Paper ThBT19.6 | |
WireFlie: A Novel Obstacle-Overcoming Mechanism for Autonomous Transmission Line Inspection Drones |
|
Sehgal, Aditya | International Institute of Information Technology, Hyderabad |
Mahammad, Zahiruddin | University of Maryland College Park |
Sesetti, Poorna Sasank | International Institute of Information Technology Hyderabad |
Badhwar, Ankur | International Institute of Information Technology - Hyderabad |
Govindan, Nagamanikandan | IIITDM Kancheepuram |
Keywords: Aerial Systems: Applications, Grippers and Other End-Effectors, Field Robots
Abstract: Robotic inspection of transmission lines presents significant challenges due to the complexity of navigating along the wires. Existing systems often rely on either flight modes for visual inspection or articulated crawling mechanisms for contact-based inspection. However, these approaches face limitations in effectively bypassing in-line obstacles or pylons, which are common in transmission line environments. This paper presents WireFlie, a novel hybrid robotic system that integrates rolling and flight modes to overcome these challenges. The system consists of a pair of underactuated arms mounted on a drone platform, designed for secure, collision-free latching and detaching, enabling seamless transitions between locomotion modes. WireFlie supports both single-arm and dual-arm rolling, allowing it to bypass in-line obstacles such as Stockbridge dampers, dual spacers, and sleeves, and to overcome larger obstacles like pylons using flight. Additionally, we propose a high-level controller for autonomous latching, detaching, and obstacle avoidance. Experiments are conducted on a custom-made setup that closely resembles a transmission wire. We evaluate both the design and control aspects of our system, with results including kinematic analysis, wire detection, autonomous latching and the corresponding trajectory, and obstacle detection and avoidance strategies. This research contributes to the field of robotic infrastructure inspection by merging aerial and wire-based locomotion, providing efficient and autonomous monitoring of power lines.
|
|
13:50-13:55, Paper ThBT19.7 | |
Versatile Perching Using a Passive Mechanism with under Actuated Fingers for Multirotor UAV |
|
Kominami, Takamasa | Ritsumeikan University |
Shimonomura, Kazuhiro | Ritsumeikan University |
Keywords: Aerial Systems: Applications, Actuation and Joint Mechanisms, Grasping
Abstract: This research proposes a passively driven perching mechanism designed to land not only on flat surfaces but also on variously shaped objects, such as bars, plates, and spherical objects, without the use of additional actuators. The perching mechanism consists of three underactuated fingers and utilizes two passive drive mechanisms: one for landing by pinching without bending the fingers, and another for landing by bending the fingers to grasp. As a result, a lightweight apparatus capable of landing on various objects without the need for additional CPUs, motors, or batteries for actuation and control was achieved. The gripping force generated by this mechanism was calculated, and it was found to vary depending on the size of the object. Finally, the actual forces exerted by the fingers during landing on objects such as safety cones, boards, and pipes were measured and compared with theoretical values.
|
|
13:55-14:00, Paper ThBT19.8 | |
TALKER: A Task-Activated Language Model Based Knowledge-Extension Reasoning System |
|
Lou, Jiabin | Beihang University |
Shi, Rongye | Beihang University |
Lin, Yuxin | Beihang University |
Wang, Qunbo | Institute of Automation, Chinese Academy of Sciences |
Wu, Wenjun | Beihang University |
Keywords: Aerial Systems: Applications, Integrated Planning and Learning, Swarm Robotics
Abstract: Training drones to execute complex collective tasks via multi-agent reinforcement learning presents significant challenges. To address these challenges, this paper introduces the Task-Activated Language model-based Knowledge-Extension Reasoning system. Specifically, we trained drones in two fine-grained skills and developed an action primitive library based on these capabilities, enabling a hierarchical approach to managing complex swarm operations. Leveraging this primitive library, we employ large language models to perform task planning, continuously refining the planning outcomes based on external user feedback. Successful task codes are temporarily stored within the action primitive library, with their utilization being authorized based on internal feedback from maintainers. We define this process as knowledge expansion. In addition, more refined customized prompts are generated based on task descriptions and the action primitive documentation, a mechanism referred to as Task Activation. Our system synergistically integrates task activation and knowledge expansion mechanisms, enabling continuous evolution through human feedback to effectively manage extensive swarms in the execution of complex collective tasks. Experimental results demonstrate the superior performance of our system in various drone swarm tasks, including collaborative search, object tracking, cooperative interception, and aerial patrol.
|
|
ThBT20 |
210D |
Perception for Grasping and Manipulation 2 |
Regular Session |
Chair: Zhou, Peng | Great Bay University |
Co-Chair: Gao, Yixing | Jilin University |
|
13:20-13:25, Paper ThBT20.1 | |
Variation-Robust Few-Shot 3D Affordance Segmentation for Robotic Manipulation |
|
Hu, Dingchang | Tsinghua University |
Sun, Tianyu | Tsinghua University |
Xie, Pengwei | Tsinghua University |
Chen, Siang | Tsinghua University |
Yang, Huazhong | Tsinghua University |
Wang, Guijin | Tsinghua University |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Recognition
Abstract: Traditional affordance segmentation on 3D point cloud objects requires massive amounts of annotated training data and can only make predictions within predefined classes and affordance tasks. To overcome these limitations, we propose a variation-robust few-shot 3D affordance segmentation network (VRNet) for robotic manipulation, which requires only several affordance annotations for novel object classes and manipulation tasks. In particular, we design an orientation-tolerant feature extractor to address pose variation between support and query point cloud objects, and present a multiscale label propagation algorithm for variation in completeness. Extensive experiments on affordance datasets show that VRNet provides the best segmentation performance compared with previous works. Moreover, experiments in real robotic scenarios demonstrate the generalization ability of our method.
|
|
13:25-13:30, Paper ThBT20.2 | |
Embodied Multi-Modal Sensing with a Soft Modular Arm Powered by Physical Reservoir Computing |
|
Wang, Jun | Virginia Tech |
Suyi, Li | Virginia Tech |
Keywords: Perception for Grasping and Manipulation, Sensor Networks, Dynamics
Abstract: Soft robots have become increasingly popular for complex manipulation tasks requiring gentle and safe contact. However, their softness makes accurate control challenging, and high-fidelity sensing is a prerequisite to adequate control performance. To this end, many flexible and embedded sensors have been created over the past decade, but they inevitably increase the robot's complexity and stiffness. This study demonstrates a novel approach that uses simple bending strain gauges embedded inside a modular arm to extract complex information regarding its deformation and working conditions. The core idea is based on physical reservoir computing (PRC): A soft body's rich nonlinear dynamic responses, captured by the inter-connected bending sensor network, could be utilized for complex multi-modal sensing with a simple linear regression algorithm. Our results show that the soft modular arm reservoir can accurately predict body posture (bending angle), estimate payload weight, determine payload orientation, and even differentiate two payloads with only a minimal difference in weight, all using minimal digital computing power.
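Conceptually, the physical reservoir computing readout described above reduces to a linear regression from the body's sensor signals to the quantity of interest. The sketch below uses synthetic "strain gauge" states and a ridge-regression readout purely for illustration; the data generation and dimensions are assumptions, not the authors' hardware pipeline.

```python
# Conceptual sketch of a physical-reservoir-computing readout: linear regression only.
import numpy as np

rng = np.random.default_rng(1)

T, n_sensors = 500, 8
true_angle = np.sin(np.linspace(0, 6 * np.pi, T))            # quantity to be estimated
mix = rng.normal(size=(n_sensors, 3))
# Reservoir states: nonlinear, cross-coupled transformations of the driving signal + noise
states = np.tanh(np.stack([true_angle, true_angle**2, np.gradient(true_angle)], 1) @ mix.T)
states += 0.01 * rng.normal(size=states.shape)

# Linear readout trained by ridge regression (the only "learning" in a PRC setup)
lam = 1e-3
W_out = np.linalg.solve(states.T @ states + lam * np.eye(n_sensors), states.T @ true_angle)
prediction = states @ W_out
print("readout RMSE:", np.sqrt(np.mean((prediction - true_angle) ** 2)))
```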
|
|
13:30-13:35, Paper ThBT20.3 | |
Bayesian Morphology Optimization for Musculoskeletal Systems |
|
Zhao, Jing | North China Electric Power University |
Yang, Yubo | Shanghai Institute of Technology |
Wang, Yinsong | North China Electric Power University |
Zhang, Shuyuan | Beijing University of Posts and Telecommunications |
Luo, Yu | Tsinghua University |
Liu, Huaping | Tsinghua University |
Huang, Liangjun | Shanghai Institute of Technology |
Keywords: Perception for Grasping and Manipulation
Abstract: In this study, we focus on enhancing the policy of a musculoskeletal arm to develop grasping abilities for objects of varying weights. The agent is modeled using MyoSuite, a platform with realistic biomechanics where muscles drive skeletal movement. We observed that optimizing only the control policy is insufficient for handling heavy object grasping, highlighting the limitations of traditional control-focused approaches. To address this issue, we shift our focus to muscle development by optimizing the arm's muscle parameters. However, this remains challenging for two main reasons. First, the high dimensionality of the muscle parameter space makes it difficult to find optimal designs. Second, evaluating new muscle configurations requires training a control policy, leading to high computational costs. To tackle these challenges, we adopt two strategies. First, we simplify the problem by optimizing only the stiffness parameters, as they have the greatest impact on grasping performance. Second, we apply the Bayesian Morphology Optimization Method (BMO) to efficiently search the parameter space. Compared to genetic algorithms (GA), BMO finds better solutions with fewer evaluations. Experimental results show that BMO achieves similar rewards with 20% fewer iterations than GA and improves the success rate by 10%. In summary, muscle optimization provides an effective solution for grasping tasks, and BMO demonstrates efficient, robust, and generalizable performance in optimizing muscle parameters for such tasks.
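A schematic Bayesian-optimization loop over stiffness parameters, in the spirit of the BMO method summarized above, is sketched below. The objective is a cheap synthetic stand-in for the expensive "train a policy and return its reward" evaluation, and the Matern kernel and expected-improvement acquisition are illustrative choices, not necessarily the authors'.

```python
# Schematic Bayesian optimization over a reduced stiffness design space.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
dim = 4                                            # reduced design space: stiffness only

def evaluate_design(stiffness):
    """Placeholder for the expensive policy-training evaluation (returns a reward)."""
    return -np.sum((stiffness - 0.6) ** 2) + 0.01 * rng.normal()

X = rng.uniform(0, 1, size=(5, dim))               # initial random designs
y = np.array([evaluate_design(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):
    gp.fit(X, y)
    cand = rng.uniform(0, 1, size=(256, dim))      # candidate designs
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.max()
    z = (mu - best) / (sigma + 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, evaluate_design(x_next))

print("best stiffness found:", X[np.argmax(y)], "reward:", y.max())
```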
|
|
13:35-13:40, Paper ThBT20.4 | |
Free-Form Language-Based Robotic Reasoning and Grasping |
|
Jiao, Runyu | Fondazione Bruno Kessler, University of Trento |
Fasoli, Alice | Fondazione Bruno Kessler |
Giuliari, Francesco | Fondazione Bruno Kessler |
Bortolon, Matteo | Istituto Italiano Di Tecnologia; Fondazione Bruno Kessler; Unive |
Povoli, Sergio | Fondazione Bruno Kessler |
Mei, Guofeng | Fondazione Bruno Kessler |
Wang, Yiming | Fondazione Bruno Kessler |
Poiesi, Fabio | Fondazione Bruno Kessler |
Keywords: Perception for Grasping and Manipulation
Abstract: Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs’ world knowledge to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o’s zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or if other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset FreeGraspData by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses with both FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: https://tev-fbk.github.io/FreeGrasp/.
|
|
13:40-13:45, Paper ThBT20.5 | |
DarkSeg: Infrared-Driven Semantic Segmentation for Garment Grasping Detection in Low-Light Conditions |
|
Zhong, Haifeng | Jilin University |
Tang, Fan | Chinese Academy of Sciences |
Chang, Hyung Jin | University of Birmingham |
Zhu, Xingyu | Jilin University |
Gao, Yixing | Jilin University |
Keywords: Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization
Abstract: Garment grasping in low-light environments is a critical challenge for domestic intelligent robots, yet existing research has not sufficiently addressed this issue. In low-light conditions, the scarcity of visual features due to insufficient illumination causes different categories of garments to exhibit ambiguous feature similarities, thereby hindering the robot's ability to detect the categories of different garments. Although traditional methods can compensate for visual deficiencies in low-light scenarios by applying preprocessing strategies that fuse infrared multimodal features, their complex computational processes incur significant computational overhead. To address this limitation, we propose DarkSeg, a low-light garment detection model based on a student-teacher framework. The innovation of DarkSeg lies in its replacement of complex multimodal feature fusion with an indirect feature alignment mechanism between the student and teacher models, thereby circumventing high computational demands. Through feature alignment, DarkSeg enables the student model to learn illumination-invariant structural representations from the infrared features provided by the teacher model, effectively correcting structural deficiencies in low-light environments. Furthermore, to evaluate DarkSeg's feasibility for low-light clothing grasping, we propose a depth-perceptive grasping strategy and build a low-light multimodal garment detection dataset, DarkClothes. Extensive experiments deploying DarkSeg on a Baxter robot demonstrate that DarkSeg achieves a 22% improvement in the grasping success rate while reducing the model parameters by 99.08 million compared to traditional methods, validating the practical viability of DarkSeg for robotic garment grasping in low-light conditions.
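The indirect feature-alignment idea can be sketched as a simple distillation loss: the student, fed low-light RGB, is pushed to reproduce the frozen teacher's infrared-derived features, so no infrared input is needed at inference time. The tiny encoders and loss weighting below are assumptions, not the DarkSeg architecture.

```python
# Sketch of student-teacher feature alignment across modalities (not the DarkSeg code).
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

student = TinyEncoder(in_ch=3)                      # low-light RGB input
teacher = TinyEncoder(in_ch=1)                      # infrared input (kept frozen)
for p in teacher.parameters():
    p.requires_grad = False

rgb = torch.rand(2, 3, 64, 64)
ir = torch.rand(2, 1, 64, 64)

feat_s = student(rgb)
with torch.no_grad():
    feat_t = teacher(ir)

align_loss = nn.functional.mse_loss(feat_s, feat_t)   # feature alignment term
# total_loss = seg_loss + lambda_align * align_loss   # combined with a segmentation loss
align_loss.backward()
print(float(align_loss))
```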
|
|
13:45-13:50, Paper ThBT20.6 | |
Keypoint-Aware RAG for Robotic Manipulation: In-Context Constraint Learning Via Large-Scale Retrieval |
|
Lin, Jiuzhou | Tsinghua University |
Yang, Qi | Tsinghua University |
Li, Yizhe | Tsinghua University |
Dong, Kangkang | Jianghuai Advanced Technology Center |
Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Keywords: Perception for Grasping and Manipulation, AI-Based Methods, Learning from Demonstration
Abstract: Recent advances in robotic manipulation leverage foundation models pre-trained on internet-scale data, where keypoint-based representations have shown promising results in spatial reasoning. However, existing approaches primarily focus on zero-shot generalization or human-collected demonstrations, with limited exploration of large-scale robotic datasets. In this work, we propose Keypoint-Aware Retrieval Augmented Generation (KARAG), a simple yet novel framework that synergistically integrates visual-language models (VLMs) with robotic datasets through retrieval-augmented generation (RAG). Our framework bridges the retrieval and generation phases via in-context learning with keypoint-aware constraints, enabling simultaneous utilization of internet-scale knowledge and structured robotic datasets. Extensive experiments in both simulated and real-world environments demonstrate that KARAG significantly enhances the stability and accuracy of VLM-generated outputs without requiring human demonstrations or additional training, achieving 10%-20% success rate improvements in real-world scenarios and 12%-46% improvements in simulation over the baseline. Furthermore, we present an algorithm for converting large robotic datasets into Keyframe-Keypoint-Trajectory representations to facilitate retrieval. Our dataset and implementation are publicly available at https://github.com/RobertAckleyLin/KARAG/.
|
|
13:50-13:55, Paper ThBT20.7 | |
Measuring Uncertainty in Shape Completion to Improve Grasp Quality |
|
Ferreira Duarte, Nuno | IST-ID |
Mohammadi, Seyedsaber | Istituto Italiano Di Tecnologia (IIT) |
Moreno, Plinio | IST-ID |
Del Bue, Alessio | Istituto Italiano Di Tecnologia |
Santos-Victor, José | Instituto Superior Técnico - University of Lisbon |
Keywords: Perception for Grasping and Manipulation, RGB-D Perception, Grasping
Abstract: Shape completion networks have recently been used in real-world robotic experiments to complete the missing/hidden information in environments where objects are observed in only one or a few views and self-occlusions are bound to occur. Most approaches rely on deep neural networks that handle rich 3D point cloud data, leading to more precise and realistic object geometries. However, these models still suffer from inaccuracies due to their nondeterministic/stochastic inference, which can lead to poor performance in grasping scenarios where these errors compound into unsuccessful grasps. We present an approach to calculate the uncertainty of a 3D shape completion model during inference on single-view point clouds of an object on a tabletop. In addition, we propose an update to the quality score of grasp pose algorithms by introducing the uncertainty of the completed point cloud at the grasp candidates. To test our full pipeline, we perform real-world grasping with a 7-DOF robotic arm with a two-finger gripper on a large set of household objects and compare against previous approaches that do not measure uncertainty. Our approach ranks grasp quality better, leading to a higher grasp success rate for the top five ranked grasp candidates compared to the state of the art.
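One common way to realize the idea above is to run a stochastic completion model several times, treat per-point variance as uncertainty, and down-weight grasp candidates whose contact regions fall on uncertain points. The sketch below follows that recipe with a placeholder completion function; the penalty weight and radius are assumptions, not the authors' scoring rule.

```python
# Illustrative sketch: uncertainty from repeated stochastic shape completion penalizes grasps.
import numpy as np

rng = np.random.default_rng(0)

def complete_shape(partial_cloud):
    """Stand-in for a stochastic completion network: returns a completed cloud (M, 3)."""
    noise = 0.02 * rng.normal(size=(256, 3))
    return np.vstack([partial_cloud, partial_cloud.mean(0) + noise])

partial = rng.uniform(-0.05, 0.05, size=(128, 3))
samples = np.stack([complete_shape(partial) for _ in range(10)])   # (K, M, 3)
mean_cloud = samples.mean(axis=0)
point_uncertainty = samples.std(axis=0).mean(axis=1)               # per-point std (M,)

def adjusted_quality(base_quality, grasp_center, radius=0.03, weight=5.0):
    """Penalize a grasp score by the mean uncertainty of completed points near the grasp."""
    near = np.linalg.norm(mean_cloud - grasp_center, axis=1) < radius
    u = point_uncertainty[near].mean() if near.any() else point_uncertainty.max()
    return base_quality - weight * u

print(adjusted_quality(base_quality=0.8, grasp_center=mean_cloud[0]))
```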
|
|
ThBT21 |
101 |
Machine Learning for Robot Control 2 |
Regular Session |
Co-Chair: Xiao, Xuesu | George Mason University |
|
13:20-13:25, Paper ThBT21.1 | |
Low-Latency Privacy-Aware Robot Behavior Guided by Automatically Generated Text Datasets |
|
Irisawa, Yuta | Aoyama Gakuin University |
Yamazaki, Tomoaki | Aoyama Gakuin University |
Ito, Seiya | National Institute of Information and Communications Technology |
Kurita, Shuhei | National Institute of Informatics |
Akasaka, Ryota | University of Osaka |
Onishi, Masaki | National Inst. of AIST |
Ohara, Kouzou | Aoyama Gakuin University |
Sakurada, Ken | Kyoto University |
Keywords: Machine Learning for Robot Control, Modeling and Simulating Humans
Abstract: Humans typically avert their gaze when faced with situations involving another person's privacy, and humanoid robots should exhibit similar behaviors. Various approaches exist for privacy recognition, including an image privacy recognition model and a Large Vision-Language Model (LVLM). The former relies on datasets of labeled images, which raise ethical concerns, while the latter requires more time to recognize images accurately, making real-time responses difficult. To this end, we propose a method of automatically constructing the LLM Privacy Text Dataset (LPT Dataset), a privacy-related text dataset with privacy indicators, and a method of recognizing whether observing a scene violates privacy without ethically sensitive training images. In constructing the LPT Dataset, which consists of both private and public scenes, we use an LLM to define privacy indicators and generate texts scored for each indicator. Our model recognizes whether a given image is private or public by retrieving texts with privacy scores similar to the image in a multi-modal feature space. In our experiments, we evaluated the performance of our model on three image privacy datasets and a realistic experiment with a humanoid robot in terms of accuracy and responsiveness. The experiments show that our approach identifies private images as accurately as the highly tuned LVLM, without delay.
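Assuming image and text embeddings already live in a shared multimodal feature space (e.g. from a CLIP-style encoder, which is not shown here), the retrieval step can be sketched as a nearest-neighbour lookup over privacy-scored texts; the embeddings, scores, and threshold below are synthetic placeholders, not the authors' dataset.

```python
# Schematic retrieval of privacy scores by similarity in a shared embedding space.
import numpy as np

rng = np.random.default_rng(0)

text_embeddings = rng.normal(size=(1000, 512))                 # LPT-style dataset entries
text_privacy_scores = rng.uniform(0, 1, size=1000)             # per-text privacy indicator score
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)

def is_private(image_embedding, k=16, threshold=0.5):
    """Classify a scene by averaging the privacy scores of the k most similar texts."""
    v = image_embedding / np.linalg.norm(image_embedding)
    sims = text_embeddings @ v                                 # cosine similarities
    top = np.argsort(sims)[-k:]
    return text_privacy_scores[top].mean() > threshold

print(is_private(rng.normal(size=512)))
```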
|
|
13:25-13:30, Paper ThBT21.2 | |
Dyna-LfLH: Learning Agile Navigation in Dynamic Environments from Learned Hallucination |
|
Ghani, Saad Abdul | George Mason University |
Wang, Zizhao | University of Texas - Austin |
Stone, Peter | The University of Texas at Austin |
Xiao, Xuesu | George Mason University |
Keywords: Machine Learning for Robot Control, Motion and Path Planning, Deep Learning Methods
Abstract: This paper introduces Dynamic Learning from Learned Hallucination (Dyna-LfLH), a self-supervised method for training motion planners to navigate environments with dense and dynamic obstacles. Classical planners struggle with dense, unpredictable obstacles due to limited computation, while learning-based planners face challenges in acquiring high-quality demonstrations for imitation learning or dealing with exploration inefficiencies in reinforcement learning. Building on Learning from Hallucination (LfH), which synthesizes training data from past successful navigation experiences in simpler environments, Dyna-LfLH incorporates dynamic obstacles by generating them through a learned latent distribution. This enables efficient and safe motion planner training. We evaluate Dyna-LfLH on a ground robot in both simulated and real environments, achieving up to a 25% improvement in success rate compared to baselines.
|
|
13:30-13:35, Paper ThBT21.3 | |
Uncertainty-Aware Planning with Inaccurate Models for Robotized Liquid Handling |
|
Faroni, Marco | Politecnico Di Milano |
Odesco, Carlo | Politecnico Di Milano |
Zanchettin, Andrea Maria | Politecnico Di Milano |
Rocco, Paolo | Politecnico Di Milano |
Keywords: Machine Learning for Robot Control, Motion and Path Planning
Abstract: Physics-based simulations and learning-based models are vital for complex robotics tasks like deformable object manipulation and liquid handling. However, these models often struggle with accuracy due to epistemic uncertainty or the sim-to-real gap. For instance, accurately pouring liquid from one container to another poses challenges, particularly when models are trained on limited demonstrations and may perform poorly in novel situations. This paper proposes an uncertainty-aware Monte Carlo Tree Search (MCTS) algorithm designed to mitigate these inaccuracies. By incorporating estimates of model uncertainty, the proposed MCTS strategy biases the search towards actions with lower predicted uncertainty. This approach enhances the reliability of planning under uncertain conditions. Applied to a liquid pouring task, our method demonstrates improved success rates even with models trained on minimal data, outperforming traditional methods and showcasing its potential for robust decision-making in robotics.
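The core mechanism, biasing tree search away from actions whose model predictions are uncertain, can be sketched as a UCT-style selection rule with an added uncertainty penalty. The node statistics and penalty weight below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of uncertainty-penalized UCT selection inside an MCTS planner.
import math

def select_child(node, c_explore=1.4, c_uncertainty=2.0):
    """UCT selection with an extra penalty for children whose model rollouts are uncertain."""
    def score(child):
        exploit = child.value_sum / (child.visits + 1e-9)
        explore = c_explore * math.sqrt(math.log(node.visits + 1) / (child.visits + 1e-9))
        return exploit + explore - c_uncertainty * child.model_uncertainty
    return max(node.children, key=score)

class Node:
    def __init__(self, model_uncertainty):
        self.children = []
        self.visits = 0
        self.value_sum = 0.0
        self.model_uncertainty = model_uncertainty   # e.g. ensemble variance of the pouring model

root = Node(0.0); root.visits = 10
for u, v in [(0.05, 3.0), (0.30, 3.2)]:              # second child looks better but is uncertain
    child = Node(u); child.visits = 5; child.value_sum = v
    root.children.append(child)
print(select_child(root).model_uncertainty)           # prefers the low-uncertainty action
```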
|
|
13:35-13:40, Paper ThBT21.4 | |
Trajectory Progress-Based Prioritizing and Intrinsic Reward Mechanism for Robust Training of Robotic Manipulations (I) |
|
Liang, Weixiang | University of Macau |
Liu, Yinlong | City University of Macau |
Wang, Jikun | University of Macau |
Yang, Zhi-Xin | University of Macau |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Deep Learning in Grasping and Manipulation
Abstract: Training robots by model-free deep reinforcement learning (DRL) to carry out robotic manipulation tasks without sufficient successful experiences is challenging. Hindsight experience replay (HER) is introduced to enable DRL agents to learn from failure experiences. However, the HER-enabled model-free DRL still suffers from limited training performance due to its uniform sampling strategy and scarcity of reward information in the task environment. Inspired by the progress incentive mechanism in human psychology, we propose Progress Intrinsic Motivation-based HER (P-HER) in this work to overcome these difficulties. First, the Trajectory Progress-based Prioritized Experience Replay (TPPER) module is developed to prioritize sampling valuable trajectory data thereby achieving more efficient training. Second, the Progress Intrinsic Reward (PIR) module is introduced in agent training to add extra intrinsic rewards for encouraging the agents throughout the exploration of task space. Experiments in challenging robotic manipulation tasks demonstrate that our P-HER method outperforms original HER and state-of-the-art HER-based methods in training performance. Our code of P-HER and its experimental videos in both virtual and real environments are available at https://github.com/weixiang-smart/P-HER.
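The two ingredients summarized above can be sketched as (1) sampling stored trajectories with probability proportional to a progress measure and (2) granting a small intrinsic bonus whenever the agent moves closer to the goal. The progress definition, temperature, and bonus scale below are illustrative assumptions, not the P-HER implementation.

```python
# Toy sketch of progress-based trajectory prioritization and a progress intrinsic reward.
import numpy as np

rng = np.random.default_rng(0)

def trajectory_progress(achieved_goals, desired_goal):
    """Progress = how much closer to the goal the trajectory got from start to best point."""
    d = np.linalg.norm(achieved_goals - desired_goal, axis=1)
    return max(d[0] - d.min(), 0.0)

def sample_trajectories(buffer, batch_size, temperature=1.0):
    """Sample trajectory indices with probability proportional to softmax(progress)."""
    progress = np.array([trajectory_progress(t["achieved"], t["goal"]) for t in buffer])
    p = np.exp(progress / temperature)
    p /= p.sum()
    return rng.choice(len(buffer), size=batch_size, p=p)

def intrinsic_reward(prev_dist, new_dist, scale=0.1):
    """Small bonus whenever the end-effector moves closer to the goal."""
    return scale * max(prev_dist - new_dist, 0.0)

buffer = [{"achieved": rng.uniform(0, 1, (50, 3)), "goal": np.array([0.5, 0.5, 0.5])}
          for _ in range(100)]
print(sample_trajectories(buffer, batch_size=8))
```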
|
|
13:40-13:45, Paper ThBT21.5 | |
Learning-Based Quadruped Robot Framework for Locomotion on Dynamic Rigid Platforms |
|
Huang, Kaihui | Sun Yat-Sen University |
Feng, Heming | Sun Yat-Sen University |
Meng, Wei | Guangdong University of Technology |
Wei, Tianqi | Sun Yat-Sen University |
Hu, Tianjiang | Sun Yat-Sen University |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Legged Robots
Abstract: Typical robot controllers assume firm ground, limiting their effectiveness in controlling robots on dynamic platforms such as trucks or ships. To address this limitation, we propose a reinforcement learning framework for robot locomotion on dynamic rigid platforms and a simulation in which 6-DoF dynamic platforms emulate ship oscillation. The framework enables a reinforcement learning model to estimate platform motion during robot locomotion control. In the simulation, our framework significantly reduces the quadruped robot’s fall rate and trajectory deviation compared to baseline controllers. Experiments on a real robot show that our framework enabled a quadruped robot to adapt to platform motions, including those that threw the robot into the air, while baseline models struggled in this case. Thus, our framework can advance the deployment of robots in real-world marine and vehicular applications.
|
|
13:45-13:50, Paper ThBT21.6 | |
Offline-To-Online Learning Enabled Robust Control for Uncertain Robotic Systems Pursuing Constraint-Following (I) |
|
Zheng, Runze | Hunan University |
Cheng, Tianxiang | Hunan University |
Zhang, Xinglong | National University of Defense Technology |
Zhang, Zheshuo | Hangzhou City University |
Jing, Xingjian | City University of Hong Kong |
Yin, Hui | Hunan University |
Keywords: Machine Learning for Robot Control, Robust/Adaptive Control, Deep Learning Methods
Abstract: A major challenge in robust control design of robotic systems is finding a comprehensive uncertainty bound (CUB) with low conservativeness for uncertainty compensation. This study proposes a two-phase learning approach to learn the CUB for robust control of robotic systems, considering uncertainties with unknown bounds. The goal is to drive the system to follow a class of servo constraints that may be nonholonomic, i.e., constraint-following control (CFC) design. The first phase trains a deep neural network (DNN) to approximate the ensemble system uncertainty using offline supervised learning. The second phase constructs an adaptive law to learn the CUB online, covering the offline learning error. To our knowledge, this is the first CFC combining DNN with adaptive law to learn a less conservative CUB to save control effort and to eliminate complex manual derivations. The effectiveness and merits of the proposed control are endorsed by theoretical proofs, simulations, and experiments on a quadrotor unmanned aerial vehicle.
|
|
13:50-13:55, Paper ThBT21.7 | |
FlowNav: Combining Flow Matching and Depth Priors for Efficient Navigation |
|
Gode, Samiran | University of Technology Nuremberg |
Nayak, Abhijeet Kishore | University of Technology Nuremberg |
Oliveira, Débora | University of Technology Nuremberg |
Krawez, Michael | University of Technology Nuremberg |
Schmid, Cordelia | Inria |
Burgard, Wolfram | University of Technology Nuremberg |
Keywords: Machine Learning for Robot Control, Visual Learning, Vision-Based Navigation
Abstract: Effective robot navigation in unseen environments is a challenging task that requires precise control actions at high frequencies. Recent advances have framed it as an image-goal-conditioned control problem, where the robot generates navigation actions using frontal RGB images. Current state-of-the-art methods in this area use diffusion policies to generate these control actions. Despite their promising results, these models are computationally expensive and suffer from weak perception. To address these limitations, we present FlowNav, a novel approach that uses a combination of conditional flow matching (CFM) and depth priors from off-the-shelf foundation models to learn action policies for robot navigation. FlowNav is significantly more accurate and faster at navigation and exploration than state-of-the-art methods. We validate our contributions using real robot experiments in multiple environments, demonstrating improved navigation reliability and accuracy. Code and trained models are publicly available.
|
|
13:55-14:00, Paper ThBT21.8 | |
ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing |
|
Scherl, Alessandro | University of Alicante |
Thalhammer, Stefan | TU Wien |
Neuberger, Bernhard | TU Wien |
Wöber, Wilfried | UAS Technikum Wien, University of Natural Resources and Life Sci |
Garcia Rodriguez, Jose | Universidad De Alicante |
Keywords: Machine Learning for Robot Control, Visual Servoing, Sensor-based Control
Abstract: Visual servoing enables robots to precisely position their end-effector relative to a target object. While classical methods rely on hand-crafted features and thus are universally applicable without task-specific training, they often struggle with occlusions and environmental variations, whereas learning-based approaches improve robustness but typically require extensive training. We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction, combining the advantages of both paradigms while also being able to generalize beyond the provided sample. Our approach achieves full convergence in unperturbed scenarios and surpasses classical image-based visual servoing by up to 18.2% in perturbed scenarios. Even the convergence rates of learning-based methods are matched despite requiring no task- or object-specific training. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects using only a reference from the same category. Our code and simulation environment are available at: https://alessandroscherl.github.io/ViT-VS/
|
|
ThBT22 |
102A |
Dual Arm Manipulation 2 |
Regular Session |
Chair: Borràs Sol, Júlia | Institut De Robòtica I Informàtica Industrial (CSIC-UPC) |
|
13:20-13:25, Paper ThBT22.1 | |
Dual-Arm Hierarchical Planning for Laboratory Automation: Vibratory Sieve Shaker Operations |
|
Xiao, Haoran | National University of Defense Technology |
Wang, Xue | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Zeng, Zhiwen | National University of Defense Technology |
Guo, Zirui | The National University of Defense Technology |
Ni, Ziqi | National University of Defense Technology |
Ye, Yicong | National University of Defense Technology |
Dai, Wei | National University of Defense Technology |
Keywords: Dual Arm Manipulation, Cooperating Robots, Collision Avoidance
Abstract: This paper addresses the challenges of automating vibratory sieve shaker operations in a materials laboratory, focusing on three critical tasks: 1) dual-arm lid manipulation in 3 cm clearance spaces, 2) bimanual handover in overlapping workspaces, and 3) obstructed powder sample container delivery with orientation constraints. These tasks present significant challenges, including inefficient sampling in narrow passages, the need for smooth trajectories to prevent spillage, and suboptimal paths generated by conventional methods. To overcome these challenges, we propose a hierarchical planning framework combining Prior-Guided Path Planning and Multi-Step Trajectory Optimization. The former uses a finite Gaussian mixture model to improve sampling efficiency in narrow passages, while the latter refines paths by shortening, simplifying, imposing joint constraints, and B-spline smoothing. Experimental results demonstrate the framework's effectiveness: planning time is reduced by up to 80.4%, and waypoints are decreased by 89.4%. Furthermore, the system completes the full vibratory sieve shaker operation workflow in a physical experiment, validating its practical applicability for complex laboratory automation.
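The prior-guided sampling idea can be illustrated with a small sketch (my own illustration under assumed data and dimensions, not the paper's planner): a finite Gaussian mixture is fitted to previously successful configurations near the narrow passage, and the sampler draws from it with some probability instead of sampling uniformly:

```python
# Illustrative sketch of GMM-biased sampling for a 7-DoF arm (synthetic prior data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
prior_configs = rng.normal(0.0, 0.3, size=(200, 7))    # stand-in for recorded narrow-passage waypoints

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(prior_configs)

def sample_configuration(p_prior=0.7, joint_limits=(-3.14, 3.14)):
    """Draw from the learned prior with probability p_prior, else uniformly."""
    if rng.random() < p_prior:
        q = gmm.sample(1)[0][0]
    else:
        q = rng.uniform(*joint_limits, size=7)
    return np.clip(q, *joint_limits)

samples = np.stack([sample_configuration() for _ in range(5)])
print(samples.shape)  # (5, 7)
```

A real planner would feed these samples into RRT/PRM-style tree expansion and would still fall back to uniform sampling to preserve probabilistic completeness.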
|
|
13:25-13:30, Paper ThBT22.2 | |
High-Stiffness Path Planning for 7-DOF Cable-Driven Manipulators in Single and Dual-Arm Configurations |
|
Pang, Shunxiang | University of Science and Technology of China |
Guo, Fan | Deep Space Exploration Laboratory |
Zhang, Bing | National Key Laboratory of Deep Space Exploration |
Huang, Hai | Deep Space Exploration Laboratory |
Pan, Xiaoyang | Deep Space Exploration Laboratory |
Shang, Weiwei | University of Science and Technology of China |
Keywords: Dual Arm Manipulation, Manipulation Planning, Human-Robot Collaboration
Abstract: Low stiffness in 7-DOF cable-driven humanoid manipulators limits their precision, posing a significant challenge in complex human-robot interaction (HRI) scenarios. This paper presents a motion planning framework to enhance manipulator stiffness for both single and dual-arm configurations. For a single arm, we introduce a novel method that integrates dynamic obstacle avoidance with posture optimization to maximize end-effector stiffness. For dual-arm systems, we develop a coupled stiffness model that addresses inter-arm dynamics to improve performance in coordinated tasks. Experimental results on prototypes confirm that the proposed methods significantly reduce end-effector deviation under load, thereby improving the precision and reliability of these manipulators in sophisticated collaborative applications.
|
|
13:30-13:35, Paper ThBT22.3 | |
Evaluating the Pre-Dressing Step: Unfolding Medical Garments Via Imitation Learning |
|
Blanco-Mulero, David | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
Borràs Sol, Júlia | Institut De Robòtica I Informàtica Industrial (CSIC-UPC) |
Torras, Carme | Csic - Upc |
Keywords: Service Robotics, Dual Arm Manipulation
Abstract: Robotic-assisted dressing has the potential to significantly aid both patients and healthcare personnel, reducing the workload and improving the efficiency in clinical settings. While substantial progress has been made in robotic dressing assistance, prior works typically assume that garments are already unfolded and ready for use. However, in medical applications gowns and aprons are often stored in a folded configuration, requiring an additional unfolding step. In this paper, we introduce the pre-dressing step, the process of unfolding garments prior to assisted dressing. We leverage imitation learning to learn three manipulation primitives, including both high- and low-acceleration motions. In addition, we employ a visual classifier to categorise the garment state as closed, partly opened, or fully opened. We conduct an empirical evaluation of the learned manipulation primitives as well as their combinations. Our results show that highly dynamic motions are not effective for unfolding freshly unpacked garments, whereas a combination of motions can effectively enhance the opening configuration.
|
|
13:35-13:40, Paper ThBT22.4 | |
Bimanual Long-Horizon Manipulation Via Temporal-Context Transformer RL |
|
Oh, Ji-Heon | Kyung Hee University |
Espinoza, Ismael | Kyung Hee University |
Jung, Danbi | Kyung Hee University |
Kim, Tae-Seong | Kyung Hee University |
Keywords: Dual Arm Manipulation, Reinforcement Learning, Multifingered Hands
Abstract: Dual-arm robots can perform bimanual long-horizon (LH) manipulation, surpassing the capabilities of single-arm robots. However, bimanual LH tasks are challenging for robot intelligence due to the complexity of long sequence variables and multi-agent interactions. While Multi-Agent Reinforcement Learning (MARL) has shown promising results in agent interactions, these models struggle with sequential LH tasks due to limitations in credit assignment, vanishing memory, and the exploration-exploitation trade-off. This paper introduces a novel dual-arm robot intelligence framework, Temporal-Context Transformer Reinforcement Learning (TC-TRL), which integrates both a hybrid offline-online policy and imitation learning. TC-TRL leverages the attention mechanism to identify relevant temporal-context information from the LH observation space, updating the encoder value function and generating an optimal action sequence using a decoder module, which uses demonstration guidance during online training. TC-TRL is tested on six bimanual tasks, and its performance is compared against five baseline RLs: MAPPO, HAPPO, IPPO, MAT, and DA-MAT. The results show that TC-TRL outperforms the three PPO-based RLs by an average success-rate margin of 63.46%, and outperforms MAT by 42.23% and DA-MAT by 30.91%.
|
|
13:40-13:45, Paper ThBT22.5 | |
Image-Based Visual Servoing for Enhanced Cooperation of Dual-Arm Manipulation |
|
Zhang, Zizhe | University of Pennsylvania |
Yang, Yuan | Northwestern Polytechnical University |
Zuo, Wenqiang | Southeast University |
Song, Guangming | Southeast University |
Song, Aiguo | Southeast University |
Shi, Yang | University of Victoria |
Keywords: Dual Arm Manipulation, Cooperating Robots, Visual Servoing
Abstract: The cooperation of a pair of robot manipulators is required to manipulate a target object without any fixtures. The conventional control methods coordinate the end-effector pose of each manipulator with that of the other using their kinematics and joint coordinate measurements. Yet, the manipulators' inaccurate kinematics and joint coordinate measurements can cause significant pose synchronization errors in practice. This paper thus proposes an image-based visual servoing approach for enhancing the cooperation of a dual-arm manipulation system. On top of the classical control, the visual servoing controller lets each manipulator use its carried camera to measure the image features of the other's marker and adapt its end-effector pose with the counterpart on the move. Because visual measurements are robust to kinematic errors, the proposed control can reduce the end-effector pose synchronization errors and the fluctuations of the interaction forces of the pair of manipulators on the move. Theoretical analyses have rigorously proven the stability of the closed-loop system. Comparative experiments on real robots have substantiated the effectiveness of the proposed control.
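As background for the image-based part of the controller, the textbook point-feature IBVS law (standard formulation, not the paper's dual-arm scheme) maps feature errors to a camera twist via the stacked interaction matrix:

```python
# Classical image-based visual servoing: v = -lambda * pinv(L) * (s - s_star).
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction matrix of a normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def ibvs_velocity(features, desired, depths, lam=0.5):
    L = np.vstack([interaction_matrix(x, y, Z) for (x, y), Z in zip(features, depths)])
    e = (np.asarray(features) - np.asarray(desired)).ravel()
    return -lam * np.linalg.pinv(L) @ e     # 6D camera twist [vx, vy, vz, wx, wy, wz]

s = [(0.10, 0.05), (-0.08, 0.12), (0.02, -0.09), (-0.11, -0.04)]   # current features
s_star = [(0.0, 0.0), (-0.1, 0.1), (0.0, -0.1), (-0.1, 0.0)]        # desired features
print(np.round(ibvs_velocity(s, s_star, depths=[1.0] * 4), 4))
```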
|
|
13:45-13:50, Paper ThBT22.6 | |
A Task-Adaptive Deep Reinforcement Learning Framework for Dual-Arm Robot Manipulation (I) |
|
Cui, Yuanzhe | Tongji University |
Xu, Zhipeng | Tongji University |
Zhong, Lou | Tongji University |
Xu, Pengjie | Tongji University |
Shen, Yichao | Universität Stuttgart |
Tang, Qirong | Tongji University |
Keywords: Dual Arm Manipulation, Reinforcement Learning, Industrial Robots
Abstract: Closed-chain manipulation occurs when several robot arms perform tasks in cooperation. Controlling a dual-arm system is complex because closed-chain manipulation requires flexible and adaptable operation. In this study, a deep reinforcement learning (DRL) framework based on an actor-critic algorithm is proposed to drive the closed-chain manipulation of a dual-arm robotic system. The proposed framework is designed to train dual robot arms to transport a large object cooperatively. In order to sustain the strict constraints of closed-chain manipulation, the actor part of the proposed framework is designed in a leader-follower mode. The leader part consists of a policy trained from the DRL algorithm and works on the leader arm. The follower part consists of an inverse kinematics solver based on Damped Least Squares (DLS) and works on the follower arm. Two experiments are designed to demonstrate task adaptability: one manipulates an object to a random pose within a defined range, and the other manipulates a delicate structural object within a narrow space.
|
|
13:50-13:55, Paper ThBT22.7 | |
Wrench Control of Dual-Arm Robot on Flexible Base with Supporting Contact Surface |
|
Lee, Jeongseob | Seoul National University |
Kong, Doyoon | Seoul National University |
Cha, Hojun | Seoul National University |
Lee, Jeongmin | Seoul National University |
Ryu, Dongseok | Texas A&M University-Corpus Christi |
Shin, Hocheol | Korea Atomic Research Institute |
Lee, Dongjun | Seoul National University |
Keywords: Dual Arm Manipulation, Flexible Robots, Force Control, Supporting Contact
Abstract: We propose a novel high-force/precision interaction control framework for a dual-arm robot system on a flexible base, with one arm holding, or making contact with, a supporting surface, while the other arm can exert an arbitrary wrench in a certain polytope through a desired pose against environments or objects. Our proposed framework can achieve high-force/precision tasks by utilizing the supporting surface just as we humans do, while taking into account various important constraints and the passive compliance of the flexible base. We first design the control as a combination of a nominal control, active stiffness control, and feedback wrench control. We then sequentially perform optimizations of the nominal configuration and the active stiffness control gain. We also design a PI-type feedback wrench control to improve the precision of the control. The key theoretical enabler for our framework is a novel stiffness analysis of the system with flexibility, which, when combined with certain constraints, provides some peculiar relations that can effectively be used to simplify the optimization process and facilitate the feedback wrench control design.
|
|
ThBT23 |
102B |
Force and Tactile Sensing 5 |
Regular Session |
|
13:20-13:25, Paper ThBT23.1 | |
TacCap: A Wearable FBG-Based Tactile Sensor for Seamless Human-To-Robot Skill Transfer |
|
Xing, Chengyi | Stanford University |
Li, Hao | Stanford University |
Wei, Yi-Lin | Sun Yat-Sen University |
Ren, Tianao | Stanford University |
Tu, Tianyu | Stanford University |
Lin, Yuhao | Sun Yat-Sen University |
Schumann, Elizabeth | Stanford University |
Zheng, Wei-Shi | Sun Yat-Sen University |
Cutkosky, Mark | Stanford University |
Keywords: Force and Tactile Sensing, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: Tactile sensing is essential for dexterous manipulation, yet large-scale human demonstration datasets lack tactile feedback, limiting their effectiveness in skill transfer to robots. To address this, we introduce TacCap, a wearable Fiber Bragg Grating (FBG)-based tactile sensor designed for seamless human-to-robot transfer. TacCap is lightweight, durable, and immune to electromagnetic interference, making it ideal for real-world data collection. We detail its design and fabrication, evaluate its sensitivity, repeatability, and cross-sensor consistency, and assess its effectiveness through grasp stability prediction and ablation studies. Our results demonstrate that TacCap enables transferable tactile data collection, bridging the gap between human demonstrations and robotic execution. To support further research and development, we open-source our hardware design and software.
|
|
13:25-13:30, Paper ThBT23.2 | |
UniTac-NV: A Unified Tactile Representation for Non-Vision-Based Tactile Sensors |
|
Hou, Jian | Imperial College London |
Zhou, Xin | Imperial College London |
Yang, Qihan | Imperial College London |
Spiers, Adam | Imperial College London |
Keywords: Force and Tactile Sensing, Representation Learning, Deep Learning Methods
Abstract: Generalizable algorithms for tactile sensing remain underexplored, primarily due to the diversity of sensor modalities. Recently, many methods for cross-sensor transfer between optical (vision-based) tactile sensors have been investigated, yet little work has focused on non-optical tactile sensors. To address this gap, we propose an encoder-decoder architecture to unify tactile data across non-vision-based sensors. By leveraging sensor-specific encoders, the framework creates a latent space that is sensor-agnostic, enabling cross-sensor data transfer with low errors and direct use in downstream applications. We leverage this network to unify tactile data from two commercial tactile sensors: the Xela uSkin uSPa 46 and the Contactile PapillArray. Both were mounted on a UR5e robotic arm, performing force-controlled pressing sequences against distinct object shapes (circular, square, and hexagonal prisms) and two materials (rigid PLA and flexible TPU). Another, more complex, unseen object was also included to investigate the model's generalization capabilities. We show that alignment in latent space can be implicitly learned from joint autoencoder training with matching contacts collected via different sensors. We further demonstrate the practical utility of our approach through contact geometry estimation, where downstream models trained on one sensor's latent representation can be directly applied to another without retraining.
|
|
13:30-13:35, Paper ThBT23.3 | |
Instantaneous Contact Localization on a Magnetically Transduced Tapered Whisker |
|
Dang, Yixuan | Technische Universität München |
Zhang, Yichen | Technical University Munich |
Huang, Yuhong | Technische Universität München |
Wen, Long | Technical University of Munich |
Zhang, Yu | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Roehrbein, Florian | Chemnitz University of Technology |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Force and Tactile Sensing, Biologically-Inspired Robots, Soft Sensors and Actuators
Abstract: The whisker-inspired tactile sensor is advantageous for enhancing robotic perception at proximate range and in darkness via non-intrusive contacts. Localizing contact along the whisker shaft is, however, challenging due to the non-injective mapping between tangential contacts and the resulting bending moments at the whisker base. Previous studies suggest that incorporating axial force measurements can resolve this ambiguity. In this work, we develop a magnetically transduced whisker sensor that integrates axial force sensing as an additional mechanical signal. The sensor features a tapered whisker with a custom slope and a 3-DoF suspension mechanism, enabling axial displacement at the base that is proportional to the applied axial force. To estimate the whisker's motion, we construct a Penalized Gaussian Process model trained on synthetic data and refined with real-data constraints. The design is compact, low-cost, and validated through both simulations and real-world experiments to differentiate tangential contacts. Furthermore, we propose an optimization-based approach for estimating instantaneous contact locations. Experimental results demonstrate that the proposed method can effectively track contacts with millimeter-level accuracy, achieving a mean error of 7.17 mm overall and a higher accuracy of 4.02 mm in large-deflection and close-to-base regions.
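As a rough stand-in for the learned mapping (ordinary GP regression on synthetic base signals; the paper's Penalized Gaussian Process and its real-data constraints are not reproduced here):

```python
# Sketch: regress contact distance along the whisker from two synthetic base signals.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
contact_dist = rng.uniform(20.0, 120.0, size=200)              # mm along the whisker shaft
# Synthetic signals: bending moment falls with distance, axial displacement grows with it.
X = np.stack([1e3 / contact_dist, 0.01 * contact_dist], axis=1)
X += rng.normal(scale=0.05, size=X.shape)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X[:150], contact_dist[:150])
pred, std = gp.predict(X[150:], return_std=True)
print(round(float(np.mean(np.abs(pred - contact_dist[150:]))), 2), "mm mean error")
```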
|
|
13:35-13:40, Paper ThBT23.4 | |
Exploratory Movement Strategies for Texture Discrimination with a Neuromorphic Tactile Sensor |
|
Xu, Xingchen | University of Bristol |
Li, Ao | University of Bristol |
Ward-Cherrier, Benjamin | University of Bristol |
Keywords: Force and Tactile Sensing
Abstract: We propose a neuromorphic tactile sensing framework for robotic texture classification that is inspired by human exploratory strategies. Our system utilizes the NeuroTac sensor to capture neuromorphic tactile data during a series of exploratory motions. We first tested six distinct motions for texture classification in a fixed environment: sliding, rotating, and tapping, as well as the combined motions sliding+rotating, tapping+rotating, and tapping+sliding. We chose sliding and sliding+rotating as the best motions based on final accuracy and the sampling duration needed to reach converged accuracy. In a second experiment designed to simulate complex real-world conditions, these two motions were further evaluated under varying contact depths and speeds. Under these conditions, our framework attained the highest accuracy of 87.33% with sliding+rotating while maintaining an extremely low power consumption of only 8.04 mW. These results suggest that the sliding+rotating motion is the optimal exploratory strategy for deploying neuromorphic tactile sensing in texture classification tasks and holds significant promise for enhancing robotic environmental interaction.
|
|
13:40-13:45, Paper ThBT23.5 | |
High Resolution, Large Area Vision-Based Tactile Sensing Based on a Novel Piezoluminescent Skin |
|
Jiang, Ruxiang | University of Sussex |
Fu, Lanhui | Wuyi University |
Li, Yanan | University of Sussex |
Godaba, Hareesh | University of Southampton |
Keywords: Force and Tactile Sensing, Haptics and Haptic Interfaces, Physical Human-Robot Interaction, Soft Sensors and Actuators
Abstract: The ability to precisely perceive external physical interactions would enable robots to interact effectively with the environment and humans. While vision-based tactile sensing has improved robotic grippers, it is challenging to realize high-resolution vision-based tactile sensing in robot arms due to the presence of curved surfaces, the difficulty of uniform illumination, and the large distance of the sensing area from the cameras. In this paper, we propose a novel piezoluminescent skin that transduces externally applied pressures into changes in light intensity on the other side, which are viewed by a camera for pressure estimation. By engineering elastomer layers with specific optical properties and integrating a flexible electroluminescent panel as a light source, we develop a compact tactile sensing layer that resolves the layout issues on curved surfaces. We achieved multipoint pressure estimation over an expansive area of 502 sq. cm with high spatial resolution, a Two-Point Discrimination distance of 3 mm horizontally and 5 mm vertically, which is comparable to that of human fingers, as well as high localization accuracy (RMSE of 1.92 mm). These promising attributes make this tactile sensing technique suitable for use in robot arms and other applications requiring high-resolution tactile information over a large area.
|
|
13:45-13:50, Paper ThBT23.6 | |
Integrating Human-Like Impedance Regulation and Model-Based Approaches for Compliance Discrimination Via Biomimetic Optical Tactile Sensors |
|
Pagnanelli, Giulia | University of Pisa |
Zinelli, Lucia | University of Pisa |
Lepora, Nathan | University of Bristol |
Catalano, Manuel Giuseppe | Istituto Italiano Di Tecnologia |
Bicchi, Antonio | Fondazione Istituto Italiano Di Tecnologia |
Bianchi, Matteo | University of Pisa |
Keywords: Force and Tactile Sensing, Biomimetics, Perception for Grasping and Manipulation, Soft Sensors and Actuators
Abstract: Endowing robots with advanced tactile abilities based on biomimicry involves designing human-like tactile sensors, computational models, and motor control policies to enhance contact information retrieval. Here we consider compliance discrimination with a soft biomimetic optical tactile sensor (TacTip). In previous work, we proposed a vision-based approach, derived from a computational model of human tactile perception, to discriminate object compliance with the TacTip based on the spread of the contact area over the indenting force. In this work, we first increased the robustness of our vision-based method with a more precise estimation of the initial contact area condition, which enables correct compliance estimation even when the probing direction is not normal to the specimen surface. Then, we integrated within our validated framework the mechanisms of internal muscular regulation (co-contraction) that humans adopt during object compliance probing to maximize information uptake. To this end, we used human co-contraction patterns extracted during object softness probing to control a Variable Stiffness Actuator (emulating the agonistic-antagonistic behavior of human muscles), which actuates the indenter system endowed with the TacTip for object compliance exploration. We found that our model-based approach for compliance discrimination, fed with more precisely estimated initial conditions, improves significantly with the human-inspired impedance regulation compared with using a rigid actuator.
|
|
13:50-13:55, Paper ThBT23.7 | |
NUSense: Shear Based Robust Optical Tactile Sensor |
|
Yergibay, Madina | Nazarbayev University |
Mussin, Tleukhan | Nazarbayev University |
Kenzhebek, Daryn | Nazarbayev University |
Seitzhan, Saltanat | Nazarbayev University |
Umurbekov, Ilyas | Nazarbayev University |
Spanova, Kamila | Nazarbayev University |
Kappassov, Zhanat | Nazarbayev University |
Soh, Harold | National University of Singapore |
Taunyazov, Tasbolat | National University of Singapore |
Keywords: Force and Tactile Sensing
Abstract: While most optical tactile sensors rely on measuring surface displacement, insights from continuum mechanics suggest that measuring shear strain provides key information for tactile sensing. In this work, we introduce an optical tactile sensing principle based on shear strain detection. A silicone rubber layer, dyed with color inks, is used to quantify the shear magnitude of the sensing layer. This principle was validated using the NUSense camera-based tactile sensor. The wide-angle camera captures the elongation of the soft pad under mechanical load, a phenomenon attributed to the Poisson effect. We tested the robustness of the sensor by subjecting the outermost layer to multiple load cycles (8 N) using a ball-head indenter with a 5 mm radius. The physical and optical properties of the inked pad proved essential and remained stable over time, exhibiting only low variance.
|
|
13:55-14:00, Paper ThBT23.8 | |
Compact LED-Based Displacement Sensing for Robot Fingers |
|
El-Azizi, Amr | Columbia University |
Islam, Sharfin | Columbia University |
Piacenza, Pedro | Samsung Research America |
Jiang, Kai | Columbia University |
Kymissis, Ioannis | Columbia University |
Ciocarlie, Matei | Columbia University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Methods and Tools for Robot System Design
Abstract: In this paper, we introduce a sensor designed for integration in robot fingers, where it can provide information on the displacements induced by external contact. Our sensor uses LEDs to sense the displacement between two plates connected by a transparent elastomer; when a force is applied to the finger, the elastomer displaces and the LED signals change. We show that using LEDs as both light emitters and receivers in this context provides high sensitivity, allowing such emitter-receiver pairs to detect very small displacements. We characterize the standalone performance of the sensor by testing the ability of a supervised learning model to predict complete force and torque data from its raw signals, and obtain a mean error between 0.05 and 0.07 N across the three directions of force applied to the finger. Our method allows for finger-size packaging with no amplification electronics, low-cost manufacturing, easy integration into a complete hand, and tolerance of high overload shear forces and bending torques, suggesting future applicability to complete manipulation tasks.
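A hedged sketch of the signal-to-wrench regression idea (synthetic data and a hypothetical 12-channel signal vector; the paper's actual supervised model and signal dimensions may differ):

```python
# Sketch: map raw LED receiver signals to 3-axis contact force with a simple regressor.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
signals = rng.normal(size=(2000, 12))                            # 12 hypothetical LED channels
W = rng.normal(size=(12, 3))
forces = signals @ W + rng.normal(scale=0.05, size=(2000, 3))    # synthetic [Fx, Fy, Fz] labels

Xtr, Xte, ytr, yte = train_test_split(signals, forces, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(Xtr, ytr)
err = np.abs(model.predict(Xte) - yte).mean(axis=0)
print(np.round(err, 3))   # per-axis mean absolute error on held-out data
```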
|
|
ThBT24 |
102C |
Calibration and Identification 2 |
Regular Session |
Chair: Chen, Silu | Ningbo Institute of Materials Technology and Engineering, CAS |
|
13:20-13:25, Paper ThBT24.1 | |
CLAIM: Camera-LiDAR Alignment with Intensity and Monodepth |
|
Zhang, Zhuo | MEGVII Technology |
Liu, Yonghui | Beijing Institute of Technology |
Zhang, Meijie | MEGVII Technology |
Tan, Feiyang | MEGVII Technology |
Ding, Yikang | Megvii |
Keywords: Calibration and Identification, Sensor Fusion, Computer Vision for Automation
Abstract: In this paper, we unleash the potential of powerful monodepth models for camera-LiDAR calibration and propose CLAIM, a novel method for aligning data from the camera and LiDAR. Given an initial guess and pairs of images and LiDAR point clouds, CLAIM utilizes a coarse-to-fine search to find the optimal transformation minimizing a patched Pearson correlation-based structure loss and a mutual information-based texture loss. These two losses serve as good metrics for camera-LiDAR alignment and, unlike most methods, require no complicated data processing, feature extraction, or feature matching steps, rendering our method simple and adaptive to most scenes. We validate CLAIM on the public KITTI, Waymo, and MIAS-LCEC datasets, and the experimental results demonstrate its superior performance compared with state-of-the-art methods. The code is available at https://github.com/Tompson11/claim.
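The patched Pearson correlation structure loss can be sketched as follows (my own illustration, not the CLAIM implementation; the patch size, minimum point count, and validity mask are assumptions):

```python
# Sketch: average (1 - Pearson correlation) between monocular depth and projected LiDAR
# depth over image patches that contain enough LiDAR hits.
import numpy as np

def pearson(a, b, eps=1e-8):
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps))

def patched_correlation_loss(mono_depth, lidar_depth, valid_mask, patch=32, min_pts=20):
    h, w = mono_depth.shape
    losses = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            m = valid_mask[y:y + patch, x:x + patch]
            if m.sum() < min_pts:
                continue
            r = pearson(mono_depth[y:y + patch, x:x + patch][m],
                        lidar_depth[y:y + patch, x:x + patch][m])
            losses.append(1.0 - r)
    return float(np.mean(losses)) if losses else 0.0

# Toy usage: well-correlated depths give a small loss.
rng = np.random.default_rng(1)
mono = rng.random((128, 128))
lidar = mono * 5.0 + rng.normal(0.0, 0.01, (128, 128))
mask = rng.random((128, 128)) > 0.7                      # sparse LiDAR projection mask
print(patched_correlation_loss(mono, lidar, mask))
```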
|
|
13:25-13:30, Paper ThBT24.2 | |
Sensor-Free Self-Calibration for Collaborative Robots Using Tri-Sphere End-Effector Toward High Orientation Accuracy |
|
He, Jianhui | Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences |
Yang, Guilin | Ningbo Institute of Material Technology and Engineering, Chinese Academy of Sciences |
Feng, Yiyang | Ningbo Institute of Material Technology & Engineering, CAS |
Luo, Jingbo | Ningbo Institute of Materials Technology and Engineering, CAS |
Chen, Silu | Ningbo Institute of Materials Technology and Engineering, CAS |
Shen, Wenjun | Ningbo Institute of Material Technology and Engineering, Chinese Academy of Sciences |
Keywords: Calibration and Identification, Industrial Robots
Abstract: Collaborative robots often exhibit limited absolute accuracy despite high repeatability, necessitating cost-effective calibration solutions. This paper presents a novel sensor-free self-calibration method for collaborative robots using position and distance constraints. A tri-sphere end-effector with precision balls and magnetic holders enables repeatable Tool Center Point (TCP) positioning (< 0.01 mm) through hand-guiding, where the three-sphere configuration crucially enhances the orientation calibration accuracy compared to a single-sphere approach. The proposed device eliminates expensive external sensors while establishing geometric constraints through workspace-wide TCP engagements. By analyzing relative position/distance errors between multiple configurations, the method identifies kinematic parameters via a Local Product of Exponentials (Local POE) based error model. Experiments demonstrated a 91.7% position error reduction (7.98 mm to 0.66 mm) and a 69.6% orientation improvement (0.0069 rad to 0.0021 rad), achieving accuracy comparable to laser-tracker methods at less than 1% of the device cost. This approach offers a low-cost, mechanically robust solution for enhancing collaborative robot accuracy in industrial applications.
|
|
13:30-13:35, Paper ThBT24.3 | |
ARC-Calib: Autonomous Markerless Camera-To-Robot Calibration Via Exploratory Robot Motions |
|
Chanrungmaneekul, Podshara | Rice University |
Chen, Yiting | Rice University |
Grace, Joshua | Yale University |
Dollar, Aaron | Yale University |
Hang, Kaiyu | Rice University |
Keywords: Calibration and Identification, Perception for Grasping and Manipulation
Abstract: Camera-to-robot (also known as eye-to-hand) calibration is a critical component of vision-based robot manipulation. Traditional marker-based methods often require human intervention for system setup. Furthermore, existing autonomous markerless calibration methods typically rely on pre-trained robot tracking models that impede their application on edge devices and require fine-tuning for novel robot embodiments. To address these limitations, this paper proposes a model-based markerless camera-to-robot calibration framework, ARC-Calib, that is fully autonomous and generalizable across diverse robots and scenarios without requiring extensive data collection or learning. First, exploratory robot motions are introduced to generate easily trackable trajectory-based visual patterns in the camera's image frames. Then, a geometric optimization framework is proposed to exploit the coplanarity and collinearity constraints from the observed motions to iteratively refine the estimated calibration result. Our approach eliminates the need for extra effort in either environmental marker setup or data collection and model training, rendering it highly adaptable across a wide range of real-world autonomous systems. Extensive experiments are conducted in both simulation and the real world to validate its robustness and generalizability.
|
|
13:35-13:40, Paper ThBT24.4 | |
Enhanced Zero-Bias Correction for Fiber Optic Gyroscope Using an Improved Artificial Bee Colony Algorithm |
|
Zhao, Jinyue | Nankai University |
He, Kunpeng | Nankai University |
Kang, Le | Nankai University |
Huang, Sixu | Nankai University |
Keywords: Calibration and Identification
Abstract: An improved artificial bee colony (IABC) algorithm is presented for bias correction of fiber optic gyroscopes in inertial navigation systems. Its effectiveness is verified through an eight-orientation heading test conducted across two independent trials. In the absence of correction, heading deviations are evident, reaching approximately 0.056° and 0.079°, which reflect substantial bias accumulation. Conventional methods offer limited improvement and struggle to maintain consistency, yielding deviations near 0.06° and 0.041°. By contrast, the IABC algorithm reduces these deviations to below 0.046° and around 0.038°, respectively. The results confirm that the proposed approach provides more reliable compensation and enhances heading accuracy across diverse orientation scenarios.
|
|
13:40-13:45, Paper ThBT24.5 | |
Kalib: Easy Hand-Eye Calibration with Reference Point Tracking |
|
Tang, Tutian | Shanghai Jiao Tong University |
Liu, MingHao | Shanghai JiaoTong University |
Xu, Wenqiang | Shanghai Jiaotong University |
Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Calibration and Identification, Visual Tracking, Deep Learning Methods
Abstract: Hand-eye calibration aims to estimate the transformation between a camera and a robot. Traditional methods rely on fiducial markers, which require considerable manual effort and precise setup. Recent advances in deep learning have introduced markerless techniques but come with more prerequisites, such as retraining networks for each robot and accessing accurate mesh models for data generation. In this paper, we propose Kalib, an automatic and easy-to-setup hand-eye calibration method that leverages the generalizability of visual foundation models to overcome these challenges. It features only two basic prerequisites: the robot's kinematic chain and a predefined reference point on the robot. During calibration, the reference point is tracked in the camera space. Its corresponding 3D coordinates in the robot frame can be inferred by forward kinematics. Then, a PnP solver directly estimates the transformation between the camera and the robot without training new networks or accessing mesh models. Evaluations in simulated and real-world benchmarks show that Kalib achieves good accuracy with a lower manual workload compared with recent baseline methods. We also demonstrate its application in multiple real-world settings with various robot arms and grippers. Kalib's user-friendly design and minimal setup requirements make it a possible solution for continuous operation in unstructured environments.
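The core estimation step can be sketched with OpenCV's PnP solver (synthetic correspondences; the intrinsics, noise level, and point count are assumptions, and the real method tracks the reference point rather than synthesizing it):

```python
# Sketch: recover the camera-to-robot transform from FK-derived 3D points and their
# tracked image projections using solvePnP.
import numpy as np
import cv2

rng = np.random.default_rng(0)
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])  # assumed intrinsics

# Ground-truth camera pose, used here only to synthesize image observations.
rvec_gt = np.array([[0.1], [-0.2], [0.05]])
tvec_gt = np.array([[0.1], [0.0], [1.5]])

obj_pts = rng.uniform(-0.4, 0.4, size=(20, 3))           # reference-point positions from forward kinematics
img_pts, _ = cv2.projectPoints(obj_pts, rvec_gt, tvec_gt, K, None)
img_pts = img_pts.reshape(-1, 2) + rng.normal(0.0, 0.3, (20, 2))  # simulated tracking noise

ok, rvec, tvec = cv2.solvePnP(obj_pts.astype(np.float32), img_pts.astype(np.float32),
                              K, None, flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)                                # rotation part of the estimate
print(ok, np.round(tvec.ravel(), 3))                      # translation close to the ground truth
```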
|
|
13:45-13:50, Paper ThBT24.6 | |
Robust Online Calibration for UWB-Aided Visual-Inertial Navigation with Bias Correction |
|
Zhou, Yizhi | George Mason University |
Xu, Jie | University of California, Riverside |
Xia, Jiawei | Beijing University of Chemical Technology |
Hu, Zechen | George Mason University |
Li, Weizi | University of Tennessee, Knoxville |
Wang, Xuan | George Mason University |
Keywords: Calibration and Identification, Localization, Visual-Inertial SLAM
Abstract: This paper presents a novel robust online calibration framework for Ultra-Wideband (UWB) anchors in UWB-aided Visual-Inertial Navigation Systems (VINS). Accurate anchor positioning, a process known as calibration, is crucial for integrating UWB ranging measurements into state estimation. While several prior works have demonstrated satisfactory results by using robot-aided systems to autonomously calibrate UWB systems, there are still some limitations: 1) these approaches assume accurate robot localization during the initialization step, ignoring localization errors that can compromise calibration robustness, and 2) the calibration results are highly sensitive to the initial guess of the UWB anchors' positions, reducing the practical applicability of these methods in real-world scenarios. Our approach addresses these challenges by incorporating the impact of robot localization errors into the calibration process through a stochastic optimization (SO) formulation, ensuring robust initialization. To further enhance the robustness of the calibration results against initialization errors, we propose a tightly-coupled Schmidt Kalman Filter (SKF)-based online refinement method, making the system suitable for practical applications. Simulations and real-world experiments validate the improved accuracy and robustness of our approach.
|
|
13:50-13:55, Paper ThBT24.7 | |
GLIC-Calib: Targetless and One-Shot Spatial-Temporal Calibration of LiDAR-IMU-Camera for Ground Vehicles |
|
Zhao, Hang | Aerospace Information Research Institute, Chinese Academy of Sciences |
Ji, Xinchun | Aerospace Information Research Institute, Chinese Academy of Sciences |
Wei, Dongyan | Aerospace Information Research Institute, Chinese Academy of Sciences |
Keywords: Calibration and Identification, Sensor Fusion, Autonomous Vehicle Navigation
Abstract: Accurate spatial-temporal parameters of the LiDAR, IMU, and camera, including extrinsic parameters and time offsets, are key to ensuring multi-sensor fusion performance for ground vehicles. Compared with target-based calibration methods, targetless methods do not need artificial targets and are therefore more convenient and flexible. However, most existing targetless calibration methods cannot be applied to ground vehicles with a LiDAR-IMU-camera combination. To address this issue, in this paper we propose GLIC-Calib: a carefully designed targetless and one-shot spatial-temporal calibration method of LiDAR-IMU-camera for ground vehicles. First, we recover the real-time camera scale from the constant camera height and visual ground points. Then we initialize the 6-DoF extrinsics of LiDAR-IMU and camera-IMU as well as their time offsets from raw motion measurements and ground constraints. Finally, we refine the spatial-temporal parameters of LiDAR-IMU and camera-IMU using the highly accurate LiDAR-camera extrinsic parameters obtained from environmental association. Experiments conducted on self-collected datasets with different sensor configurations show the effectiveness, efficiency, and robustness of the proposed method compared with others. We open-source our method on GitHub as a contribution to the community.
|
|
ThBT25 |
103A |
Legged Robots 6 |
Regular Session |
|
13:20-13:25, Paper ThBT25.1 | |
Minimizing Acoustic Noise: Enhancing Quiet Locomotion for Quadruped Robots in Indoor Applications |
|
Cao, Zhanxiang | Shanghai Jiao Tong University |
Nie, Buqing | Shanghai Jiao Tong University |
Zhang, Yang | Shanghai Jiao Tong University |
Gao, Yue | Shanghai JiaoTong University |
Keywords: Legged Robots, Human-Centered Robotics
Abstract: Recent advancements in quadruped robot research have significantly improved their ability to traverse complex and unstructured outdoor environments. However, the noise generated during locomotion is generally overlooked, even though it is critically important in noise-sensitive indoor environments, such as service and healthcare settings, where maintaining low noise levels is essential. This study aims to reduce the acoustic noise generated by quadruped robots during locomotion through the development of advanced motion control algorithms. To achieve this, we propose a novel approach that minimizes noise emissions by integrating optimized gait design with tailored control strategies. This method achieves an average noise reduction of approximately 8 dBA during movement, thereby enhancing the suitability of quadruped robots for deployment in noise-sensitive indoor environments. Experimental results demonstrate the effectiveness of this approach across various indoor settings, highlighting the potential of quadruped robots for quiet operation in noise-sensitive environments.
|
|
13:25-13:30, Paper ThBT25.2 | |
MUSE: A Real-Time Multi-Sensor State Estimator for Quadruped Robots |
|
Nistico, Ylenia | IIT |
Soares, João Carlos Virgolino | Istituto Italiano Di Tecnologia |
Amatucci, Lorenzo | Istituto Italiano Di Tecnologia |
Fink, Geoff | Thompson Rivers University |
Semini, Claudio | Istituto Italiano Di Tecnologia |
Keywords: Legged Robots, Sensor Fusion, Localization
Abstract: This letter introduces an innovative state estimator, MUSE (MUlti-sensor State Estimator), designed to enhance the accuracy and real-time performance of state estimation for quadruped robot navigation. The proposed state estimator builds upon our previous work presented in [1]. It integrates data from a range of onboard sensors, including IMUs, encoders, cameras, and LiDARs, to deliver a comprehensive and reliable estimate of the robot's pose and motion, even in slippery scenarios. We tested MUSE on a Unitree Aliengo robot, successfully closing the locomotion control loop in difficult scenarios, including slippery and uneven terrain. Benchmarking against Pronto [2] and VILENS [3] showed 67.6% and 26.7% reductions in translational errors, respectively. Additionally, MUSE outperformed DLIO [4], a LiDAR-inertial odometry system, in rotational errors and frequency, while the proprioceptive version of MUSE (P-MUSE) outperformed TSIF [5], with a 45.9% reduction in absolute trajectory error (ATE).
|
|
13:30-13:35, Paper ThBT25.3 | |
Design and Control of SKATER: A Wheeled-Bipedal Robot with High-Speed Turning Robustness and Terrain Adaptability (I) |
|
Wang, Yu | Shandong University |
Chen, Teng | Shandong University |
Rong, Xuewen | Shandong University |
Zhang, Guoteng | Shandong University |
Li, Yibin | Shandong University |
Xin, Yaxian | Shandong University |
Keywords: Legged Robots, Whole-Body Motion Planning and Control, Mechanism Design
Abstract: This article presents the design, control, and implementation of a novel wheeled-bipedal robot: SKATER. The design of the wheeled-leg structure and joint actuators is introduced and a hardware control architecture is developed for state perception and servo control of the robot. A distributed dynamic modeling strategy is employed to reveal the force transfer relationship between the torso and the wheeled-leg system. Furthermore, a hierarchical control framework based on model predictive control (MPC) is proposed, with the incorporation of centrifugal force compensation (CFC) and terrain adaptation control strategies into the whole-body controller to enhance the high-speed turning robustness and terrain adaptability of the robot. The robustness to disturbance and high-speed small-radius turning are verified by anti-disturbance and centrifugal force compensation turning experiments. In addition, the robot exhibits active compliance and adaptability when facing unstructured terrains, as evidenced by the performance in continuous step-down and down stairs with single-leg experiments.
|
|
13:35-13:40, Paper ThBT25.4 | |
RoboDuet: Learning a Cooperative Policy for Whole-Body Legged Loco-Manipulation |
|
Pan, Guoping | Tsinghua University |
Ben, Qingwei | Tsinghua University |
Yuan, Zhecheng | Tsinghua University |
Jiang, Guangqi | Sichuan University |
Ji, Yandong | UCSD |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Pang, Jiangmiao | Shanghai AI Laboratory |
Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Xu, Huazhe | Tsinghua University |
Keywords: Legged Robots, Whole-Body Motion Planning and Control, Mobile Manipulation
Abstract: Fully leveraging the loco-manipulation capabilities of a quadruped robot equipped with a robotic arm is non-trivial, as it requires controlling all degrees of freedom (DoFs) of the quadruped robot to achieve effective whole-body coordination. In this letter, we propose a novel framework RoboDuet, which employs two collaborative policies to realize locomotion and manipulation simultaneously, achieving whole-body control through mutual interactions. Beyond enabling large-range 6D pose tracking for manipulation, we find that the two-policy framework supports zero-shot transfer across quadruped robots with similar morphology and physical dimensions in the real world. Our experiments demonstrate that RoboDuet achieves a 23% improvement in success rate over the baseline in challenging loco-manipulation tasks employing whole-body control. To support further research, we provide open-source code and additional videos on our website: https://locomanip-duet.github.io/.
|
|
13:40-13:45, Paper ThBT25.5 | |
GHO-WBC: A Gradient-Based Hierarchical Kinematic Optimization Approach to Enhance the Reachability of a Humanoid Robot |
|
Zhu, Weiliang | Shandong University |
Zhang, Guoteng | Shandong University |
Qiao, Lichao | School of Control Science and Engineering |
Ge, Ligang | Ubtech Robotics Corp |
Li, Yibin | Shandong University |
Keywords: Legged Robots, Whole-Body Motion Planning and Control
Abstract: Humanoid robots are vital tools for substituting humans in various operational scenarios. A sufficiently large stationary reachability is a key factor in ensuring their operational capability. To address this challenge, this paper proposes a whole-body reachability enhancing approach for humanoid robots based on gradient optimization, referred to as Gradient-based Hierarchical Optimization Whole-Body Control (GHO-WBC). The goal of the proposed approach is to extend the end-effector reachability of the humanoid robot while maintaining its stationary state. The proposed approach first derives the gradient of the robot’s whole-body center of mass (CoM) position, ensuring stationary stability across extreme reachable ranges. Next, the gradient of the key joint segment singularity is derived to achieve the stability of the humanoid robot's end effector at extreme operational distances. Finally, a multi-level optimization approach is employed to compute a feasible solution for the whole-body joint kinematics, and experimental validation is conducted on the humanoid robot. Compared to the conventional whole-body control optimization approach, the present approach improves the reachable range by more than 89%.
|
|
13:45-13:50, Paper ThBT25.6 | |
A Unified Framework to Learn Collision-Free Loco-Manipulation Via Adversarial Motion Priors |
|
Yin, Huayang | University of Science and Technology of China |
Qian, Tangyu | University of Science and Technology of China |
Li, Mingrui | University of Science & Technology of China |
Lu, Guanchen | University of Science and Technology of China |
Mingyu, Cai | University of California Riverside |
Kan, Zhen | University of Science and Technology of China |
Keywords: Legged Robots, Whole-Body Motion Planning and Control
Abstract: Designing a whole-body controller for loco-manipulation in unstructured real-world environments remains a formidable challenge. Previous approaches have primarily focused on extending the workspace of robotic arms while maintaining quadrupedal landing postures. However, these methods fail to fully exploit the mobility of legged robots. To address these limitations, we propose a unified framework for collision-free loco-manipulation in real-world applications. The framework comprises two key modules: (1) a Loco-manipulation Motion Prior, which generates loco-manipulation skill trajectories via Trajectory Optimization (TO), and (2) a Collision-free Manipulation module using a Model Predictive Path Integral (MPPI)-based trajectory generator and a vector-based trajectory follower. Extensive experiments have been conducted in both simulation and real-world scenarios to evaluate our framework’s tracking accuracy, whole-body coordination, and workspace expansion capabilities. Supplementary and videos are available at: https://sites.google.com/view/loco-mani-amp
|
|
13:50-13:55, Paper ThBT25.7 | |
Reduced-Dimensional Whole-Body Control Based on Model Simplification for Bipedal Robots with Parallel Mechanisms |
|
Liang, Yunpeng | Shanghai Jiao Tong University |
Yin, Fulong | Agibot Technology Co. Ltd |
Li, Zhen | Agibot |
Xiong, Zhilin | AgiBot Technology Co. Ltd |
Peng, Zhihui | Huawei Research |
Zhao, Yanzheng | Robotics Research Institute, Shanghai Jiaotong University |
Yan, Weixin | Shanghai Jiaotong University |
Keywords: Legged Robots, Whole-Body Motion Planning and Control
Abstract: The presence of parallel mechanisms in bipedal robots increases the complexity of modeling and control, making it crucial to manage the trade-off between model accuracy and real-time control. In this letter, we propose a reduced-dimensional whole-body controller for series-parallel bipedal robots, utilizing a floating-base multi-rigid body model with kinematic loops. Notably, we neglect the joint acceleration and closed-loop acceleration constraints of the parallel mechanisms, reducing the dimensionality of variables and constraints in the whole-body optimization problem while ensuring compliance with actuated joint torque limits. Quantitative experiments indicate that, compared to the complete series-parallel model, the impact of inertial forces resulting from the parallel joint acceleration is negligible. Additionally, physical locomotion and disturbance tests demonstrate that our proposed controller can enhance computational efficiency by over 20%, with comparable locomotion performance and disturbance rejection ability.
|
|
ThBT26 |
103B |
Cooperating Robots |
Regular Session |
|
13:20-13:25, Paper ThBT26.1 | |
Robust Wrench-Feasible Control for Multiple UAVs Aerial Transportation System with Adaptive Cable Configuration |
|
Yang, Yu | Xiamen University |
Zhao, Shuaiping | Xiamen University |
Zhang, YuChen | Xiamen University |
Wu, Liaoni | Xiamen University |
Keywords: Cooperating Robots, Aerial Systems: Mechanics and Control
Abstract: Due to the bounded thrust, motion acceleration, and external disturbances inherent in quadrotor UAVs, traditional hierarchical control methods for multiple UAVs aerial transportation systems (MUATS) with cable-suspended payloads often struggle to guarantee dynamic performance and payload wrench feasibility. To address these challenges, this paper proposes a robust wrench-feasible control framework. First, we design a payload controller based on an extended state observer to estimate and compensate for the total disturbance acting on the payload. Next, a piecewise wrench adjustment strategy (PWAS) is proposed to ensure that the payload wrench balances feasibility and path-tracking ability. Finally, we propose an adaptive cable configuration strategy (ACCS) inspired by capacity margin. When external disturbances approach the system's wrench capacity, ACCS can dynamically adjust cable configuration to maximize the capacity margin. Experimental results demonstrate the effectiveness and superiority of the proposed method.
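For readers unfamiliar with extended state observers, a minimal linear ESO on a toy second-order plant (bandwidth-parameterized gains are assumptions; this is not the paper's payload model) shows how a lumped disturbance estimate is obtained from position measurements alone:

```python
# Sketch: linear extended state observer estimating state and total disturbance.
import numpy as np

dt, wo = 0.01, 20.0                       # sample time, observer bandwidth (assumed)
l1, l2, l3 = 3 * wo, 3 * wo ** 2, wo ** 3  # bandwidth-parameterized observer gains
b0 = 1.0                                   # nominal input gain
z = np.zeros(3)                            # estimates of [position, velocity, total disturbance]

def eso_step(z, y, u):
    e = y - z[0]
    z_dot = np.array([
        z[1] + l1 * e,
        z[2] + b0 * u + l2 * e,
        l3 * e,
    ])
    return z + dt * z_dot

# Simulate a plant y'' = u + d with an unknown constant disturbance d.
x = np.zeros(2)
d = -2.0
for k in range(2000):
    u = 0.5
    x = x + dt * np.array([x[1], u + d])
    z = eso_step(z, x[0], u)
print(round(float(z[2]), 2))   # should approach the true disturbance, about -2.0
```

In a disturbance-rejection controller, the estimate z[2] would be fed back to cancel the lumped disturbance before tracking the desired payload trajectory.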
|
|
13:25-13:30, Paper ThBT26.2 | |
Decentralized Model-Free Monitoring of Multi-UAV-Multi-USV Systems Using Sparse Data and Bayesian Learning |
|
Huang, Jiajie | Huazhong University of Science and Technology |
Zheng, Yaozhong | Huazhong Univeristy of Science and Technology |
Hu, Binbin | University of Groningen |
Ding, Jianing | Huazhong University of Science and Technology |
Zhang, Hai-Tao | Huazhong University of Science AndTechnology |
Keywords: Cooperating Robots, Aerial Systems: Perception and Autonomy
Abstract: Although significant progress has been made in coordinating multi-unmanned surface vehicle (multi-USV or USVs) systems over the past decades, consistent monitoring (or tracking) of such systems remains challenging because they do not share data with the monitoring system, a difficulty further exacerbated by inaccurate and sparse observations. To tackle this issue, we introduce a multi-unmanned aerial vehicle (multi-UAV or UAVs) system to monitor the multi-USV system. By introducing a sparse-Bayesian-learning-based (SBL-based) algorithm, the multi-UAV system can identify the potential coordinated dynamics of the multi-USV system from only noisy and limited data. Then, by employing a Kalman filter (KF), the proposed approach predicts and updates real-time estimates, optimizes trajectory estimation for the USVs, and enhances coordination control in the multi-UAV system to achieve coordinated monitoring. Finally, comparative simulations against a traditional control method, conducted under varying noise levels and data availability ratios, demonstrate the effectiveness and superiority of the proposed method.
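The prediction/update machinery can be illustrated with a plain constant-velocity Kalman filter (my simplification; the paper identifies the USV dynamics with sparse Bayesian learning rather than assuming this model):

```python
# Sketch: Kalman filter tracking a USV's planar position and velocity from noisy fixes.
import numpy as np

dt = 0.1
F = np.block([[np.eye(2), dt * np.eye(2)], [np.zeros((2, 2)), np.eye(2)]])  # state transition
H = np.hstack([np.eye(2), np.zeros((2, 2))])                                # observe position only
Q = 0.01 * np.eye(4)
R = 0.25 * np.eye(2)

x = np.zeros(4)            # [px, py, vx, vy]
P = np.eye(4)
rng = np.random.default_rng(0)

def kf_step(x, P, z):
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the position measurement z
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

for k in range(50):
    z = np.array([0.5 * k * dt, 0.2 * k * dt]) + rng.normal(0.0, 0.5, 2)
    x, P = kf_step(x, P, z)
print(np.round(x, 2))      # estimated position and velocity after 5 seconds
```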
|
|
13:30-13:35, Paper ThBT26.3 | |
A Cooperative Contactless Object Transport with Acoustic Robots |
|
Kemsaram, Narsimlu | University College London (UCL) |
Delibasi, Akin | University College London |
Hardwick, James | University College London |
Gautam, Bonot | University College London |
Martinez Plasencia, Diego | University College London |
Subramanian, Sriram | University College London |
Keywords: Cooperating Robots, Biologically-Inspired Robots, Swarm Robotics
Abstract: Cooperative transport, the simultaneous movement of an object by multiple agents, has been widely observed in biological systems such as ant colonies, which improve efficiency and adaptability in dynamic environments. Inspired by these natural phenomena, we present a novel acoustic robotic system for the contactless transport of objects in mid-air. Our system utilizes onboard phased ultrasonic transducers and a robotic control system to generate localized acoustic pressure fields, enabling the precise manipulation of airborne particles and robots. We categorize contactless object-transport strategies into independent transport (uncoordinated) and forward-facing cooperative transport (coordinated), drawing parallels with biological systems to optimize efficiency and robustness. The proposed system is experimentally validated by evaluating levitation stability using a microphone in the measurement lab, transport efficiency through a phase-space motion capture system, and clock synchronization accuracy using an oscilloscope. The results demonstrate the feasibility of both independent and cooperative airborne object transport. This research contributes to the field of acoustophoretic robotics, with potential applications in contactless material handling, microassembly, and biomedical applications.
|
|
13:35-13:40, Paper ThBT26.4 | |
Experimental Evaluation of Radio-Aware Semantic Map with 5G-Enabled Mobile Robots |
|
Lendínez Ibáñez, Adrián | University of Bedfordshire |
Zanzi, Lanfranco | NEC Laboratories Europe |
Xi, Li | NEC Laboratories Europe |
Moreno Olivares, Sandra | Robotnik Automation |
Gari, Guillem | Robotnik Automation |
Lessi, Christina | Hellenic Telecommunications Organization S.A. (OTE) |
Guroma, Vladimir | 5G-Era, University of Bedfordshire |
Qiu, Renxi | University of Bedfordshire |
Costa-Perez, Xavier | NEC Laboratories Europe |
Keywords: Networked Robots, Wheeled Robots, Software Tools for Robot Programming
Abstract: With the rapid development of 5G technology and the increasing demand for autonomous mobile robots, there is a trend to leverage the ultra-low latency, high data rates, and reliable wireless connectivity offered by 5G to improve the perception and navigation of robots in unknown environments. This paper presents a novel approach for creating and exploiting radio-aware semantic maps to empower 5G-enabled mobile robots operating within an unknown environment. The proposed solution allows for smart offloading of robotic applications and task processing onto the edge systems while facilitating real-time data exchange, and enables robots to gather environment data from both onboard sensors and the mobile network for more efficient robot operation and resource orchestration decisions. A radio-aware semantic mapping framework is introduced, which combines radio signal quality information with semantic mapping techniques to create a comprehensive understanding of the environment, which may evolve over time. The semantic map, enriched with radio quality measurement data, enables mobile robots to make timely informed decisions by considering real-time radio quality variations. Our experimental evaluation demonstrates the effectiveness of adopting radio semantic maps to enhance real-time robot operations on navigation and task offloading in unstructured environments.
|
|
13:40-13:45, Paper ThBT26.5 | |
UAV Chain Network Creation in Cluttered Environments with Flocking Rules and Routing Data |
|
Balaguer, Théotime | INSA Lyon |
Simonin, Olivier | INSA De Lyon |
Guerin Lassous, Isabelle | UMR 5668, ENS Lyon - CNRS - Inria - UCB Lyon 1 |
Fantoni, Isabelle | CNRS |
Keywords: Networked Robots, Multi-Robot Systems, Autonomous Agents
Abstract: This paper introduces a novel distributed approach for forming UAV-based multi-hop relay networks by adapting traditional flocking models to create relay chains between remote points. Our method modifies the standard flocking paradigm by incorporating dynamic agent roles, allowing UAVs to self-organize based solely on local state and neighbor information, and integrates networking information such as routing decisions directly into mobility control. A side contribution is the introduction of a Line-of-Sight (LOS) conservation force, which mitigates communication failures due to obstacles and is easily adaptable to the flocking model. The proposed algorithm is evaluated using a joint robotics and network co-simulator that combines realistic multi-rotor physics with ns-3-based network simulations. Simulation results across diverse environments and varying agent densities demonstrate that our approach effectively maintains connectivity, enhances Quality of Service (QoS), and scales robustly, thereby bridging the gap between robotic control and aerial wireless network design.
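A minimal sketch of a line-of-sight conservation force (my own geometric construction with assumed gains, not the paper's formulation): the link between two relay UAVs is pushed away from an obstacle whenever the segment between them passes too close to it:

```python
# Sketch: repulsive force on an agent when its communication link grazes a circular obstacle.
import numpy as np

def closest_point_on_segment(p, a, b):
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / (np.dot(ab, ab) + 1e-9), 0.0, 1.0)
    return a + t * ab

def los_force(pos_i, pos_j, obstacle_center, obstacle_radius, gain=2.0, margin=0.5):
    """Force on agent i that keeps the i-j link clear of the obstacle."""
    c = closest_point_on_segment(obstacle_center, pos_i, pos_j)
    d = np.linalg.norm(c - obstacle_center)
    clearance = d - obstacle_radius
    if clearance >= margin:
        return np.zeros(2)
    direction = (c - obstacle_center) / (d + 1e-9)      # push the link away from the obstacle
    return gain * (margin - clearance) * direction

f = los_force(np.array([0.0, 0.0]), np.array([10.0, 0.0]),
              np.array([5.0, 0.4]), obstacle_radius=0.3)
print(f)   # nonzero y-component: the link grazes the obstacle
```

In a full controller this term would simply be summed with the usual cohesion, separation, and alignment forces.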
|
|
13:45-13:50, Paper ThBT26.6 | |
A Highly Maneuverable Flying Squirrel Drone with Agility-Improving Foldable Wings |
|
Lee, Dohyeon | Pohang University of Science and Technology (POSTECH) |
Kang, Jungill | Pohang University of Science and Technology (POSTECH) |
Han, Soohee | Pohang University of Science and Technology ( POSTECH ) |
Keywords: Biomimetics, Aerial Systems: Mechanics and Control, AI-Based Methods
Abstract: Drones, like most airborne aerial vehicles, face inherent disadvantages in achieving agile flight due to their limited thrust capabilities. These physical constraints cannot be fully addressed through advancements in control algorithms alone. Drawing inspiration from the winged flying squirrel, this paper proposes a highly maneuverable drone equipped with agility-enhancing foldable wings. The additional air resistance generated by appropriately deploying these wings significantly improves the tracking performance of the proposed "flying squirrel" drone. By leveraging collaborative control between the conventional propeller system and the foldable wings—coordinated through the Thrust-Wing Coordination Control (TWCC) framework—the controllable acceleration set is expanded, enabling the generation of abrupt vertical forces that are unachievable with traditional wingless drones. The complex aerodynamics of the foldable wings are modeled using a physics-assisted recurrent neural network (paRNN), which calibrates the angle of attack (AOA) to align with the real aerodynamic behavior of the wings. The model is trained on real flight data and incorporates flat-plate aerodynamic principles. Experimental results demonstrate that the proposed flying squirrel drone achieves a 13.1% improvement in tracking performance, as measured by root mean square error (RMSE), compared to a conventional wingless drone. A demonstration video is available on YouTube: https://youtu.be/NuuPjoJPUsE.
|
|
ThBT27 |
103C |
Energy and Environment-Aware Automation 2 |
Regular Session |
|
13:20-13:25, Paper ThBT27.1 | |
DEEP-SEA: Deep-Learning Enhancement for Environmental Perception in Submerged Aquatics |
|
Chen, Shuang | Durham University |
Thenius, Ronald | Karl Franzens University |
Arvin, Farshad | Durham University |
Atapour-Abarghouei, Amir | Durham University |
Keywords: Environment Monitoring and Management, Deep Learning for Visual Perception
Abstract: Continuous and reliable underwater monitoring is essential for assessing marine biodiversity, detecting ecological changes and supporting autonomous exploration in aquatic environments. Underwater monitoring platforms rely mainly on visual data for marine biodiversity analysis, ecological assessment and autonomous exploration. However, underwater environments present significant challenges due to light scattering, absorption and turbidity, which degrade image clarity and distort colour information, making accurate observation difficult. To address these challenges, we propose DEEP-SEA, a novel deep learning-based underwater image restoration model that enhances both low- and high-frequency information while preserving spatial structures. The proposed Dual-Frequency Enhanced Self-Attention Spatial and Frequency Modulator adaptively refines feature representations in the frequency domain while simultaneously preserving spatial information for better structural fidelity. Our comprehensive experiments on the EUVP and LSUI datasets demonstrate its superiority over the state of the art in restoring fine-grained image detail and structural consistency. By effectively mitigating underwater visual degradation, DEEP-SEA has the potential to improve the reliability of underwater monitoring platforms for more accurate ecological observation, species identification and autonomous navigation.
|
|
13:25-13:30, Paper ThBT27.2 | |
Physics-Based Gas Mapping with Nano Aerial Vehicles: The ADApprox Algorithm |
|
Bösel-Schmid, Nicolaj | EPFL |
Jin, Wanting | EPFL |
Martinoli, Alcherio | EPFL |
Keywords: Environment Monitoring and Management, Mapping, Aerial Systems: Applications
Abstract: Gas emissions play a crucial role in many environmental and industrial processes, driving a growing effort to understand their dispersion in air. Nonetheless, gas distribution mapping is inherently challenging due to the complex interplay between gas diffusion and wind flows. Mobile robots provide a compelling alternative to static sensor networks for gas sensing, offering greater mobility and minimizing the need to permanently deploy assets in the environment. However, robotic platforms typically collect only sparse measurements due to constraints such as limited battery life, and state-of-the-art methods often fail to accurately interpolate between scattered data. To address this limitation, we introduce ADApprox, a novel gas mapping algorithm. By leveraging the underlying physics that governs gas dispersion, ADApprox offers superior interpolation capabilities. Our method locally approximates the advection-diffusion equation over an entire grid of points and learns the model parameters from gas measurements. The learned parameters are subsequently used to predict gas concentrations across the entire environment. Extensive simulations and physical experiments are conducted using a nano aerial vehicle. The mapping results demonstrate that ADApprox consistently outperforms a state-of-the-art algorithm (Kernel DM+V/W) while being comparable in terms of computational cost. In addition, we evaluate its effectiveness in localizing a gas source based on the predicted gas maps. Our findings indicate that ADApprox effectively localizes the gas source, achieving a median error of 18 cm over an area of 12 m^2 in physical experiments.
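As a hedged illustration of physics-informed gas mapping (not the ADApprox algorithm itself), the sketch below fits a simple isotropic Gaussian-plume model to sparse, noisy samples with least squares and then predicts a dense concentration map from the fitted physical parameters; the plume form, sample counts and noise level are assumptions.

```python
# Hedged illustration: fit a simple Gaussian-plume dispersion model to sparse
# gas samples and predict concentrations elsewhere. This is NOT the ADApprox
# algorithm, only a generic example of physics-informed interpolation.
import numpy as np
from scipy.optimize import curve_fit

def plume(xy, q, x0, y0, sigma):
    """Isotropic Gaussian plume centered at (x0, y0) with strength q."""
    x, y = xy
    return q * np.exp(-((x - x0)**2 + (y - y0)**2) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
xs = rng.uniform(0, 12, 40)                    # sparse sample locations (m)
ys = rng.uniform(0, 12, 40)
true = plume((xs, ys), q=5.0, x0=4.0, y0=7.0, sigma=2.0)
meas = true + rng.normal(0, 0.05, xs.shape)    # noisy measurements

popt, _ = curve_fit(plume, (xs, ys), meas, p0=[1.0, 6.0, 6.0, 1.0])
print("estimated source x, y:", popt[1], popt[2])

# Predict a dense map from the fitted physical parameters.
gx, gy = np.meshgrid(np.linspace(0, 12, 60), np.linspace(0, 12, 60))
gas_map = plume((gx.ravel(), gy.ravel()), *popt).reshape(gx.shape)
```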
|
|
13:30-13:35, Paper ThBT27.3 | |
Cumulative Informative Path Planning for Efficient Gas Source Localization with Mobile Robots |
|
Jin, Wanting | EPFL |
Leroy, Hugo | EPFL |
Bösel-Schmid, Nicolaj | EPFL |
Martinoli, Alcherio | EPFL |
Keywords: Environment Monitoring and Management, Motion and Path Planning, Probabilistic Inference
Abstract: Localizing gas sources is a challenging task due to the complex nature of gas dispersion. Informative Path Planning (IPP) plays a crucial role in guiding robots to sample at high-information points, thereby accelerating the estimation process. Existing probabilistic gas source localization methods often require robots to halt at sampling positions, averaging gas measurements over time. Consequently, when selecting the next sampling position, information gains are usually computed precisely through computationally heavy procedures, limiting evaluations to a small set of potential positions. In our previous work, we introduced a sense-in-motion strategy that eliminates the need for prolonged stops at sampling points, therefore allowing the incorporation of measurements taken during robot movement. Building upon this advancement, we propose extending information gain evaluation in a more continuous manner, from a point evaluation to a path evaluation. However, existing IPP methods are too computationally expensive when transitioning from goal-based to region-based evaluations. To address this challenge, we first assess three lightweight information extraction metrics. Based on the selected metrics, we propose a novel IPP algorithm that computes cumulative information along the robot's path and dynamically prioritizes exploration or exploitation based on the uncertainty of the source estimation. The proposed method is extensively evaluated through both high-fidelity simulations and physical experiments. Results show that our proposed method consistently outperforms a benchmark state-of-the-art method, achieving a 40% increase in source localization success rate and halving the experimental time in challenging environments.
|
|
13:35-13:40, Paper ThBT27.4 | |
SuMag: Suspended Magnetometer Survey for Mineral Data Acquisition with Vertical Take-Off and Landing Fixed-Wing Aircraft |
|
Efrem, Robel | Toronto Metropolitan University |
Coutu, Alexandre | Toronto Metropolitan University |
Saeedi, Sajad | Toronto Metropolitan University |
Keywords: Aerial Systems: Applications, Field Robots, Environment Monitoring and Management
Abstract: Multirotor Uncrewed Aerial Vehicles (UAVs) have recently become an important instrument for the magnetic method for mineral exploration (MMME), enabling more effective and accurate geological investigations. This paper explores the difficulties in mounting high-sensitivity sensors on a UAV platform, including electromagnetic interference, payload dynamics, and maintaining stable sensor performance in flight. We highlight how the solutions developed for these problems have the potential to transform UAV-assisted data collection for the MMME. The work also presents experimental findings that demonstrate the potential of these solutions for UAV-based data collection for the MMME, leading to improvements in mineral exploration through careful design, testing, and assessment of these systems. These innovations resulted in a platform that is quickly deployable in remote areas and operates more efficiently than traditional crewed aircraft or multirotor UAVs while still producing results of equal or higher quality. This allows for much higher efficiency and lower operating costs in high-production UAV-based data collection for the MMME.
|
|
13:40-13:45, Paper ThBT27.5 | |
GDM-Net++: Multi-Robot 2D and 3D Gas Distribution Mapping Via Deep Q-Learning and Gaussian Process Regression |
|
Kulbaka, Iliya | University of North Florida |
Dutta, Ayan | University of North Florida |
Kreidl, Patrick | University of North Florida |
Bölöni, Ladislau | University of Central Florida |
Roy, Swapnoneel | University of North Florida |
Keywords: Environment Monitoring and Management, Reinforcement Learning, Deep Learning Methods
Abstract: Gas distribution mapping (GDM) refers to the task of mapping the gas concentrations of an airborne chemical over a region of interest. A mobile robot equipped with a gas sensor can be used, potentially autonomously, to build such a distribution map. However, modern-day robots might not have enough battery power to cover the entire area of interest. Therefore, a group of n such collaborative mobile robots can be used for this purpose. The goal of the robots is to sample concentrations from a fraction of locations and infer the gas intensities in the rest of the area using a supervised machine learning technique, namely the Gaussian Process (GP). To this end, we propose a novel multi-robot gas distribution mapping framework, named GDM-Net++, which works in both 2D and 3D settings. Our proposed framework first divides the environment into n unique regions using Voronoi partitioning. Next, we employ a multi-agent deep Q-learning framework for the robots to learn a joint policy. As GP regression is compute-intensive, during testing the learned policy is applied without re-training the GP model. The experiments are performed in simulation using Python on six types of Gaussian plumes to validate our proposed technique. Compared to two baselines, greedy and random walk, GDM-Net++ performs 278% and 852% better in terms of earned rewards, while outperforming them by 34% and 155%, respectively, in terms of the precision of gas distribution modeling across unseen 2D test cases. Our approach can also gracefully handle 2D GDM scenarios where the distribution is consistently affected by wind.
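The GP interpolation step that such frameworks build on can be sketched with scikit-learn as below; the kernel, synthetic data and grid resolution are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of the GP interpolation step: predict gas concentration over a
# grid from sparse robot samples. Kernel and data are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
X_train = rng.uniform(0, 10, size=(60, 2))                       # sampled (x, y) locations
y_train = np.exp(-np.sum((X_train - [3, 6])**2, axis=1) / 4.0)   # synthetic plume readings

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.5) + WhiteKernel(1e-3),
                              normalize_y=True)
gp.fit(X_train, y_train)

# Dense prediction with per-cell uncertainty, usable as a gas distribution map.
gx, gy = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
X_test = np.column_stack([gx.ravel(), gy.ravel()])
mean, std = gp.predict(X_test, return_std=True)
gas_map, unc_map = mean.reshape(gx.shape), std.reshape(gx.shape)
```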
|
|
13:45-13:50, Paper ThBT27.6 | |
Persistent Preservation of a Spatio-Temporal Environment under Uncertainty |
|
Docena, Amel Nestor | Dartmouth College |
Quattrini Li, Alberto | Dartmouth College |
Keywords: Environment Monitoring and Management, Task Planning
Abstract: This paper tackles the spatio-temporal areas restoration problem for a single robot when faced with state uncertainty: a robot, with limited battery life, deployed in a known environment, persistently plans a schedule to visit areas of interest and charge its battery as needed. The temporal properties of areas decay over time, where the decay is only partially observable and evolves over time, potentially with correlation among areas. The goal is to restore the temporal properties so that the time the measured property values are below a certain threshold is minimized. Our previous work formulated the spatio-temporal areas restoration problem assuming that the decays are known. Instead, in this paper, we relax that assumption and account for the uncertainty, proposing a heuristic to measure the discounted opportunity cost of a visit, which induces risk-aversion to re-visit overlooked areas, and adding a component that learns the decay parameters in each area as well as potential correlation among areas. The learning component can then be used to predict future trends and be incorporated in the heuristic forecast. Moreover, the algorithm learns and constantly adjusts for noise that can occur during the mission. We show in experiments using a robotics simulator that our devised approach is able to maintain areas above the critical threshold better than existing state-of-the-art methods from related problems. This contribution enables a robot to efficiently come up with an effective schedule for preserving spatio-temporal properties of an environment under realistic scenarios, which has a marked impact on important environmental applications.
|
|
13:50-13:55, Paper ThBT27.7 | |
EANS: Reducing Energy Consumption for UAV with an Environmental Adaptive Navigation Strategy |
|
Liu, Tian | Sun Yat-Sen University |
Liu, Han | Sun Yat-Sen University |
Li, Boyang | Sun Yat-Sen University |
Cheng, Long | Sun Yat-Sen University |
Huang, Kai | Sun Yat-Sen University |
Keywords: Energy and Environment-Aware Automation, Embedded Systems for Robotic and Automation, Aerial Systems: Perception and Autonomy
Abstract: Unmanned Aerial Vehicles (UAVs) are limited by their onboard energy. Refining the navigation strategy, by adjusting key parameters in the UAV's pipeline, directly affects both the flight velocity and the trajectory, and can thus reduce energy consumption. However, existing techniques tend to adopt static and conservative strategies in dynamic scenarios, leading to inefficient energy reduction. Dynamically adjusting the navigation strategy requires overcoming challenges including task pipeline interdependencies, environment-strategy correlations, and the selection of parameters. To solve these problems, this paper proposes a method that dynamically adjusts the navigation strategy of the UAV by analyzing its dynamic characteristics and the temporal characteristics of the autonomous navigation pipeline, thereby reducing UAV energy consumption in response to environmental changes. We compare our method with the baseline through hardware-in-the-loop (HIL) simulation and real-world experiments, showing that our method achieves 3.2X and 2.6X improvements in mission time, and 2.4X and 1.6X improvements in energy, respectively.
|
|
13:55-14:00, Paper ThBT27.8 | |
Noise Fusion-Based Distillation Learning for Anomaly Detection in Complex Industrial Environments |
|
Yu, Jiawen | Fudan University |
Ren, Jieji | Shanghai Jiao Tong University |
Chang, Yang | Fudan University |
Yu, Qiaojun | Shanghai Jiao Tong University |
Tong, Xuan | Fudan University |
Wang, Boyang | Fudan University |
Song, Yan | Fudan University |
Li, You | Fudan University |
Mai, Xinji | Fudan University |
Zhang, Wenqiang | Fudan University |
Keywords: Factory Automation, Failure Detection and Recovery, Object Detection, Segmentation and Categorization
Abstract: Anomaly detection and localization in automated industrial manufacturing can significantly enhance production efficiency and product quality. Existing methods are capable of detecting surface defects in pre-defined or controlled imaging environments. However, accurately detecting workpiece defects in complex and unstructured industrial environments with varying views, poses and illumination remains challenging. We propose a novel anomaly detection and localization method specifically designed to handle inputs with perturbative patterns. Our approach introduces a new framework based on a collaborative distillation heterogeneous teacher network (HetNet), an adaptive local-global feature fusion module, and a local multivariate Gaussian noise generation module. HetNet can learn to model the complex feature distribution of normal patterns using limited information about local disruptive changes. We conducted extensive experiments on mainstream benchmarks. HetNet demonstrates superior performance with approximately 10% improvement across all evaluation metrics on MSC-AD under industrial conditions, while achieving state-of-the-art results on other datasets, validating its resilience to environmental fluctuations and its capability to enhance the reliability of industrial anomaly detection systems across diverse scenarios. Tests in real-world environments further confirm that HetNet can be effectively integrated into production lines to achieve robust and real-time anomaly detection. Codes, images and videos are published on the project website at: https://zihuatanejoyu.github.io/HetNet/
|
|
ThBT28 |
104 |
Rehabilitation Robotics 2 |
Regular Session |
Co-Chair: Fu, Chenglong | Southern University of Science and Technology (SUSTech) |
|
13:20-13:25, Paper ThBT28.1 | |
Hip-Knee-Ankle Rehabilitation Exoskeleton with Compliant Actuators: From Human-Robot Interaction Control to Clinical Evaluation |
|
Chen, Wanxin | Shenyang Institute of Automation, Chinese Academy of Sciences |
Zhang, Bi | Shenyang Institute of Automation, Chinese Academy of Sciences |
Tan, Xiaowei | Shenyang Institute of Automation, Chinese Academy of Sciences |
Zhao, Yiwen | Robotics Lab., Shenyang Institute of Automation, CAS |
Liu, Lianqing | Shenyang Institute of Automation |
Zhao, Xingang | Shenyang Institute of Automation, Chinese Academy of Sciences |
Keywords: Rehabilitation Robotics, Wearable Robots, Physical Human-Robot Interaction, Exoskeletons
Abstract: While rehabilitation exoskeletons have been extensively studied, systematic design principles for effectively addressing heterogeneous bilateral locomotion in hemiplegia patients are poorly understood. In this article, a multi-joint lower exoskeleton driven by series elastic actuators (SEAs) is developed, and the design philosophy of rehabilitation robots for hemiplegia patients is systematically explored. The exoskeleton has six powered joints for both lower limbs in a hip-knee-ankle configuration, and each joint incorporates a custom, lightweight SEA module. A unified interaction-oriented control framework is designed for exoskeleton-assisted walking, including gait generation, task scheduling and advanced joint-level control. The closed-loop design provides methodical solutions to address hemiplegia rehabilitation needs and provides walking assistance for bilateral lower limbs. Moreover, a multi-template gait generation approach is proposed to address the altered kinematics induced by exoskeleton-assisted walking and enhance the exoskeleton's adaptability to patient-specific kinematic variations in an iterative manner. Experiments are conducted with both healthy individuals and hemiplegia patients to verify the effectiveness of the exoskeleton system. The clinical outcomes demonstrate that the exoskeleton can achieve mechanical transparency, facilitate movement and enable coordinated inter-joint locomotion for bilateral gait assistance.
|
|
13:25-13:30, Paper ThBT28.2 | |
Design Optimization of a Single-DoF Gait Rehabilitation Robot for a Domestic Environment |
|
Ambros, Julius | Technical University of Munich |
Le Mesle, Valentin | Technical University of Munich |
Tissari, Laura | Reactive Robotics |
Börner, Hendrik | Reactive Robotics |
Peyrl, Helfried | Reactive Robotics |
Lueth, Tim C. | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Rehabilitation Robotics, Mechanism Design, Medical Robots and Systems
Abstract: In an aging society, the need for rehabilitation treatment is expected to rise. As current healthcare systems have limited capacity and personnel, access to rehabilitation devices usable in households can help address the demand. A lower-limb rehabilitation robot designed for home use must be adaptable to accommodate acute and chronic rehabilitation phases. Existing devices are mechanically complex and require intricate, patient-specific adjustments. To address this, we propose a single degree of freedom (DoF) mechanism based on a chain drive that can be used in multiple configurations, inside and outside a patient bed. We model the gait pattern and construct a custom cost function that captures key features of natural human walking. This cost function is then used to optimize the design parameters of the robot via a direct-search solver to accommodate patients of varying sizes and achieve effective rehabilitation with a fixed trajectory. The outcome is validated experimentally by comparing two robot configurations with five healthy subjects.
|
|
13:30-13:35, Paper ThBT28.3 | |
Crouch Gait Recognition of Children with Cerebral Palsy Based on CNN-LSTM Hybrid Model |
|
Liu, Junhang | Southern University of Science and Technology |
Luo, Mingxiang | Chinese Academy of Sciences |
Zhang, Shuo | Shenzhen Institutes of Advanced Technology, Chinese Academy of S |
Cao, Wujing | Shenzhen Institutes of Advanced Technology, Chinese Academy of S |
Wu, Xinyu | CAS |
Keywords: Rehabilitation Robotics, Recognition, Datasets for Human Motion
Abstract: Crouch gait is one of the key characteristics of children with cerebral palsy, and early detection of gait changes is crucial for subsequent exoskeleton-assisted therapy. This study uses the Vicon 3D motion capture system to collect experimental data on four gait phases of children with cerebral palsy and introduces a CNN-LSTM hybrid model. The model combines the spatial feature extraction strengths of CNN with the temporal sequence modeling capabilities of LSTM, enabling it to effectively identify the complex dynamic changes in gait specific to children with cerebral palsy. By integrating these two components, the model not only accurately extracts key gait features but also captures the temporal dependencies within the gait cycle, allowing for precise recognition of crouch gait. Experimental results demonstrate that the proposed model exhibits good robustness and achieves high accuracy in both overall gait recognition and distinguishing the four individual gait phases. It significantly outperforms traditional machine learning architectures.
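A generic CNN-LSTM gait-phase classifier of the kind described can be sketched in PyTorch as follows; the channel counts, layer sizes and four-phase output are illustrative assumptions rather than the authors' architecture.

```python
# Hedged sketch of a CNN-LSTM gait-phase classifier: a 1-D CNN extracts
# per-frame spatial features from marker/joint channels, an LSTM models the
# temporal sequence, and a linear head predicts one of four gait phases.
# All sizes are illustrative assumptions, not the authors' architecture.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, n_channels=30, n_phases=4, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_phases)

    def forward(self, x):                      # x: (batch, time, channels)
        z = self.cnn(x.transpose(1, 2))        # -> (batch, 64, time)
        z, _ = self.lstm(z.transpose(1, 2))    # -> (batch, time, hidden)
        return self.head(z)                    # per-frame phase logits

model = CNNLSTM()
logits = model(torch.randn(8, 120, 30))        # 8 trials, 120 frames, 30 channels
print(logits.shape)                            # torch.Size([8, 120, 4])
```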
|
|
13:35-13:40, Paper ThBT28.4 | |
A Learning Quasi-Stiffness Control Framework of a Powered Transfemoral Prosthesis for Adaptive Speed and Incline Walking |
|
Ma, Teng | National University of Singapore |
Yin, Shucong | Southern University of Science and Technology |
Hou, Zhimin | National University of Singapore |
Wang, Yuxuan | The Southern University of Science and Technology |
Huang, Binxin | Southern University of Science and Technology |
Yu, Haoyong | National University of Singapore |
Fu, Chenglong | Southern University of Science and Technology (SUSTech) |
Keywords: Rehabilitation Robotics, Prosthetics and Exoskeletons
Abstract: Impedance-based control represents a prevalent strategy for powered transfemoral prostheses because of its ability to reproduce natural walking. However, most existing studies have developed impedance-based prosthesis controllers for specific tasks, while creating a task-adaptive controller for variable-task walking continues to be a significant challenge. This article proposes a task-adaptive quasi-stiffness control framework for powered prostheses that generalizes across various walking tasks, enhancing the gait symmetry between the prosthesis and the intact leg. A Gaussian Process Regression (GPR) model is introduced to predict the target features of the human joint's angle and torque in a new task. Subsequently, Kernelized Movement Primitives (KMP) are employed to reconstruct the torque-angle relationship of the new task from multiple human reference trajectories and the estimated target features. Based on the torque-angle relationship of the new task, a quasi-stiffness control approach is designed for a powered prosthesis. Finally, the proposed framework is validated through practical examples, including varying-speed and incline walking tasks. Notably, the proposed framework not only aligns with but frequently surpasses the performance of a benchmark finite state machine impedance controller (FSMIC) without necessitating manual impedance tuning, and it has the potential to extend to the variable walking tasks of daily life for transfemoral amputees.
|
|
13:40-13:45, Paper ThBT28.5 | |
Development of a 4-DOF Mobile Manipulator for Repetitive Gait Training on the Track for Stroke Patients |
|
Lee, Junyeong | Gwangju Institute of Science and Technology |
Lee, Hosu | Gyeongsang National University |
Kim, Minkyung | Gwangju Institute of Science and Technology |
Jae-young, Han | Chonnam National University Hospital |
Yoon, Jungwon | Gwangju Institute of Science and Technology |
Keywords: Rehabilitation Robotics
Abstract: An overground track-walking scheme with a body-weight support system can provide task-oriented and repetitive training. Furthermore, it improves gait stability and endurance more effectively than the conventional treadmill walking method. However, it does not improve asymmetry and it reduces gait speed. Accordingly, we developed a 4-DOF mobile manipulator in which the position of the handle is controlled to provide continuous somatosensory information (cutaneous and proprioceptive), such as that of a fixed rail, to the user's hand during overground walking for gait enhancement (velocity, symmetry, and balance). The system consists of a 3-omni-wheel mobile robot for robust track following and a 1-DOF revolute-joint manipulator for gait enhancement during track-based gait guidance. To demonstrate the feasibility of the system, we conducted a pilot experiment with one stroke patient on a 15 m track. The experimental results showed that the robot could guide the patient along the track and enhance symmetry and balance, especially in the curved section of the track. Furthermore, the preferred walking speed of the participant on the track improved. Therefore, the system demonstrated promising potential for providing quantitative, repetitive, and safe track-based overground gait rehabilitation training.
|
|
13:45-13:50, Paper ThBT28.6 | |
Lip Geometry-Constrained Smooth Sliding Path Planning for Robotic Negative Pressure Therapy on Extremities |
|
Li, Zihao | Tsinghua University |
Nie, Zhenguo | Tsinghua University |
Shao, Qi | Tsinghua University |
Zhao, Huichan | Tsinghua University |
Liu, Xin-Jun | Tsinghua University |
Keywords: Rehabilitation Robotics, Motion and Path Planning, Soft Robot Applications
Abstract: Negative pressure (NP) therapy with sliding suction is an effective method for limb lymphedema. Due to the shortage of caregivers and the increasing number of patients, a robotic NP therapeutic system with a variable-sized suction head can be used to assist lymphedema therapy. However, the varying complexity of different limb regions can affect the accuracy of the suction path. Moreover, the moving suction path should remain smooth to ensure therapeutic efficacy. Therefore, finding a smooth sliding path with highly accurate suction poses on the unstructured limb surface poses a significant challenge for robotic therapy. In this paper, a smooth sliding path planning method is proposed for robotic continuous suction in limb lymphedema therapy. The easily-sealed region is identified by comparing point normals to the lip's suction angle, simplifying path planning to a 2D plane thanks to lip and limb flexibility. The conjugate gradient method optimizes the path with centroid distance and smoothness constraints. Finally, after the generation of suction poses under the constraints of the lip shape, a smooth sliding path, along with lip pressure commands, is obtained to regulate the robot in performing continuous suction therapy. In the experiments, a manipulator with a variable-sized head performed 10 sliding suctions from different planned paths, of which the robot completed 6 on the phantom arm.
|
|
13:50-13:55, Paper ThBT28.7 | |
Exoskeleton Gait Adaptation Framework Via Hm-DMP and PI² Optimization for Dynamic Patient Mobility Matching |
|
Cao, Qiaohuan | Zhejiang University |
Liu, Dewei | Zhejiang University |
Azam, Hamza | Zhejiang University |
Wang, Haoyu | Ningbo Rehabilitation Hospital |
Xu, Wenzhu | Hospital Department of Ningbo Rehabilitation Hospital, Ningbo, Z |
Fang, Jiongjie | Zhejiang University |
Yang, Wei | Zhejiang University |
Keywords: Rehabilitation Robotics, Prosthetics and Exoskeletons, Medical Robots and Systems
Abstract: Repetitive gait training with lower-limb exoskeletons enhances neuroplasticity and reduces muscle atrophy by promoting patient engagement in active rehabilitation training. Importantly, the therapeutic efficacy of such engagement critically depends on providing patients with task difficulty levels matching their real-time walking capacities. To address this, a closed-loop Mobility-Matching Framework is proposed, integrating Hybrid Multi-attractor Dynamic Movement Primitives (Hm-DMP) with Policy Improvement with Path Integral (PI²) optimization, which achieves real-time trajectory adaptation. The Hm-DMP module preserves critical kinematic invariants of normative gait patterns during trajectory deformation through constrained multi-attractor modulation. Simultaneously, the PI²-driven optimizer iteratively adjusts the joint trajectory keypoints of the Hm-DMP by optimizing a hybrid cost function, enabling dynamic matching between training trajectories and patients' real-time mobility. Experimental trials on the WEI-EXO platform demonstrate the proposed framework's robustness in detecting and responding to real-time changes in patients' ambulatory capacity by optimizing assistance trajectories while preserving normative gait kinematics. This closed-loop adaptation process facilitates personalized gait rehabilitation with exoskeletons, enhancing training efficacy and maintaining comfort across patients with diverse mobility levels.
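A minimal sketch of a PI²-style keypoint update (not the paper's Hm-DMP implementation): perturb the trajectory keypoints with Gaussian noise, score each perturbed rollout with a cost function, and average the perturbations with exponentiated cost weights. The toy cost and all parameters below are assumptions.

```python
# Hedged sketch of a PI^2-style update of trajectory keypoints: sample noisy
# perturbations, score each with a cost, and take the exponentially weighted
# average as the new keypoints. Cost and parameters are illustrative only.
import numpy as np

def pi2_update(keypoints, cost_fn, n_samples=32, sigma=0.05, h=10.0, rng=None):
    rng = rng or np.random.default_rng(0)
    eps = rng.normal(0.0, sigma, size=(n_samples,) + keypoints.shape)
    costs = np.array([cost_fn(keypoints + e) for e in eps])
    # Exponentiated, normalized weights (lower cost -> higher weight).
    s = (costs - costs.min()) / (costs.max() - costs.min() + 1e-9)
    w = np.exp(-h * s)
    w /= w.sum()
    return keypoints + np.tensordot(w, eps, axes=1)

# Toy cost: track a reference trajectory while penalizing roughness.
ref = np.sin(np.linspace(0, np.pi, 10))
cost = lambda k: np.sum((k - ref)**2) + 0.1 * np.sum(np.diff(k, 2)**2)

k = np.zeros(10)
for _ in range(50):
    k = pi2_update(k, cost)
print("final cost:", cost(k))
```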
|
|
13:55-14:00, Paper ThBT28.8 | |
Human-Robot Coordination Control for Sit-To-Stand Assistance in Hemiparetic Patients with Supernumerary Robotic Leg (I) |
|
Zuo, Jie | Wuhan University of Technology |
Huo, Jun | Huazhong University of Science and Technology |
Xiao, Xiling | Union Hospital, Tongji Medical College, Huazhong University of S |
Zhang, Yanzhao | Wuhan Union Hospital, Huazhong University of Science and Technol |
Huang, Jian | Huazhong University of Science and Technology |
Keywords: Rehabilitation Robotics, Compliance and Impedance Control, Physical Human-Robot Interaction
Abstract: In light of global aging and prevalent stroke-related hemiplegia, this study addresses challenges in robot-assisted Sit-to-Stand (STS) movements, a daily activity prone to falls. Supernumerary Robotic Legs (SRL) serve as independent supports, enhancing stability and limb movement range. Existing coordination control methods lack personalization for STS assistance, requiring solutions for transmitting human intent and rapidly optimizing coordination control in the non-coupled human-robot system. The proposed human-SRL coordination control algorithm, grounded in personalized SRL-human coupling models, incorporates surface electromyography (sEMG) signals to design an intent-driven variable stiffness impedance control. The inclusion of incremental learning enables rapid optimization of impedance parameters, facilitating real-time adjustments in SRL assistance for adaptive coupling with users. Practical experiments involving both healthy participants and hemiparetic patients validate the algorithm's effectiveness during STS. The results show substantial reductions in STS time (39.54%) and muscle activity (28.01%), highlighting the efficacy of the proposed algorithm-controlled SRL support for hemiparetic individuals.
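The general form of an intent-driven variable-stiffness impedance law can be sketched as below; the mapping from the sEMG envelope to stiffness and all gains are illustrative assumptions, not the personalized models used in the paper.

```python
# Hedged sketch of a variable-stiffness impedance law: joint torque follows a
# spring-damper around the desired trajectory, with stiffness scaled by a
# normalized sEMG envelope as a proxy for user intent. Gains are illustrative.
import numpy as np

def impedance_torque(q, dq, q_des, dq_des, emg_env,
                     k_min=5.0, k_max=60.0, damping=2.0):
    """emg_env in [0, 1]: higher muscle activity -> stiffer assistance."""
    stiffness = k_min + (k_max - k_min) * np.clip(emg_env, 0.0, 1.0)
    return stiffness * (q_des - q) + damping * (dq_des - dq)

# Example: one control tick during sit-to-stand assistance.
tau = impedance_torque(q=0.4, dq=0.1, q_des=0.6, dq_des=0.3, emg_env=0.7)
print(tau)
```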
|
|
ThBT29 |
105 |
Wearable Robotics 2 |
Regular Session |
|
13:20-13:25, Paper ThBT29.1 | |
A Variable-Stiffness Neck Exoskeleton with Pneumatic-Driven Actuators for Prolonged Head Flexion Assistance |
|
Li, Tianfang | Tianjin University |
Chen, Baojun | Tianjin University |
Hua, Zhichao | Tianjin University |
Lin, Qiang | Tianjin University |
Keywords: Wearable Robotics, Prosthetics and Exoskeletons
Abstract: Surgeons are frequently subjected to prolonged unnatural postures, such as sustained neck flexion during surgical procedures, increasing their susceptibility to work-related musculoskeletal disorders (WMSDs). Despite the prevalence of such issues, there remains a scarcity of ergonomic solutions that effectively mitigate neck pain while permitting full operational functionality across diverse environments. In this paper, we present a novel variable-stiffness neck exoskeleton with pneumatic-driven tensile actuators for prolonged head flexion assistance. Its innovative design facilitates unrestricted head movement while providing the expected neck support, thereby minimizing strain on antagonist muscle groups. Comprehensive experimental evaluations were conducted to assess the system's performance. Transitional response times between flexible and rigid states were 0.28 s and 0.46 s, respectively. Experimental trials involving five healthy subjects demonstrated that the average muscular activity reductions of the splenius capitis and the sternocleidomastoid muscles were 38.8±2.0% and 9.7±2.5%, respectively. These experimental results demonstrate the exoskeleton's great potential for practical use in alleviating the physical burden during prolonged head flexion.
|
|
13:25-13:30, Paper ThBT29.2 | |
Human-In-The-Loop Optimisation in Robot-Assisted Gait Training |
|
Christou, Andreas | The University of Edinburgh |
Sochopoulos, Andreas | The University of Edinburgh |
Lister, Elliot | The University of Edinburgh |
Vijayakumar, Sethu | University of Edinburgh |
Keywords: Wearable Robotics, Human Factors and Human-in-the-Loop, Optimization and Optimal Control
Abstract: Wearable robots offer a promising solution for quantitatively monitoring gait and providing systematic, adaptive assistance to promote patient independence and improve gait. However, due to significant interpersonal and intrapersonal variability in walking patterns, it is important to design robot controllers that can adapt to the unique characteristics of each individual. This paper investigates the potential of human-in-the-loop optimisation (HILO) to deliver personalised assistance in gait training. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) was employed to continuously optimise an assist-as-needed controller of a lower-limb exoskeleton. Six healthy individuals participated over a two-day experiment. Our results suggest that while the CMA-ES appears to converge to a unique set of stiffnesses for each individual, no measurable impact on the subjects' performance was observed during the validation trials. These findings highlight the impact of human-robot co-adaptation and human behaviour variability, whose effect may be greater than potential benefits of personalising rule-based assistive controllers. Our work contributes to understanding the limitations of current personalisation approaches in exoskeleton-assisted gait rehabilitation and identifies key challenges for effective implementation of human-in-the-loop optimisation in this domain.
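The CMA-ES tuning loop at the core of such HILO schemes can be sketched with the pycma package; in a real experiment each candidate stiffness vector would be applied on the exoskeleton and scored with a measured effort metric, which the placeholder cost below merely stands in for. All values are illustrative assumptions.

```python
# Hedged sketch of a CMA-ES human-in-the-loop tuning loop using pycma:
# each candidate stiffness vector would normally be applied for a short
# walking bout and scored with a measured human-effort metric; here a
# synthetic quadratic cost stands in for that measurement.
import cma
import numpy as np

def measured_effort(stiffness):
    # Placeholder for a real effort metric (e.g., an EMG or metabolic proxy).
    target = np.array([20.0, 35.0, 15.0])
    return float(np.sum((np.asarray(stiffness) - target)**2))

es = cma.CMAEvolutionStrategy([30.0, 30.0, 30.0], 5.0, {'maxiter': 30})
while not es.stop():
    candidates = es.ask()                      # stiffness settings to try
    costs = [measured_effort(c) for c in candidates]
    es.tell(candidates, costs)                 # update the search distribution
print("best stiffness:", es.result.xbest)
```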
|
|
13:30-13:35, Paper ThBT29.3 | |
Modular Soft Wearable Glove for Real-Time Gesture Recognition and Dynamic 3D Shape Reconstruction |
|
Dong, Huazhi | The University of Edinburgh |
Wang, Chunpeng | University of Edinburgh |
Jiang, Mingyuan | The University of Edinburgh |
Giorgio-Serchi, Francesco | University of Edinburgh |
Yang, Yunjie | The University of Edinburgh |
Keywords: Wearable Robotics, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: With the increasing demand for human-computer interaction (HCI), flexible wearable gloves have emerged as a promising solution in virtual reality, medical rehabilitation, and industrial automation. However, current technologies still suffer from problems such as insufficient sensitivity and limited durability, which hinder their wide application. This paper presents a highly sensitive, modular, and flexible capacitive sensor based on line-shaped electrodes and liquid metal (EGaIn), integrated into a sensor module tailored to the human hand's anatomy. The proposed system independently captures bending information from each finger joint, while additional measurements between adjacent fingers enable the recording of subtle variations in inter-finger spacing. This design enables accurate gesture recognition and dynamic morphological reconstruction of complex hand movements using point clouds. Experimental results demonstrate that our classifier, based on a Convolutional Neural Network (CNN) and a Multilayer Perceptron (MLP), achieves an accuracy of 99.15% across 30 gestures. Meanwhile, a transformer-based Deep Neural Network (DNN) accurately reconstructs dynamic hand shapes with an Average Distance (AD) of 2.076±3.231 mm, with the reconstruction accuracy at individual key points surpassing SOTA benchmarks by 9.7% to 64.9%. The proposed glove shows excellent accuracy, robustness and scalability in gesture recognition and hand reconstruction, making it a promising solution for next-generation HCI systems.
|
|
13:35-13:40, Paper ThBT29.4 | |
Probabilistic Collision Risk Estimation for Pedestrian Navigation |
|
Tourki, Amine | Biped Robotics SA |
Prevel, Paul | Biped Robotics SA |
Einecke, Nils | Honda Research Institute Europe GmbH |
Puphal, Tim | Honda Research Institute Europe GmbH |
Alahi, Alexandre | EPFL |
Keywords: Wearable Robotics, Vision-Based Navigation, Human Factors and Human-in-the-Loop
Abstract: Intelligent devices for supporting persons with vision impairment are becoming more widespread, but they lag behind the advancements in intelligent driver assistance systems. As a first step forward, this work discusses the integration of risk model technology, previously used in autonomous driving and advanced driver assistance systems, into an assistance device for persons with vision impairment. The risk model computes a probabilistic collision risk given object trajectories, which has previously been shown to give better indications of an object's collision potential than distance or time-to-contact measures in vehicle scenarios. In this work, we show that the risk model is also superior in warning persons with vision impairment about dangerous objects. Our experiments demonstrate that the warning accuracy of the risk model is 67%, while both distance and time-to-contact measures reach only 51% accuracy on real-world data.
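One hedged way to illustrate a probabilistic collision risk computed from predicted trajectories is a Monte Carlo estimate under Gaussian position uncertainty, as below; the uncertainty growth model, collision radius and trajectories are assumptions, not the risk model used in the paper.

```python
# Hedged sketch of a probabilistic collision risk from predicted trajectories:
# Monte Carlo estimate of the probability that pedestrian and object come
# closer than a collision radius at any future step, assuming Gaussian
# position uncertainty that grows with the prediction horizon.
import numpy as np

def collision_risk(traj_a, traj_b, radius=0.5, sigma0=0.1, growth=0.05,
                   n_samples=2000, rng=None):
    rng = rng or np.random.default_rng(0)
    traj_a, traj_b = np.asarray(traj_a), np.asarray(traj_b)
    hit = np.zeros(n_samples, dtype=bool)
    for t, (pa, pb) in enumerate(zip(traj_a, traj_b)):
        sigma = sigma0 + growth * t                       # uncertainty grows over time
        sa = pa + rng.normal(0, sigma, size=(n_samples, 2))
        sb = pb + rng.normal(0, sigma, size=(n_samples, 2))
        hit |= np.linalg.norm(sa - sb, axis=1) < radius   # collision at this step
    return hit.mean()

# Pedestrian walking straight, object crossing its path.
ped = np.column_stack([np.linspace(0, 5, 20), np.zeros(20)])
obj = np.column_stack([np.full(20, 2.5), np.linspace(3, -3, 20)])
print("collision risk:", collision_risk(ped, obj))
```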
|
|
13:40-13:45, Paper ThBT29.5 | |
Design and Control of Soft Robotic Wearable with SMA-Based Artificial Muscle Fibers for Ankle Assistance |
|
Joo, Eunsung | Ajou University |
Kim, Changhwan | Ajou University |
Im, Seungbin | Ajou University |
Koo, Sumin Helen | Yonsei University |
Koh, Je-Sung | Ajou University |
Keywords: Wearable Robotics, Soft Sensors and Actuators, Force Control
Abstract: Soft robotic wearables, with lightweight and flexible actuation, have shown promising results in assistive applications. However, it remains unclear whether they can be made fully comfortable and suitable for everyday use. In this study, we introduce a clothing-type soft robotic wearable embedded with shape memory alloy (SMA)-based artificial muscle fibers for ankle plantarflexion assistance. We conducted force characterization of the SMA wires, analyzing the effects of thickness, strain, applied current, and number of wires. The actuator was designed to achieve an assistive force of 80 N, and a closed-loop PI controller was implemented to enable precise force and displacement control. Finally, we demonstrate that the developed system can deliver controlled and repeatable forces in bench tests and modulate peak force according to target assistive levels at the ankle during walking in a user study.
|
|
13:45-13:50, Paper ThBT29.6 | |
Personalized Reinforcement Learning Control of Soft Robotic Exosuit for Assisting Human Normative Walking with Reduced Effort |
|
Quiñones Yumbla, Emiliano | University of Puerto Rico at Carolina |
Zhong, Junmin | Arizona State University |
Soltanian, Seyed Yousef | Arizona State University |
Si, Jennie | Arizona State University |
Zhang, Wenlong | Arizona State University |
Keywords: Wearable Robotics, Soft Robot Applications, Reinforcement Learning
Abstract: Wearable lower limb robots are promising technologies to assist human locomotion. Soft robotic exosuits introduce a promising solution for reducing muscle effort and metabolic cost as they are lightweight, transparent and inherently safe. However, it is challenging to effectively control such soft robots and personalize the assistance for individual users. Given the difficulty of developing robust dynamic models of the human-soft robot system, especially the interaction dynamics between the human and the robot, traditional control methods have seen limited success in addressing these challenges. Reinforcement learning (RL), a data-driven optimal control method, provides a naturally promising alternative. In this study, we propose an innovative control design approach to enable human normative walking with reduced physical effort. To achieve this goal, we first learn an exosuit controller offline for typical human normative walking, which is then used in the online phase of control tuning for individual users. Four participants were recruited to test the exosuit controller in treadmill walking. Our results show that online tuning for individual users converges quickly, typically within one experimental trial, owing to the efficient offline pre-trained policy. Furthermore, the RL control of the soft exosuit results in a physical effort reduction of 8.8% and 2.8% for the vastus lateralis and biceps femoris, as measured by electromyography (EMG) sensors. These results provide the first evidence of customizing soft exosuit assistance for individual users.
|
|
13:50-13:55, Paper ThBT29.7 | |
Soft Wearable Robotic Kit for Forearm Rotation and Grasping Motion Tracking Based on Embedded End-Effector-Level Sensor System |
|
Su, Huimin | Technical University of Munich |
Masiero, Federico | Technical University of Munich |
Missiroli, Francesco | Heidelberg University |
El Sidani, Mohamad Marwan | TUM |
Piazza, Cristina | Technical University Munich (TUM) |
Masia, Lorenzo | Technische Universität München (TUM) |
Keywords: Wearable Robotics, Prosthetics and Exoskeletons, Rehabilitation Robotics
Abstract: Patients suffering from neurological and musculoskeletal disorders often experience impaired upper limb function, significantly reducing their quality of life. In recent years, wearable robots have emerged as a promising solution to facilitate rehabilitation and assist in daily activities. Among these, tendon-driven actuation has been widely adopted; however, such systems face challenges in achieving precise position control compared to direct motor-driven systems. This is primarily due to the hysteresis and backlash resulting from the high compliance and elasticity of tendons, necessitating effective compensation strategies. In this paper, we implement an embedded compact sensor system for end-effector-level position tracking in a soft wearable robot designed for forearm pronation/supination and grasping motions. By integrating sensors at the end-effector, we enable real-time motion data acquisition and establish a closed-loop feedback mechanism that effectively compensates for the limitations of tendon-driven actuation, thereby enhancing overall control accuracy. Based on the embedded end-effector-level sensing system, we introduce a novel wearable robot kit for motion tracking comprising two parts: a sensor-only exosuit for real-time capture of user hand and forearm movements, and a motor-equipped exosuit that replicates and assists movements based on the sensor feedback. This Leader-Follower Control Mode allows for accurate capture of and rapid response to user motion intent, offering a new solution for applications in tele-control, mirror therapy, and motion synchronization.
|
|
ThBT30 |
106 |
Wheeled Robots 2 |
Regular Session |
|
13:55-14:00, Paper ThBT30.8 | |
Design and Active Stability Control of a Wheel-Foot Mobile Platform with High Trafficability |
|
Li, Xiran | Harbin Institute of Technology Shenzhen |
Yi, Haowei | Harbin Institute of Technology (Shenzhen) |
Yuan, Han | Harbin Institute of Technology |
Keywords: Wheeled Robots, Mechanism Design, Robust/Adaptive Control
Abstract: As a critical branch of robotics, mobile platforms have seen extensive applications in industrial automation, social services, and military industry sectors in recent years. However, conventional wheeled platforms exhibit limited obstacle-crossing capability, while legged robots, despite superior terrain traversability, demand excessive power consumption for high payloads and face significant challenges in maintaining platform stability on complex terrain due to intricate control requirements and hardware complexity. This study presents a wheel-footed hybrid robot that integrates a compound wheel-foot mechanism to achieve high payload capacity, exceptional terrain adaptability, and enhanced stability, enabling adaptive embodied intelligence in complex scenarios. First, a novel mechanical architecture and hardware system for the wheel-foot module were designed and constructed. Then, focusing on high dynamic response platform stabilization control, an autonomous planning framework and active stability control system were developed, accompanied by kinematic modeling of the prototype. Finally, experimental validation was conducted on the prototype, demonstrating the ability to carry an adult weighing approximately 60 kg while maintaining platform horizontality (maximum posture errors: 1.5° on slopes, 1.2° over speed bumps, 6.1° during stair climbing), verifying the practicality of both the mechanical design and control strategy.
|
|
ThCT1 |
401 |
Gesture, Posture and Facial Expressions 1 |
Regular Session |
|
15:00-15:05, Paper ThCT1.1 | |
Interactive Expressive Motion Generation Using Dynamic Movement Primitives |
|
Hielscher, Till | University of Stuttgart |
Bulling, Andreas | University of Stuttgart |
Arras, Kai Oliver | University of Stuttgart |
Keywords: Gesture, Posture and Facial Expressions, Learning from Demonstration, Social HRI
Abstract: Our goal is to enable social robots to interact autonomously with humans in a realistic, engaging, and expressive manner. The 12 Principles of Animation are a well-established framework animators use to create movements that make characters appear convincing, dynamic, and emotionally expressive. This paper proposes a novel approach that leverages Dynamic Movement Primitives (DMPs) to implement key animation principles, providing a learnable, explainable, modulable, online adaptable and composable model for automatic expressive motion generation. DMPs, originally developed for general imitation learning in robotics and grounded in a spring-damper system design, offer mathematical properties that make them particularly suitable for this task. Specifically, they enable modulation of the intensities of individual principles and facilitate the decomposition of complex, expressive motion sequences into learnable and parametrizable primitives. We present the mathematical formulation of the parameterized animation principles and demonstrate the effectiveness of our framework through experiments and application on three robotic platforms with different kinematic configurations, in simulation, on actual robots and in a user study. Our results show that the approach allows for creating diverse and nuanced expressions using a single base model.
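A minimal sketch of the discrete DMP formulation that this approach builds on is given below; the gains and the hand-written forcing term are illustrative assumptions, and the learned animation-principle modulation described in the paper is not reproduced.

```python
# Hedged sketch of a discrete Dynamic Movement Primitive rollout: a critically
# damped spring-damper attractor toward the goal, shaped by a phase-gated
# forcing term. Gains and the forcing term are illustrative assumptions.
import numpy as np

def dmp_rollout(y0, goal, duration=1.0, dt=0.01,
                alpha=25.0, beta=25.0 / 4.0, alpha_x=3.0,
                forcing=lambda x: 0.0):
    y, z, x = y0, 0.0, 1.0
    traj = [y]
    for _ in range(int(duration / dt)):
        f = forcing(x) * x * (goal - y0)                  # phase-gated forcing term
        dz = alpha * (beta * (goal - y) - z) + f          # transformation system
        dy = z
        z += dz * dt / duration
        y += dy * dt / duration
        x += -alpha_x * x * dt / duration                 # canonical system (phase decay)
        traj.append(y)
    return np.array(traj)

# Example: reach from 0 to 1 with an "anticipation"-like dip shaped by the forcing.
traj = dmp_rollout(0.0, 1.0, forcing=lambda x: -40.0 * np.sin(2 * np.pi * x))
print(traj[-1])  # ends near the goal as the phase (and forcing) decays
```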
|
|
15:05-15:10, Paper ThCT1.2 | |
Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation |
|
Xing, Hao | Technical University of Munich (TUM) |
Boey, Kai Zhe | Technical University of Munich (TUM) |
Wu, Yuankai | TUM |
Burschka, Darius | Technische Universitaet Muenchen |
Cheng, Gordon | Technical University of Munich |
Keywords: Gesture, Posture and Facial Expressions, Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces two key innovations: (1) a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness, and (2) a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Additionally, motivated by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.
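The sinusoidal (sin-cos) encoding of 3D skeleton coordinates can be sketched as a standard multi-frequency feature map, as below; the frequency set and joint count are illustrative assumptions.

```python
# Hedged sketch of a sinusoidal (sin-cos) encoding of 3D skeleton coordinates:
# each coordinate is mapped through sin/cos at several frequencies, giving a
# smooth, bounded representation. The frequency set is an illustrative choice.
import numpy as np

def sincos_encode(joints, n_freqs=4):
    """joints: (n_joints, 3) array -> (n_joints, 3 * 2 * n_freqs) features."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi           # pi, 2*pi, 4*pi, 8*pi
    angles = joints[..., None] * freqs                  # (n_joints, 3, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(joints.shape[0], -1)

skeleton = np.random.default_rng(0).uniform(-1, 1, size=(17, 3))  # 17 joints
print(sincos_encode(skeleton).shape)                    # (17, 24)
```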
|
|
15:10-15:15, Paper ThCT1.3 | |
FABG : End-To-End Imitation Learning for Embodied Affective Human-Robot Interaction |
|
Zhang, Yanghai | University of Shanghai for Science and Technology |
Liu, Changyi | Shanghai Droid Robot Co., Ltd. Shanghai 200433,China |
Fu, Keting | University of Shanghai for Science and Technology |
Zhou, Wenbin | Shanghai Droid Robot Co., Ltd |
Li, Qingdu | University of Shanghai for Science and Technology |
Zhang, Jianwei | University of Hamburg |
Keywords: Emotional Robotics, Gesture, Posture and Facial Expressions, Acceptability and Trust
Abstract: This paper proposes FABG (Facial Affective Behavior Generation), an end-to-end imitation learning system for human-robot interaction, designed to generate natural and fluid facial affective behaviors. In interaction, effectively obtaining high-quality demonstrations remains a challenge. In this work, we develop an immersive virtual reality (VR) demonstration system that allows operators to perceive stereoscopic environments. This system ensures "the operator's visual perception matches the robot's sensory input" and "the operator's actions directly determine the robot's behaviors" — as if the operator replaces the robot in human interaction engagements. We propose a prediction-driven latency compensation strategy to reduce robotic reaction delays and enhance interaction fluency. FABG naturally acquires human interactive behaviors and subconscious motions driven by intuition, eliminating manual behavior scripting. We deploy FABG on a real-world 25-degree-of-freedom (DoF) humanoid robot, validating its effectiveness through four fundamental interaction tasks: affective interaction, dynamic tracking, foveated attention, and gesture recognition, supported by data collection and policy training.
|
|
15:15-15:20, Paper ThCT1.4 | |
GazeTarget360: Towards Gaze Target Estimation in 360-Degree for Robot Perception |
|
Dai, Zhuangzhuang | Aston University |
Zakka, Vincent Gbouna | Aston University |
Manso, Luis J. | Aston University |
Li, Chen | Aalborg University |
Keywords: Gesture, Posture and Facial Expressions, Deep Learning for Visual Perception, Intention Recognition
Abstract: Enabling robots to understand human gaze targets is a crucial step toward capabilities in downstream tasks, for example, attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze targets in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross-validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes it a first-of-its-kind system for predicting gaze targets from realistic camera footage that is highly efficient and deployable. Our source code is made publicly available at: https://github.com/zdai257/DisengageNet.
|
|
15:20-15:25, Paper ThCT1.5 | |
EmoRLTalk: Speech-Driven Emotional Facial Animation with Offline Reinforcement Learning |
|
Liu, Gaofeng | Shanghai Jiaotong University |
Li, Xuetong | Shanghai Jiao Tong University |
Gao, Ruoyu | Shanghai Jiao Tong University |
Yuan, Ye | USST |
Liu, Jian | University of Shanghai for Science and Technology |
Li, Hengsen | Shanghai Jiao Tong University |
Huo, Hong | Shanghai Jiao Tong University |
Fang, Tao | Shanghai Jiao Tong University |
Keywords: Gesture, Posture and Facial Expressions, Deep Learning Methods, Simulation and Animation
Abstract: In recent years, significant breakthroughs have been made in speech-driven 3D facial animation technology. However, existing methods mainly focus on lip shape and audio consistency and still face key challenges to achieve alignment between facial emotions and speech emotions. To address this problem, we propose a novel framework, EmoRLTalk, which creatively introduces offline reinforcement learning to implicitly model the complex mapping relationship between 3D facial feature points and blendshape to enhance the fine-grained emotion expression ability of facial representation methods, while leveraging the powerful distribution fitting ability of the conditional diffusion model to generate facial expressions aligned with audio emotions. Additionally, based on the multi-task learning paradigm, we construct a collaborative training framework of a regression main task and a classification sub-task. Specifically, we use emotion classification of blendshape as a sub-task to further improve the model's ability to express facial emotions. To further enhance system controllability, we integrate the ControlNet module, allowing users to achieve precise facial expression control. Comprehensive qualitative and quantitative experimental results show that EmoRLTalk outperforms existing state-of-the-art methods in terms of emotional expressiveness and lip-sync accuracy.
|
|
15:25-15:30, Paper ThCT1.6 | |
FSGlove: An Inertial-Based Hand Tracking System with Shape-Aware Calibration |
|
Li, Yutong | Shanghai Jiao Tong University |
Zhang, Jieyi | Shanghai Jiao Tong University |
Xu, Wenqiang | Shanghai Jiaotong University |
Tang, Tutian | Shanghai Jiao Tong University |
Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Gesture, Posture and Facial Expressions, Human and Humanoid Motion Analysis and Synthesis, Embedded Systems for Robotic and Automation
Abstract: Accurate hand motion capture (MoCap) is vital for applications in robotics, virtual reality, and biomechanics, yet existing systems face limitations in capturing high-degree-of-freedom (DoF) joint kinematics and personalized hand shape. Commercial gloves offer up to 21 DoFs, which are insufficient for complex manipulations while neglecting shape variations that are critical for contact-rich tasks. We present FSGlove, an inertial-based system that simultaneously tracks up to 48 DoFs and reconstructs personalized hand shapes via DiffHCal, a novel calibration method. Each finger joint and the dorsum are equipped with IMUs, enabling high-resolution motion sensing. DiffHCal integrates with the parametric MANO model through differentiable optimization, resolving joint kinematics, shape parameters, and sensor misalignment during a single streamlined calibration. The system achieves state-of-the-art accuracy, with joint angle errors of less than 2.7°, and outperforms commercial alternatives in shape reconstruction and contact fidelity. FSGlove's open-source hardware and software design ensures compatibility with current VR and robotics ecosystems, while its ability to capture subtle motions (e.g., fingertip rubbing) bridges the gap between human dexterity and robotic imitation. Evaluated against Nokov optical MoCap, FSGlove advances hand tracking by unifying the kinematic and contact fidelity. Hardware design, software, and more results are available at: https://sites.google.com/view/fsglove
|
|
15:30-15:35, Paper ThCT1.7 | |
HAPI: A Model for Learning Robot Facial Expressions from Human Preferences |
|
Yang, Dongsheng | Kyoto University |
Liu, Qianying | National Institute of Informatics |
Wataru, Sato | RIKEN |
Minato, Takashi | RIKEN |
Liu, Chaoran | National Institute of Informatics |
Nishida, Shin'ya | Kyoto University |
Keywords: Gesture, Posture and Facial Expressions, Emotional Robotics
Abstract: Automatic robotic facial expression generation is crucial for human-robot interaction (HRI), as handcrafted methods based on fixed joint configurations often yield rigid and unnatural behaviors. Although recent automated techniques reduce the need for manual tuning, they tend to fall short by not adequately bridging the gap between human preferences and model predictions, resulting in a deficiency of nuanced and realistic expressions due to limited degrees of freedom and insufficient perceptual integration. In this work, we propose a novel learning-to-rank framework that leverages human feedback to address this discrepancy and enhance the expressiveness of robotic faces. Specifically, we conduct pairwise comparison annotations to collect human preference data and develop the Human Affective Pairwise Impressions (HAPI) model, a Siamese RankNet-based approach that refines expression evaluation. Results obtained via Bayesian Optimization and an online expression survey on a 35-DOF android platform demonstrate that our approach produces significantly more realistic and socially resonant expressions of Anger, Happiness, and Surprise than those generated by baseline and expert-designed methods. This confirms that our framework effectively bridges the gap between human preferences and model predictions while robustly aligning robotic expression generation with human affective responses.
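A Siamese RankNet-style preference model of the kind described can be sketched in PyTorch as below; the feature dimensionality (here 35 actuator commands), network size and synthetic data are illustrative assumptions.

```python
# Hedged sketch of a Siamese RankNet-style preference model: the same scorer
# is applied to two candidate facial expressions, and a logistic loss pushes
# the preferred one to score higher. Sizes and data are illustrative only.
import torch
import torch.nn as nn

class Scorer(nn.Module):
    def __init__(self, n_features=35):          # e.g., 35 actuator commands
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

scorer = Scorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

x_pref = torch.randn(16, 35)                     # expressions humans preferred
x_other = torch.randn(16, 35)                    # the compared alternatives
for _ in range(100):
    diff = scorer(x_pref) - scorer(x_other)
    loss = nn.functional.binary_cross_entropy_with_logits(
        diff, torch.ones_like(diff))             # preferred item should score higher
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```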
|
|
ThCT2 |
402 |
Industrial Robotics and Control |
Regular Session |
Chair: Wang, Yuzhe | Singapore Institute of Manufacturing Technology - A*STAR |
|
15:00-15:05, Paper ThCT2.1 | |
Novel LPV System Identification for a Gantry Stage: A Global Approach with Adjustable Basis Functions |
|
Yoon, Jegwon | DGIST |
Jung, Hanul | ETRI |
Kong, Taejune | DGIST |
Oh, Sehoon | DGIST |
Keywords: Motion Control, Industrial Robots, Robust/Adaptive Control
Abstract: Robotic gantry stages are a prevalent class of industrial robots used for precise positioning tasks in various fields, including semiconductor manufacturing, 3D printing, and automated assembly. However, these systems often exhibit time-varying dynamics because the position of the end-effector (i.e., the payload) shifts the mass/inertia properties. Such dynamic variations are not captured by conventional Linear Time-Invariant (LTI) models, leading to modeling inaccuracies and degraded control performance. Linear Parameter-Varying (LPV) system identification is a more suitable alternative, but existing approaches typically employ a single, fixed basis-function order for all parameters, resulting in excessive model complexity and poor efficiency. This paper presents a novel global LPV system identification method for multi-axis robotic gantry systems, enabling independent basis-function order selection for each parameter. By eliminating unnecessary high-order terms, the method reduces computational overhead and enhances modeling accuracy. Experimental validation on an industrial gantry testbed confirms superior precision and robustness compared to conventional LPV approaches with uniform polynomial orders.
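To make the idea of independently chosen basis-function orders concrete, here is a minimal sketch of a global LPV-ARX fit in which each scheduling-dependent coefficient gets its own polynomial order; the model structure, orders, and synthetic data are illustrative assumptions, not the paper's identification procedure.

# Sketch of global LPV-ARX identification with per-coefficient basis orders.
# y[k] = a(p[k]) * y[k-1] + b(p[k]) * u[k-1], with a(.) and b(.) expanded in
# polynomial bases of independently chosen orders (illustrative model only).
import numpy as np

rng = np.random.default_rng(0)
N = 500
u = rng.standard_normal(N)            # input sequence
p = np.linspace(0.0, 1.0, N)          # scheduling parameter (e.g., payload position)
y = np.zeros(N)
for k in range(1, N):                 # "true" system used to generate data
    y[k] = (0.5 + 0.3 * p[k]) * y[k - 1] + (1.0 - 0.4 * p[k] ** 2) * u[k - 1]

order_a, order_b = 1, 2               # independent basis orders per coefficient
rows = []
for k in range(1, N):
    basis_a = [p[k] ** i for i in range(order_a + 1)]
    basis_b = [p[k] ** i for i in range(order_b + 1)]
    rows.append([y[k - 1] * ba for ba in basis_a] + [u[k - 1] * bb for bb in basis_b])
Phi, target = np.array(rows), y[1:]

theta, *_ = np.linalg.lstsq(Phi, target, rcond=None)
print("estimated a(p) and b(p) basis coefficients:", np.round(theta, 3))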
|
|
15:05-15:10, Paper ThCT2.2 | |
A Light-Weight Robotic System for Flexible and Efficient Heavy-Load Palletizing in Cuboid Spaces (I) |
|
Song, Yanshu | CUHK (Chinese University of Hong Kong) |
Zhao, Jie | Harbin Institute of Technology, Shenzhen |
Huang, Tianyu | The Chinese University of Hong Kong |
Lau, Darwin | The Chinese University of Hong Kong |
Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Logistics, Industrial Robots, Intelligent Transportation Systems
Abstract: Loading and unloading goods in narrow cuboid spaces (e.g., truck carriages) is a critical task in logistics transportation. It is often time-consuming, labor-intensive, and risky, and is therefore a valuable and necessary target for automation. However, due to the complexity of operational scenarios, most existing robots struggle to simultaneously integrate: small size and weight, sufficient load capacity, high robustness, high picking flexibility and efficiency, and good adaptability to narrow cuboid spaces. In this work, we present a light-weight robotic system capable of achieving flexible, efficient, and heavy-load palletizing in constrained cuboid spaces. We propose a novel adsorption-tray hybrid gripper which can significantly reduce the required adsorption area (by 64%) while maintaining high picking flexibility without affecting load capacity. Based on this gripper, we propose a lift-then-pick picking method, and provide the design method for an adjustable picking trajectory as well as an optimization strategy for the number of suction nozzles. Furthermore, we introduce the time-optimal control idea into the picking process based on its unique dynamic constraints, which improves efficiency by 46%. We also implement a compliance control mechanism that significantly enhances robustness (i.e., nearly doubling the picking success rate). In addition, we develop a specially designed manipulator to carry the gripper. The horizontally arranged degree-of-freedom configuration and the new rotational joint construction method enable the arm to remain light-weight while providing sufficient load capacity, and its coverage of cuboid spaces far exceeds that of traditional industrial arms (100% vs. 53%). Finally, we manufacture a prototype and validate the above advantages through both simulation and real-world tests, confirming its potential for practical applications.
|
|
15:10-15:15, Paper ThCT2.3 | |
Gravity Compensation with Dual Quaternions |
|
Arjonilla García, Francisco Jesús | Shizuoka University |
Kobayashi, Yuichi | Shizuoka University |
Keywords: Methods and Tools for Robot System Design, Kinematics, Motion Control
Abstract: Static balancing is essential for effective control of robots. The typical approach to obtain the gravity compensation terms is to construct an ad-hoc trigonometric model of the robot. This paper proposes a dual-quaternion representation of centroids that enables streamlined calculation of gravity compensation terms on rigid robots with arbitrary kinematic trees and arbitrary base orientations, valid for revolute and prismatic joints in kinematic chains with rigid links. The method was successfully tested on a six degrees-of-freedom manipulator model and evaluated against third-party software.
|
|
15:15-15:20, Paper ThCT2.4 | |
Spherical Scissor-Like Reconfigurable Palm Design in Robotic Hands: Insights from Human Hand Functionality |
|
Wang, Jiaxing | Fudan University |
Zhang, Fang | Fudan University |
Chen, Kai | Fudan University |
Liu, Chang | Fudan University |
Zhu, Guo-Niu | Fudan University |
Lu, Qiujie | Fudan University |
Gan, Zhongxue | Fudan University |
Keywords: Mechanism Design, Multifingered Hands
Abstract: The human palm demonstrates spatial reconfigurability during the gripping process and forms a spherical grasping envelope. Based on these observations, this study designs a reconfigurable spherical palm that incorporates a spatial scissor mechanism, which requires only a single actuator to reshape the palm into a range of spherical forms. We conduct a kinematic analysis and modelling of the structure, abstracting three key parameters and analysing their influence on the motion characteristics of the palm. Through multi-objective optimisation, a set of dimensional parameters is derived to balance workspace, human-like motion, and mechanical performance. The reconfigurability and grasping capability of the proposed palm are compared to those of a planar folding palm using superquadrics, and the results show that the spherical design and reconfigurable characteristics provide a larger grasping arrangement and stronger grasping capability on most of the tested surfaces.
|
|
15:20-15:25, Paper ThCT2.5 | |
Robust Model-Free Path Tracking Algorithm for Hydraulic Center-Articulated Scooptrams |
|
Zhang, Zihan | Northeastern University |
Fang, CunGuang | Shenyang LiGong University |
Xu, Pu | Northeastern University |
Fang, Zheng | Northeastern University |
Keywords: Mining Robotics, Industrial Robots, Motion Control
Abstract: This paper proposes a model-free steering control method to address the path tracking challenges of Hydraulic Center-articulated Scooptrams (HCS) in narrow underground mining environments. Due to the nonlinear and time-delay characteristics of the hydraulic steering system, the HCS exhibits response lag when executing control commands. The lag time demonstrates dynamic uncertainty influenced by operating conditions, hydraulic pressure, and load variations. To address this challenge, an adaptive steering control strategy is designed. This strategy leverages the geometric relationship between the HCS and the reference path to dynamically adjust the look-ahead distance, thereby compensating for the uncertainty caused by the hydraulic system lag. Additionally, the error is mapped to the actual control input in real-time through a feedback error controller, effectively correcting control errors caused by lag without relying on a complex hydraulic system model. The proposed method was experimentally validated in a full-scale simulated mining tunnel, demonstrating considerable robustness and precise path tracking performance under uneven terrain, heavy loads, significant initial error, and bidirectional movement. This method provides a viable solution for the autonomous navigation of the HCS.
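For intuition only, the following sketch shows a generic pure-pursuit-style steering law whose look-ahead distance is adjusted from speed and lateral error, loosely echoing the idea of compensating hydraulic lag through geometric adaptation; the gains, limits, and helper functions are hypothetical and do not reproduce the paper's controller.

# Illustrative adaptive look-ahead steering sketch (generic pure-pursuit style,
# not the paper's controller). Look-ahead grows with speed and shrinks with
# lateral error, loosely mimicking lag compensation by geometric adjustment.
import numpy as np

def adaptive_lookahead(speed, lateral_error, L0=2.0, kv=0.5, ke=1.0,
                       Lmin=1.0, Lmax=6.0):
    # Longer look-ahead at higher speed, shorter when the path error is large.
    L = L0 + kv * speed - ke * abs(lateral_error)
    return float(np.clip(L, Lmin, Lmax))

def pure_pursuit_steer(pose, path, speed, wheelbase=3.0):
    # pose = (x, y, heading); path = (N, 2) array of waypoints.
    x, y, yaw = pose
    d = np.hypot(path[:, 0] - x, path[:, 1] - y)
    lateral_error = d.min()
    L = adaptive_lookahead(speed, lateral_error)
    # Pick the first waypoint at least L ahead of the vehicle.
    idx = int(np.argmax(d >= L)) if np.any(d >= L) else len(path) - 1
    tx, ty = path[idx]
    alpha = np.arctan2(ty - y, tx - x) - yaw  # heading to target in body frame
    return np.arctan2(2.0 * wheelbase * np.sin(alpha), L)  # steering angle

path = np.column_stack([np.linspace(0, 50, 200), 2.0 * np.sin(np.linspace(0, 5, 200))])
print(pure_pursuit_steer((0.0, 0.5, 0.0), path, speed=2.0))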
|
|
15:25-15:30, Paper ThCT2.6 | |
A New Double-Integration-Enhanced RNN Algorithm for Discrete Time-Variant Equation Systems with Robot Manipulator Applications (I) |
|
Shi, Yang | Yangzhou University |
Chong, Wei | Yangzhou University |
Cao, Xinwei | Jiangnan University |
Jiang, Chao | Yangzhou University |
Zhao, Ruxin | Yangzhou University |
Gerontitis, Dimitrios K. | International Hellenic University |
Keywords: Neural and Fuzzy Control, Redundant Robots
Abstract: Discrete time-variant equation systems represent a typical and complex problem across various disciplines. With the increasing complexity of systems in various fields, traditional methods have been unable to effectively deal with current discrete time-variant equation systems, especially in dynamic engineering problems. Generally speaking, traditional methods are typically limited to considering discrete time-variant equation systems in an ideal state, and research into more intricate disturbance states remains limited. This paper introduces a new recurrent neural network (RNN) algorithm, termed the discrete-time double-integration-enhanced RNN (DT-DIE-RNN) algorithm, for handling discrete time-variant equation systems (including discrete time-variant linear and nonlinear equation systems) under discrete square-time-variant disturbance. Firstly, the continuous-time double-integration-enhanced RNN (CT-DIE-RNN) algorithm is presented for solving discrete time-variant linear and nonlinear equation systems by using a double-integral-type error function. Secondly, the corresponding discrete-time RNN algorithm is presented, and the convergence and precision of such an algorithm are theoretically analyzed. Finally, the effectiveness and superiority of the proposed DT-DIE-RNN algorithm for solving discrete time-variant linear and nonlinear equation systems are supported by comparative numerical experiments, and these theoretical results are further verified by robot manipulator applications.
|
|
15:30-15:35, Paper ThCT2.7 | |
Sparse Bayesian Learning-Based Interval Type-2 Fuzzy Logic Control for Electrospinning Processes (I) |
|
Sun, Hongwei | Huazhong University of Science and Technology |
Zhang, Hai-Tao | Huazhong University of Science and Technology |
Xing, Ning | Huazhong University of Science and Technology |
Wang, Yasen | Huazhong University of Science and Technology |
Shi, Yang | University of Victoria |
Keywords: Neural and Fuzzy Control
Abstract: This paper develops a closed-loop electrospinning process control system composed of a high-speed industrial camera, an interval type-2 (IT2) fuzzy logic controller (FLC) and a high-precision programmable micropump. A purely data-driven IT2 T-S fuzzy model with a micropump flow input and a fiber diameter output is established by a sparse Bayesian learning (SBL) method, and the closed-loop IT2 FLC is thereby proposed to finely tune the electrospinning fiber diameter according to the technical requirements of the circuit electrospinning process, which is subject to external disturbances and system uncertainties. Sufficient conditions are derived to guarantee the asymptotic stability of the closed-loop system with the assistance of Lyapunov theory. Experiments on a bead-chain-structure electrospinning process are conducted to show the effectiveness and superiority of the proposed SBL-based fuzzy controller.
|
|
15:35-15:40, Paper ThCT2.8 | |
Automatic Machinability Evaluation and Recommendation for Reconfigurable Manufacturing Systems |
|
Wang, Yuzhe | Singapore Institute of Manufacturing Technology - A*STAR |
Sun, Haining | Agency for Science, Technology and Research (A*STAR) |
King, Matthew Francis | Advanced Remanufacturing and Technology Centre |
Ng, Teck Chew | Singapore Institute of Manufacturing Technology |
Keywords: Manufacturing, Maintenance and Supply Chains, Intelligent and Flexible Manufacturing, Factory Automation
Abstract: In the current era of rapid-changing market and supply chain fluctuations, manufacturing industries face significant challenges in maintaining agility and resilience. Machinability evaluation is one of the critical steps in manufacturing planning to ensure the industry can respond to rapid market changes or customization demands. However, this process is currently conducted manually and highly relies on engineers’ knowledge and experience. This paper presents a systematic approach that employs fuzzy logic, which can be automated by a software program, for evaluating the machinability of product features and recommending reconfiguration options for computer numerical control (CNC) machines. This approach assesses each product feature against the current machine kinematics. The validated results demonstrate that the proposed fuzzy logic can generate comprehensive results that reflect varying degrees of machinability. For product features that cannot be machined with the existing machine configurations, this approach identifies specific limitations and provides data-driven recommendations for reconfiguration. This will assist machine operators in making informed decisions, thereby reducing reliance on manual evaluation and planning. Furthermore, this automated evaluation process will enable a shorter turnaround time for new production line setups and enhance the overall operational efficiency, especially in high-variety, low-volume manufacturing environments.
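As a toy illustration of how fuzzy logic can grade machinability rather than return a yes/no answer, the sketch below evaluates a single AND rule over two hypothetical margins; the membership functions, variables, and rule are assumptions made for the example, not the paper's rule base.

# Minimal fuzzy-logic machinability sketch (illustrative; the membership shapes,
# variables, and rule below are assumptions, not the paper's rule base).
def tri(x, a, b, c):
    # Triangular membership function peaking at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def machinability(reach_margin_mm, tilt_margin_deg):
    # Degree to which the feature is "comfortably reachable" and
    # "within tool tilt limits" for the current machine kinematics.
    reach_ok = tri(reach_margin_mm, 0.0, 50.0, 100.0)
    tilt_ok = tri(tilt_margin_deg, 0.0, 15.0, 30.0)
    # Single AND rule (min operator): machinable only if both margins are adequate.
    return min(reach_ok, tilt_ok)

print(machinability(reach_margin_mm=40.0, tilt_margin_deg=10.0))   # partially machinable
print(machinability(reach_margin_mm=0.0, tilt_margin_deg=20.0))    # not machinable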
|
|
ThCT3 |
403 |
Autonomous Agents 1 |
Regular Session |
|
15:00-15:05, Paper ThCT3.1 | |
Real-Time Incremental Mapping and Degeneration-Awared Localization for Multi-Floor Parking Lots Based on IPM Image |
|
Feng, Youyang | China Automotive Innovation Corporation |
Qu, Weiming | Peking University |
Wang, Wei | China Automotive Innovation Corporation |
Wang, Chenchen | China Automotive Innovation Corporation |
Wang, Hongyao | China Automotive Innovation Corporation |
He, Shizheng | China Automotive Innovation Corporation |
Luo, Dingsheng | Peking University |
Keywords: Autonomous Agents, Semantic Scene Understanding, AI-Enabled Robotics
Abstract: In indoor parking lots, the use of RTK/GNSS for vehicle localization is often impractical due to the significantly smaller space compared to outdoor roads, which demands higher precision in both mapping and localization. Although feature-point-based visual SLAM algorithms have achieved high localization accuracy, they impose significant storage demands on embedded systems, and the visual feature point maps are not time-stable and are sensitive to lighting conditions. In this paper, we propose a real-time mapping and localization system for multi-floor parking lots. For the mapping part, we introduce a map-free SLAM method for precise ego-pose estimation, along with an efficient incremental map update framework that supports loop closure and multi-session mapping tasks. In the localization part, a semantic map is reused for vehicle localization based on bidirectional incentive descriptors. We incorporate degenerate cases into our optimization process, which greatly enhances the localization results. To the best of our knowledge, this is the first comprehensive system proposed for multi-floor parking lots. Experimental results demonstrate that our approach achieves state-of-the-art mapping and localization accuracy in multi-floor environments on embedded platforms.
|
|
15:05-15:10, Paper ThCT3.2 | |
AGENTS-LLM: Augmentative Generation of Challenging Traffic Scenarios with an Agentic LLM Framework |
|
Yao, Yu | Bosch Center for Artificial Intelligence |
Bhatnagar, Salil | University of Erlangen–Nuremberg |
Mazzola, Markus | Robert Bosch GmbH |
Belagiannis, Vasileios | Friedrich-Alexander-Universität Erlangen-Nürnberg |
Gilitschenski, Igor | University of Toronto |
Palmieri, Luigi | Robert Bosch GmbH |
Razniewski, Simon | ScaDS.AI, TU Dresden |
Hallgarten, Marcel | University of Tübingen, Robert Bosch GmbH |
Keywords: Autonomous Agents, Task and Motion Planning
Abstract: Rare, yet critical, scenarios pose a significant challenge in testing and evaluating autonomous driving planners. Relying solely on real-world driving scenes requires collecting massive datasets to capture these scenarios. While automatic generation of traffic scenarios appears promising, data-driven models require extensive training data and often lack fine-grained control over the output. Moreover, generating novel scenarios from scratch can introduce a distributional shift from the original training scenes, which undermines the validity of evaluations, especially for learning-based planners. To sidestep this, recent work proposes to generate challenging scenarios by augmenting original scenarios from the test set. However, this involves the manual augmentation of scenarios by domain experts, an approach that cannot meet the demands for scale in the evaluation of self-driving systems. Therefore, this paper introduces a novel LLM-agent-based framework for augmenting real-world traffic scenarios using natural language descriptions, addressing the limitations of existing methods. A key innovation is the use of an agentic design, enabling fine-grained control over the output and maintaining high performance even with smaller, cost-effective LLMs. Extensive human expert evaluation demonstrates our framework's ability to accurately adhere to user intent, generating high-quality augmented scenarios comparable to those created manually.
|
|
15:10-15:15, Paper ThCT3.3 | |
An Intelligent Tennis Training Robot with Timely Motion Feedback |
|
Qu, Weiming | Peking University |
Chen, Weizheng | Jilin University |
Luo, Dingsheng | Peking University |
Keywords: Autonomous Agents, Control Architectures and Programming, Agent-Based Systems
Abstract: Tennis is widely popular across various age groups. However, mastering fluid stroke mechanics remains a significant challenge, requiring substantial time and practice. The lack of scientifically grounded tools for skill acquisition and training further hampers the development of tennis proficiency. In this paper, we investigate the application of an intelligent tennis robot equipped with advanced human-machine interaction capabilities, designed to serve as an effective tool for tennis learners, especially for enhancing motion geometry and coordination. First, we detail the design and development of the intelligent tennis robot, highlighting its core functions and operational principles, including ball serving, human motion sensing, and motion feedback mechanisms. Subsequently, we introduce a novel human motion analysis method that integrates motion geometry and kinematic analyses. Following this, we present a real-time motion feedback system that identifies deficiencies in players' movements, thereby facilitating the enhancement of their motion memory. Finally, we conduct experiments with players of varying skill levels, analyzing their motion patterns and providing practical examples. The proposed human-machine interaction framework offers a pioneering solution for intelligent tennis training, enabling players to understand their movements and correct errors immediately after each stroke.
|
|
15:15-15:20, Paper ThCT3.4 | |
An Actionable Hierarchical Scene Representation Enhancing Autonomous Inspection Missions in Unknown Environments |
|
Kottayam Viswanathan, Vignesh | Lulea University of Technology |
Valdes Saucedo, Mario Alberto | Lulea University of Technology |
Satpute, Sumeet | Luleå University of Technology |
Kanellakis, Christoforos | LTU |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Autonomous Agents, Field Robots, Semantic Scene Understanding
Abstract: In this article, we present the Layered Semantic Graphs (LSG), a novel actionable hierarchical scene graph, fully integrated with a multi-modal mission planner, FLIE, a First-Look based Inspection and Exploration planner. The novelty of this work stems from addressing the task of maintaining an intuitive and multi-resolution scene representation, while simultaneously offering a tractable foundation for planning and scene understanding during an ongoing inspection mission of a priori unknown targets-of-interest in an unknown environment. The proposed LSG scheme is composed of locally nested hierarchical graphs, at multiple layers of abstraction, with the abstract concepts grounded in the functionality of the integrated FLIE planner. Furthermore, LSG encapsulates real-time semantic segmentation models that offer extraction and localization of desired semantic elements within the hierarchical representation. This extends the capability of the inspection planner, which can then leverage LSG to make an informed decision to inspect a particular semantic of interest. We also emphasize the hierarchical and semantic path-planning capabilities of LSG, which could extend inspection missions by improving situational awareness for human operators in an unknown environment. The validity of the proposed scheme is proven through extensive evaluations of the proposed architecture in simulations, as well as experimental field deployments on a Boston Dynamics Spot quadruped robot in urban outdoor environment settings.
|
|
15:20-15:25, Paper ThCT3.5 | |
On Learning Closed-Loop Probabilistic Multi-Agent Simulator |
|
Lu, Juanwu | Purdue University |
Gupta, Rohit | Toyota Motor North America R&D |
Moradipari, Ahmadreza | Toyota Motor North America R&D |
Han, Kyungtae | Toyota Motor North America R&D |
Zhang, Ruqi | Purdue University |
Wang, Ziran | Purdue University |
Keywords: Deep Learning Methods, Autonomous Agents, Probability and Statistical Methods
Abstract: The rapid iteration of autonomous vehicle (AV) deployments leads to increasing needs for building realistic and scalable multi-agent traffic simulators for efficient evaluation. Recent advances in this area focus on closed-loop simulators that enable generating diverse and interactive scenarios. This paper introduces Neural Interactive Agents (NIVA), a probabilistic framework for multi-agent simulation driven by a hierarchical Bayesian model that enables closed-loop, observation-conditioned simulation through autoregressive sampling from a latent, finite mixture of Gaussian distributions. We demonstrate how NIVA unifies preexisting sequence-to-sequence trajectory prediction models and emerging closed-loop simulation models trained on Next-token Prediction (NTP) from a Bayesian inference perspective. Experiments on the Waymo Open Motion Dataset demonstrate that NIVA attains competitive performance compared to existing methods while providing enhanced control over intentions and driving styles.
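To illustrate the general mechanism of closed-loop rollout by autoregressive sampling from a finite Gaussian mixture (not NIVA itself), here is a toy example in which each step's displacement is drawn from one of two hand-set "intention" modes and fed back into the state; all parameters are invented for the illustration.

# Toy sketch of closed-loop autoregressive rollout where each agent's next
# displacement is sampled from a finite Gaussian mixture conditioned on the
# current state (illustrative only; mixture parameters are hand-set, not learned).
import numpy as np

rng = np.random.default_rng(1)

def mixture_step(state):
    # Two hand-set "intention" modes: keep lane vs. slight left drift.
    weights = np.array([0.7, 0.3])
    means = np.array([[1.0, 0.0], [1.0, 0.3]]) + 0.05 * state  # weak state coupling
    covs = np.array([np.diag([0.02, 0.01]), np.diag([0.02, 0.02])])
    k = rng.choice(2, p=weights)            # sample a mixture component (intention)
    return rng.multivariate_normal(means[k], covs[k])

state = np.zeros(2)
trajectory = [state.copy()]
for _ in range(20):                          # closed-loop rollout: feed samples back in
    state = state + mixture_step(state)
    trajectory.append(state.copy())
print(np.round(np.array(trajectory)[-3:], 2))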
|
|
15:25-15:30, Paper ThCT3.6 | |
Is the House Ready for Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering |
|
Dorbala, Vishnu Sashank | University of Maryland, College Park |
Goyal, Prasoon | Amazon |
Piramuthu, Robinson | Amazon |
Johnston, Michael | Amazon |
Ghanadan, Reza | Amazon |
Manocha, Dinesh | University of Maryland |
Keywords: Autonomous Agents, AI-Based Methods, AI-Enabled Robotics
Abstract: We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and properties ("What is the color of the car?"), situational queries (such as "Is the house ready for sleeptime?") are challenging as they require the agent to correctly identify multiple object-states (Doors: Closed, Lights: Off, etc.) and reach a consensus on their states for an answer. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to generate unique situational queries and corresponding consensus object information. PGE is used to generate 2K datapoints in the VirtualHome simulator, which are then annotated for ground truth answers via a large-scale user study conducted on M-Turk. With a high rate of answerability (97.26%) in this study, we establish that LLMs are good at generating situational data. However, in evaluating the data using an LLM, we observe a low correlation of 46.2% with the ground truth human annotations, indicating that while LLMs are good at generating situational data, they struggle to answer such queries according to consensus. Further, we qualitatively analyze PGE's performance in generating situational data for a real-world environment, exposing LLM hallucination in generating reliable object-states when a structured scene graph is unavailable. To the best of our knowledge, this is the first work to introduce EQA in the context of situational queries and also the first to present a generative approach for query creation. We aim to foster research on improving the real-world usability of embodied agents through this work.
|
|
15:30-15:35, Paper ThCT3.7 | |
CODEI: Resource-Efficient Task-Driven Co-Design of Perception and Decision Making for Mobile Robots Applied to Autonomous Vehicles |
|
Milojevic, Dejan | ETH Zürich |
Zardini, Gioele | Massachusetts Institute of Technology |
Elser, Miriam | Empa |
Censi, Andrea | ETH Zurich |
Frazzoli, Emilio | ETH |
Keywords: Autonomous Agents, Sensor-based Control, Optimization and Optimal Control, Resource-aware task-driven robot co-design
Abstract: This paper discusses integration challenges and strategies in mobile robot design, focusing on task-driven hardware and software selection to balance safety, efficiency, and resources such as cost, energy, computation, and weight. We emphasize the interplay between perception and motion planning, introducing the concept of occupancy queries to quantify perception requirements for sampling-based motion planners. Sensor and algorithm performance are evaluated using the false negative rate (FNR) and false positive rate (FPR) across factors such as geometry, object properties, sensor resolution, and environment. By integrating perception requirements with performance, an integer linear programming (ILP) approach is proposed for sensor and algorithm selection and placement. This forms the basis for a co-design optimization of the robot body, motion planner, perception pipeline, and computing unit, forming the CODEI (Co-design of Embodied Intelligence) framework. A case study on AV design for urban scenarios shows that complex tasks escalate resource demands, with task performance influencing autonomy stack choices. Cameras are preferred for cost-effective, lightweight designs, while lidar offers better energy and computational efficiency.
|
|
ThCT4 |
404 |
AI-Enabled Robotics 3 |
Regular Session |
|
15:00-15:05, Paper ThCT4.1 | |
Advancing Object-Goal Navigation through LLM-Enhanced Object Affinities Transfer |
|
Lin, Mengying | Georgia Institute of Technology |
Liu, Shugao | Institute of Automation, Chinese Academy of Sciences |
Zhang, Dingxi | ETH Zurich, Zurich, CH |
Chen, Yaran | Institute of Automation, Chinese Academy of Sciences |
Wang, Zhaoran | Northwestern University |
Li, Haoran | Institute of Automation, Chinese Academy of Sciences |
Zhao, Dongbin | Chinese Academy of Sciences |
Keywords: AI-Based Methods, AI-Enabled Robotics, Semantic Scene Understanding
Abstract: Object-goal navigation requires mobile robots to efficiently locate targets with visual and spatial information, yet existing methods struggle with generalization in unseen environments. Heuristic approaches with naive metrics fail in complex layouts, while graph-based and learning-based methods suffer from environmental biases and limited generalization. Although Large Language Models (LLMs) as planners or agents offer a rich knowledge base, they are cost-inefficient and lack targeted historical experience. To address these challenges, we propose the LLM-enhanced Object Affinities Transfer (LOAT) framework, integrating LLM-derived semantics with learning-based approaches to leverage experiential object affinities for better generalization in unseen settings. LOAT employs a dual-module strategy: one module accesses LLMs' vast knowledge, and the other applies learned object semantic relationships, dynamically fusing these sources based on context. Evaluations in AI2-THOR and Habitat simulators show significant improvements in navigation success and efficiency, and real-world deployment demonstrates the zero-shot ability of LOAT to enhance object-goal navigation systems.
|
|
15:05-15:10, Paper ThCT4.2 | |
Monocular Person Localization under Camera Ego-Motion |
|
Zhan, Yu | Southern University of Science and Technology |
Ye, Hanjing | Southern University of Science and Technology |
Zhang, Hong | Southern University of Science and Technology |
Keywords: Robot Companions, Human-Centered Automation, Surveillance Robotic Systems
Abstract: Localizing a person from a moving monocular camera is critical for Human-Robot Interaction (HRI). To estimate the 3D human position from a 2D image, existing methods either depend on the geometric assumption of a fixed camera or use a position regression model trained on datasets containing little camera ego-motion. These methods are vulnerable to fierce camera ego-motion, resulting in inaccurate person localization. We consider person localization as a part of a pose estimation problem. By representing a human with a four-point model, our method jointly estimates the 2D camera attitude and the person's 3D location through optimization. Evaluations on both public datasets and real robot experiments demonstrate our method outperforms baselines in person localization accuracy. Our method is further implemented into a person-following system and deployed on an agile quadruped robot.
|
|
15:10-15:15, Paper ThCT4.3 | |
Pet-NODE: Embedding Priors and Time-Series Features into Neural ODE |
|
Chen, Jia | Southeast University |
Xu, Yongyue | Southeast University |
Su, Jinya | Southeast University |
Gu, Kun | Nanjing Sciyon Wisdom Technology Group CO., LTD |
Wang, Fuyou | Nanjing Sciyon Wisdom Technology Group CO., LTD |
Li, Shihua | Southeast University |
Keywords: Machine Learning for Robot Control, Model Learning for Control, Wheeled Robots
Abstract: Accurate modeling of dynamic systems is essential for robotics, enhancing system perception and control performance. This work tackles causal modeling challenges for mobile robots under complex uncertainties, including internal model inaccuracies and external environmental disturbances. Unlike first‐principle or purely data‐driven methods, we propose Pet-NODE, an advanced Neural Ordinary Differential Equation (NODE) framework that integrates physical priors with temporal features for high-fidelity system modeling. To further embed domain knowledge, we introduce a novel loss function with self-prediction objectives, ensuring adherence to physical principles. Extensive experiment evaluations, including ablation studies and comparisons against Nominal model, K-NODE and PI-TCN methods, demonstrate Pet-NODE’s robustness, interpretability, and superior localization accuracy on a self-collected wheeled robot dataset.
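As a loose sketch of the physics-prior-plus-learned-residual idea (with explicit Euler integration standing in for an ODE solver), the example below combines a nominal unicycle model with a small residual network; the model structure, shapes, and data are assumptions and do not reproduce Pet-NODE.

# Sketch of a physics-prior + learned-residual continuous-time model for a
# wheeled robot (illustrative; Euler integration stands in for an ODE solver,
# and the unicycle prior and residual MLP are assumptions, not Pet-NODE itself).
import torch
import torch.nn as nn

class PriorResidualODE(nn.Module):
    def __init__(self):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(5, 32), nn.Tanh(), nn.Linear(32, 3))

    def f(self, state, control):
        # Nominal unicycle prior: state = (x, y, yaw), control = (v, omega).
        x, y, yaw = state.unbind(-1)
        v, omega = control.unbind(-1)
        prior = torch.stack([v * torch.cos(yaw), v * torch.sin(yaw), omega], dim=-1)
        # Learned residual captures unmodeled effects (slip, delays, etc.).
        return prior + self.residual(torch.cat([state, control], dim=-1))

    def rollout(self, state, controls, dt=0.05):
        states = [state]
        for u in controls:                    # explicit Euler integration
            state = state + dt * self.f(state, u)
            states.append(state)
        return torch.stack(states)

model = PriorResidualODE()
s0 = torch.zeros(3)
controls = torch.stack([torch.tensor([1.0, 0.1])] * 10)
print(model.rollout(s0, controls).shape)     # (11, 3) predicted trajectory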
|
|
15:15-15:20, Paper ThCT4.4 | |
Towards Efficient Image-Goal Navigation: A Self-Supervised Transformer-Based Reinforcement Learning Approach |
|
Weng, Qizhen | Sun Yat-Sen University |
Hu, Jiaocheng | Sun Yat-Sen University |
Wu, Zhijie | Sun Yat-Sen University |
Zhu, Xiangwei | Sun Yat-Sen University |
Keywords: Reinforcement Learning, Vision-Based Navigation, Representation Learning
Abstract: Image-goal navigation is a crucial yet challenging task that requires an agent to navigate to a goal location specified by an image. Modular methods decompose the problem into distinct subtasks and often involve explicit map construction, which can struggle in complex, unstructured environments. In contrast, end-to-end deep reinforcement learning (DRL)-based methods directly output actions from visual input, with recent improvements focusing on enhancing embedding fusion between the current observation and the goal image. However, both approaches fail to fully leverage the rich temporal relationships present in the agent's visual-action history. In this paper, we address this limitation by employing a self-supervised transformer to predict masked portions of the agent's visual-action embeddings. To promote spatio-temporal reasoning, a dual-attention shared transformer is utilized for both masked representation learning and policy generation. Our method demonstrates superior performance and generalization ability compared to 12 existing baselines across the Gibson, MP3D, and HM3D datasets. Code and trained models are available at https://github.com/hujch23/DaMVA.
|
|
15:20-15:25, Paper ThCT4.5 | |
VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots |
|
Grigorev, Danil | Moscow Institute of Physics and Technology |
Kovalev, Alexey | AIRI |
Panov, Aleksandr | AIRI |
Keywords: Deep Learning Methods, AI-Enabled Robotics, Task Planning
Abstract: In the field of robotics, researchers face a critical challenge in ensuring reliable and efficient task planning. Verifying high-level task plans before execution significantly reduces errors and enhances the overall performance of these systems. In this paper, we propose an architecture for automatically verifying high-level task plans before their execution in simulated or real-world environments. Leveraging Large Language Models (LLMs), our approach consists of two key steps: first, the conversion of natural language instructions into Linear Temporal Logic (LTL), followed by a comprehensive analysis of action sequences. The module uses the reasoning capabilities of the LLM to evaluate logical coherence and identify potential gaps in the plan. Rigorous testing on datasets of varying complexity demonstrates the broad applicability of the module to household tasks. We contribute to improving the reliability and efficiency of task planning and address the critical need for robust pre-execution verification in autonomous systems. The project page is available at https://verifyllm.github.io.
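For intuition about the kind of check a plan analysis performs over a finite action sequence, here is a toy precedence check such as one might derive from an instruction like "wash the cup before placing it on the shelf"; the predicate names and plans are hypothetical, and this is not the paper's LTL translation or verification module.

# Toy finite-trace check of a precedence constraint extracted from an instruction
# (illustrative only; not the paper's verification architecture).
def satisfies_precedence(plan, before, after):
    # True if every occurrence of `after` is preceded by at least one `before`.
    seen_before = False
    for action in plan:
        if action == before:
            seen_before = True
        if action == after and not seen_before:
            return False
    return True

plan_ok = ["pick(cup)", "wash(cup)", "place(cup, shelf)"]
plan_bad = ["pick(cup)", "place(cup, shelf)", "wash(cup)"]
print(satisfies_precedence(plan_ok, "wash(cup)", "place(cup, shelf)"))   # True
print(satisfies_precedence(plan_bad, "wash(cup)", "place(cup, shelf)"))  # False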
|
|
15:25-15:30, Paper ThCT4.6 | |
Dual-Level Open-Vocabulary 3D Scene Representation for Instance-Aware Robot Navigation |
|
Zheng, Tianlu | Northeastern University |
Yang, Kaicheng | DeepGlint |
Dou, Yilong | Northeastern University |
Feng, Ziyong | DeepGlint |
Ding, Qichuan | Northeastern University, China |
Keywords: AI-Enabled Robotics, Vision-Based Navigation, Semantic Scene Understanding
Abstract: Advanced scene understanding is crucial for robots to navigate robustly in complex 3D environments. Recent works utilize large Vision-Language Models (VLMs) to embed semantic information into reconstructed maps, thereby creating open-vocabulary scene representations for instance-aware robot navigation. However, existing methods primarily generate point-wise feature vectors for maps, which inadequately capture the intricate scene contents necessary for navigation tasks, including holistic and relational object information. To address this limitation, we propose a novel Dual-Level Open-Vocabulary 3D (DLOV-3D) scene representation framework to improve robot navigation performance. Our framework integrates both pixel-level and image-level features into spatial scene representations, facilitating a more comprehensive understanding of the scene. By incorporating an adaptive revalidation mechanism, DLOV-3D achieves precise instance-aware navigation based on free-form queries that describe object properties such as color, shape, and relational references. Notably, when combined with Large Language Models (LLMs), DLOV-3D supports long-sequence multi-instance robot navigation guided by natural language instructions. Extensive experimental results demonstrate that DLOV-3D achieves new state-of-the-art performance in instance-aware robot navigation.
|
|
15:30-15:35, Paper ThCT4.7 | |
AugInsert: Learning Robust Visual-Force Policies Via Data Augmentation for Object Assembly Tasks |
|
Diaz, Ryan | Rice University |
Imdieke, Adam | University of Minnesota |
Veeriah, Vivek | Google DeepMind |
Desingh, Karthik | University of Minnesota |
Keywords: Learning from Demonstration, Representation Learning, Sensorimotor Learning
Abstract: Operating in unstructured environments like households requires robotic policies that are robust to out-of-distribution conditions. Although much work has been done in evaluating robustness for visuomotor policies, the robustness evaluation of a multisensory approach that includes force-torque sensing remains largely unexplored. This work introduces a novel, factor-based evaluation framework with the goal of assessing the robustness of multisensory policies in a peg-in-hole assembly task. To this end, we develop a multisensory policy framework utilizing the Perceiver IO architecture to learn the task. We investigate which factors pose the greatest generalization challenges in object assembly and explore a simple multisensory data augmentation technique to enhance out-of-distribution performance. We provide a simulation environment enabling controlled evaluation of these factors. Our results reveal that multisensory variations such as Grasp Pose present the most significant challenges for robustness, and naive unisensory data augmentation applied independently to each sensory modality proves insufficient to overcome them. Additionally, we find force-torque sensing to be the most informative modality for our contact-rich assembly task, with vision being the least informative. Finally, we briefly discuss supporting real-world experimental results. For additional experiments and qualitative results, we refer to the project webpage https://rpm-lab-umn.github.io/auginsert/ .
|
|
15:35-15:40, Paper ThCT4.8 | |
KiteRunner: Language-Driven Cooperative Local-Global Navigation Policy with UAV Mapping in Outdoor Environments |
|
Huang, Shibo | East China Normal University |
Shi, Chenfan | East China Normal University |
Yang, Jian | Information Engineering University |
Dong, Hanlin | East China Normal University |
Mi, Jinpeng | USST |
Li, Ke | Information Engineering University |
Zhang, Jianfeng | East China Normal University |
Ding, Miao | Liaoning Technical University |
Peidong, Liang | Harbin Institute of Technology |
You, Xiong | Information Engineering University |
Wei, Xian | East China Normal University |
Keywords: AI-Enabled Robotics, Deep Learning Methods, Vision-Based Navigation
Abstract: Autonomous navigation in open-world outdoor environments faces challenges in integrating dynamic conditions, long-range spatial reasoning, and semantic understanding. Traditional methods struggle to balance local planning, global planning, and semantic task execution, while existing large language models (LLMs) enhance semantic understanding but lack spatial reasoning capabilities. Although diffusion models excel at local optimization, they fall short in large-scale, long-range navigation. To address these gaps, this paper proposes KiteRunner, a language-driven local-global cooperative navigation strategy that combines UAV-orthophoto-based global planning with diffusion-model-driven local path generation for long-range navigation in open-world scenarios. Our approach innovatively leverages real-time UAV orthophotography to construct a global probability map that provides traversability guidance for the local planner, while integrating large models such as CLIP and GPT to interpret natural language instructions. Experiments demonstrate that KiteRunner achieves improvements of 5.6% and 12.8% in path efficiency over state-of-the-art methods in structured and unstructured environments, respectively, while significantly reducing human intervention and execution time.
|
|
ThCT5 |
407 |
Force Control |
Regular Session |
Co-Chair: Calinon, Sylvain | Idiap Research Institute |
|
15:00-15:05, Paper ThCT5.1 | |
Wrench-Guided and Velocity-Field-Based Geometric Impedance Control |
|
Huang, Yuancan | Beijing Institute of Technology |
Shao, Nianfeng | Beijing Institute of Technology |
Hong, Da | Beijing Institute of Technology |
Keywords: Compliance and Impedance Control, Force Control, Contact Modeling
Abstract: Taking inspiration from the principle of locality in the field theories of physics [Giachetta09], we aim to generate an impedance-related velocity field on SE(3) that characterizes the local interaction behaviors associated with specific tasks. Thanks to its locality, the desired impedance at each point of SE(3) can be effectively rendered by following the velocity field, without invoking the geometrical inconsistency of active stiffness [Villani16]. First, we introduce a nonlinear impedance equipped with a guided wrench and a velocity output to portray the global interaction behaviors. The guided wrench is conceived to ensure that the velocity output at each point of SE(3) accurately reflects the local interaction behaviors. Next, we employ model-matching approaches to regulate the robot's end-effector based on the velocity output. Finally, we conduct a case study on the Peg-in-Hole (PiH) task to demonstrate this impedance control technique.
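As a much simpler illustration of what following a velocity field over poses can look like, the sketch below commands a twist proportional to the pose error between the current and goal poses, with a stiffness-like gain per component; this is an assumed toy field, not the paper's wrench-guided formulation.

# Simple velocity-field sketch on SE(3): the commanded twist points from the
# current pose toward a goal pose, scaled by stiffness-like gains (illustrative
# only; not the wrench-guided impedance of the paper).
import numpy as np
from scipy.spatial.transform import Rotation as R

def velocity_field(p, q, p_goal, q_goal, k_lin=1.0, k_ang=0.5):
    # p, p_goal: positions (3,); q, q_goal: orientations as quaternions (x, y, z, w).
    v = k_lin * (p_goal - p)                              # linear velocity command
    r_err = R.from_quat(q_goal) * R.from_quat(q).inv()    # rotation taking q to q_goal
    w = k_ang * r_err.as_rotvec()                         # angular velocity command
    return v, w

p = np.array([0.3, 0.0, 0.5])
q = R.from_euler("z", 30, degrees=True).as_quat()
p_goal = np.array([0.4, 0.1, 0.5])
q_goal = R.identity().as_quat()
print(velocity_field(p, q, p_goal, q_goal))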
|
|
15:05-15:10, Paper ThCT5.2 | |
Safety-Aware Geometric Force-Impedance Control for Manipulators |
|
Zeng, Danping | Hunan University |
Wang, Yaonan | Hunan University |
Jiang, Yiming | Hunan University |
Jiang, Jiao | Hunan University |
Yang, Chenguang | University of Liverpool |
Zhang, Hui | Hunan University |
Keywords: Compliance and Impedance Control, Force Control
Abstract: Since its inception, impedance control has emerged as a fundamental framework for robotic interaction control. Recent advancements in geometric impedance control have demonstrated certain advantages over traditional Cartesian impedance control. However, existing geometric impedance control approaches generally lack force regulation capabilities or rigorous stability guarantees. In this paper, we propose a safety-aware geometric force-impedance controller that addresses these limitations. By incorporating an energy tank mechanism, the proposed approach enables precise force tracking while preserving full compatibility with the impedance behavior. Furthermore, an energy injection and freezing mechanism is introduced, allowing dynamic regulation of energy exchange between the tank and the robotic system. Notably, the proposed method eliminates the need for an offline estimation of the initial energy stored in the tank, facilitating real-time adjustments of force controller parameters. To validate the effectiveness of the proposed framework, we conduct extensive polishing experiments on a real robotic platform. The results demonstrate the capability of the proposed controller to achieve stable and precise force regulation.
|
|
15:10-15:15, Paper ThCT5.3 | |
Learning Object Compliance Via Young's Modulus from Single Grasps Using Camera-Based Tactile Sensors |
|
Burgess, Michael | Massachusetts Institute of Technology |
Zhao, Jialiang | Massachusetts Institute of Technology |
Willemet, Laurence | TU Delft |
Keywords: Contact Modeling, Force and Tactile Sensing, Perception for Grasping and Manipulation
Abstract: Compliance is a useful parametrization of tactile information that humans often utilize in manipulation tasks. It can be used to inform low-level contact-rich actions or characterize objects at a high-level. In robotic manipulation, existing approaches to estimate compliance have struggled to generalize across both object shape and material. Using camera-based tactile sensors, proprioception, and force measurements, we present a novel approach to estimate object compliance as Young's modulus E from parallel grasps. We evaluate our method over a novel dataset of 285 common objects, including a wide array of shapes and materials with Young's moduli ranging from 5.0 kPa to 250 GPa. Combining analytical and data-driven approaches, we develop a hybrid system using a multi-tower neural network to analyze a sequence of tactile images from grasping. This system is shown to estimate the Young's modulus of unseen objects within an order of magnitude at 74.2% accuracy across our dataset. This is an improvement over purely analytical and data-driven baselines which exhibit 28.9% and 65.0% accuracy respectively. Importantly, this estimation system performs irrespective of object geometry and demonstrates increased robustness across material types. Code is available on GitHub and collected data is available on HuggingFace.
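As an aside on the analytical side of such hybrid estimators, a common closed-form baseline fits an effective modulus to force-indentation data using the Hertzian sphere-on-flat model F = (4/3) E_eff sqrt(R) d^(3/2); the contact radius, data, and noise below are invented for illustration and this is not the authors' estimator.

# Illustrative analytical baseline: fit an effective modulus from grasp force
# vs. indentation depth with the Hertzian contact model (assumed geometry only).
import numpy as np

R = 0.01                                  # assumed effective contact radius [m]
d = np.linspace(1e-4, 2e-3, 30)           # indentation depths from proprioception [m]
E_true = 50e3                             # "true" effective modulus [Pa], e.g. soft foam
F = (4.0 / 3.0) * E_true * np.sqrt(R) * d ** 1.5
F_meas = F + np.random.default_rng(2).normal(0, 0.002, F.shape)  # noisy force readings [N]

# Linear least squares in the transformed variable z = (4/3) * sqrt(R) * d**1.5.
z = (4.0 / 3.0) * np.sqrt(R) * d ** 1.5
E_est = float(np.sum(z * F_meas) / np.sum(z * z))
print(f"estimated effective modulus: {E_est / 1e3:.1f} kPa")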
|
|
15:15-15:20, Paper ThCT5.4 | |
Learning-Based Predictive Impedance Control towards Safe Predefined-Time Physical Robotic Interaction |
|
Xue, Junyuan | National University of Singapore |
Liang, Wenyu | Institute for Infocomm Research, A*STAR |
Xu, Yilan | National University of Singapore |
Wu, Yan | A*STAR Institute for Infocomm Research |
Lee, Tong Heng | National University of Singapore |
Keywords: Compliance and Impedance Control, Force Control, Machine Learning for Robot Control
Abstract: Impedance control can be achieved within a model predictive control (MPC) framework for optimization and constraint compliance. However, user-defined or optimization-derived impedance models can be too conservative to achieve a timely convergence, or too aggressive to ensure safety. To address this, an MPC-based impedance control framework with learning-based tuning for predefined-time convergence is proposed. On the low level, this framework dynamically selects between a task-oriented and a safety-oriented impedance model based on real-time interaction force modeling and safety assessments, ensuring optimal performance and maintaining safety while interacting with unknown and complex environments. On the high level, the framework achieves predefined-time convergence via reinforcement learning for meta-parameter tuning, allowing users to specify the desired convergence time upper bound. Through physical experiments, the superiority of the proposed framework is validated on interaction safety and predefined-time convergence.
|
|
15:20-15:25, Paper ThCT5.5 | |
Tracking Control of 7-DOF Redundant Manipulators with Enhanced Null Space Compliance |
|
Tian, Xinyang | Beihang University |
Keywords: Compliance and Impedance Control
Abstract: In this paper, the problem of controlling a 7 degree-of-freedom (DOF) redundant manipulator to accurately execute tasks along a desired trajectory with time-varying position and orientation, while providing constrained compliant behavior within the null space, is considered. The objective of this work is to extend null space impedance control from the traditional fixed point to any pose, to meet the needs of human-robot physical interaction in practical applications such as service and medical robotics. To track the desired trajectory, a Cartesian impedance controller containing the desired task variables is derived. Redundancy is then exploited to handle human-robot interaction behavior by using the designed null space impedance controller while constraining the range of elbow motion. In addition, an analytical inverse kinematics (IK) solution is employed to guarantee that the compliant behavior of the null space is balanced for any elbow configuration. Finally, the performance of the proposed approach is verified through various experiments on a torque-controlled 7-DOF redundant manipulator.
|
|
15:25-15:30, Paper ThCT5.6 | |
A Smooth Analytical Formulation of Collision Detection and Rigid Body Dynamics with Contact |
|
Beker, Onur | University of Tübingen |
Gürtler, Nico | Max Planck Institute for Intelligent Systems |
Shi, Ji | University of Tübingen, Max Planck Institute for Intelligent Systems |
Geist, Andreas René | University of Tübingen |
Razmjoo, Amirreza | Idiap Research Institute |
Martius, Georg | Max Planck Institute for Intelligent Systems |
Calinon, Sylvain | Idiap Research Institute |
Keywords: Contact Modeling
Abstract: Generating intelligent robot behavior in contact-rich settings is a research problem where zeroth-order methods currently prevail. A major contributor to the success of such methods is their robustness in the face of non-smooth and discontinuous optimization landscapes that are characteristic of contact interactions, yet zeroth-order methods remain computationally inefficient. It is therefore desirable to develop methods for perception, planning and control in contact-rich settings that can achieve further efficiency by making use of first and second order information (i.e., gradients and Hessians). To facilitate this, we present a joint formulation of collision detection and contact modelling which, compared to existing differentiable simulation approaches, provides the following benefits: i) it results in forward and inverse dynamics that are entirely analytical (i.e. do not require solving optimization or root-finding problems with iterative methods) and smooth (i.e. twice differentiable), ii) it supports arbitrary collision geometries without needing a convex decomposition, and iii) its runtime is independent of the number of contacts. Through simulation experiments, we demonstrate the validity of the proposed formulation as a "physics for inference" that can facilitate future development of efficient methods to generate intelligent contact-rich behavior.
|
|
15:30-15:35, Paper ThCT5.7 | |
Flexi-SEA: Flexible-Shaft-Driven Series Elastic Actuator for Wearable Robots (I) |
|
Kong, Kyoungchul | Korea Advanced Institute of Science and Technology |
Choi, Sanguk | Korea Advanced Institute of Science and Technology |
Kim, Jongwon | KAIST |
Ko, Chanyoung | Korea Advanced Institute of Science and Technology |
Keywords: Actuation and Joint Mechanisms, Force Control, Wearable Robotics
Abstract: Series elastic actuators (SEAs) are widely used in wearable robots due to their precise torque control, backdrivability, and low mechanical impedance. However, SEAs that assist distal joints like the ankle often have a large moment of inertia due to components such as springs and frames, negatively affecting biomechanics and energy efficiency during walking. Therefore, cable-driven SEAs have been introduced to distribute mass, but they face inherent limitations, including human discomfort from joint stress, shear forces against the skin, and force transmission delays due to cable slack. This paper proposes the Flexible-shaft-driven SEA (Flexi-SEA), an innovative SEA designed to reduce its moment of inertia without the drawbacks of cable-driven SEAs. The Flexi-SEA features a heavy motor at the proximal part, connected via a flexible shaft to a spring-loaded end-effector at the distal joint. While the flexible shaft can transmit bidirectional torque with low inertia, its nonlinear, asymmetric, and spatially-varying torsional stiffness poses control challenges. To ensure high torque precision and robustness, a controller based on system linearization and a disturbance observer is implemented. The performance of the Flexi-SEA was evaluated through a series of tests, including human-involved experiments.
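To give a flavor of the disturbance-observer idea mentioned above (in a generic single-inertia setting rather than the Flexi-SEA hardware), the sketch below low-pass filters the torque that a nominal inertia model cannot explain and feeds it back as compensation; the inertia value, gains, and filter constant are assumptions for the example.

# Toy discrete-time disturbance-observer sketch (illustrative; not the Flexi-SEA controller).
import numpy as np

J_nom, dt, tau_f = 0.02, 0.001, 0.02       # nominal inertia, time step, Q-filter time constant
alpha = dt / (tau_f + dt)                  # first-order low-pass coefficient

d_hat, omega_prev = 0.0, 0.0
log = []
for k in range(2000):
    tau_cmd = 0.5 * np.sin(2 * np.pi * 1.0 * k * dt)   # torque command from the outer loop
    tau_applied = tau_cmd - d_hat                       # compensate the estimated disturbance
    d_true = 0.1                                        # constant lumped disturbance acting on the joint
    omega = omega_prev + dt / J_nom * (tau_applied + d_true)   # simple plant: J * domega = tau + d
    accel = (omega - omega_prev) / dt
    # DOB update: torque the nominal model cannot explain, passed through a low-pass Q filter.
    d_hat = (1 - alpha) * d_hat + alpha * (J_nom * accel - tau_applied)
    omega_prev = omega
    log.append(d_hat)
print(round(log[-1], 3))   # estimate converges toward the true disturbance (0.1)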
|
|
15:35-15:40, Paper ThCT5.8 | |
Smooth Surface-To-Surface Contact Control for Rope-Base Soft-Tip Manipulator (I) |
|
Sun, Guangli | The Chinese University of Hong Kong |
Zhong, Fangxun | The Chinese University of Hong Kong, Shenzhen |
Yue, Linzhu | The Chinese University of Hong Kong |
Li, Peng | Harbin Institute of Technology ShenZhen |
Chen, Zhi | Hefei University |
Li, Xiang | Tsinghua University |
Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Compliance and Impedance Control, Robust/Adaptive Control, Sensor-based Control
Abstract: A new control pipeline is proposed for the Rope-Base Soft-tip Manipulator (RBSM) to execute surface contact tasks while preventing jamming and slipping. The control pipeline enables smooth surface-to-surface contact for the RBSM using only force sensors, eliminating the dependence on additional pose measurement of the window surface plane and on soft-tip deformation information. The pipeline consists of three steps: a free-contact step, implemented by an exponential force-shaping controller to avoid force overshoot against the window surface; an orientation refinement step, implemented by a combined force and torque controller that makes the RBSM cleaning head surface stably adapt to the smooth window surface; and finally, a normal-force release step to reduce head jamming, combined with region coverage along a pre-defined vibration-less cleaning trajectory for smooth cleaning on the slippery window surface. The proposed pipeline has been validated on a rope-based cleaning manipulator prototype cleaning a common window surface. The force and velocity curves recorded during the cleaning experiment show that the proposed method achieves smooth scraping and cleaning despite significant unknown initial errors in surface orientation.
|
|
ThCT6 |
301 |
Data Sets for Robotics 1 |
Regular Session |
|
15:00-15:05, Paper ThCT6.1 | |
MV2: A Large-Scale 360-Degree Multi-View Maritime Vision Dataset for Object Detection and Segmentation |
|
Lee, Junseok | GIST(Gwangju Institute of Science and Technology) |
Kim, Jong-Won | GIST(Gwangju Institute of Science and Technology) |
Lee, Seongju | Gwangju Institue of Science and Technology (GIST) |
Kim, Taeri | Gwangju Institute of Science and Technology(GIST) |
Lee, Kyoobin | Gwangju Institute of Science and Technology |
Keywords: Data Sets for Robotic Vision, Computer Vision for Automation, Computer Vision for Transportation
Abstract: Reliable navigation of autonomous vessels critically depends on robust situational awareness, particularly object detection. For this, an accurate, 360-degree perception of the surrounding environment is essential. However, most existing datasets lack the comprehensive multi-view data required for this full environmental coverage. This absence of large-scale, multi-view image datasets specifically designed for maritime situational awareness on vessels presents a significant challenge. To address this, we introduce the Multi-View Maritime Vision (MV2) dataset, comprising 159,386 visible-light images captured from six distinct viewpoints around a vessel. MV2 provides a complete 360-degree omnidirectional perspective, offering critical support for maritime situational awareness applications. The dataset includes object bounding boxes, along with semantic, instance, and panoptic segmentation labels, and encompasses a wide range of environmental conditions, supporting diverse computer-vision tasks. Additionally, we benchmarked state-of-the-art object-detection and panoptic-segmentation models on MV2, demonstrating its contribution to advancing maritime autonomy research. The dataset is available at https://sites.google.com/view/multi-view-maritime-vision.
|
|
15:05-15:10, Paper ThCT6.2 | |
The Common Objects Underwater (COU) Dataset for Robust Underwater Object Detection |
|
Mukherjee, Rishi | University of Minnesota |
Singh, Sakshi | University of Minnesota |
McWilliams, Jack | University of Minnesota, Twin Cities |
Sattar, Junaed | University of Minnesota |
Keywords: Data Sets for Robotic Vision, Object Detection, Segmentation and Categorization, Recognition
Abstract: We introduce COU: Common Objects Underwater, an instance-segmented image dataset of commonly found man-made objects in multiple aquatic and marine environments. COU contains approximately 10K segmented images, annotated from images collected during a number of underwater robot field trials in diverse locations. COU has been created to address the lack of datasets with robust class coverage curated for underwater instance segmentation, which is particularly useful for training light-weight, real-time capable detectors for Autonomous Underwater Vehicles (AUVs). In addition, COU addresses the lack of diversity in object classes since the commonly available underwater image datasets focus only on marine life. Currently, COU contains images from both closed-water (pool) and open-water (lakes and oceans) environments, of 24 different classes of objects including marine debris, dive tools, and AUVs. To assess the efficacy of COU in training underwater object detectors, we use three state-of-the-art models to evaluate its performance and accuracy, using a combination of standard accuracy and efficiency metrics. The improved performance of COU-trained detectors over those solely trained on terrestrial data demonstrates the clear advantage of training with annotated underwater images. We make COU available for broad use under open-source licenses.
|
|
15:10-15:15, Paper ThCT6.3 | |
Articulation-Gen: 3D Part Segmentation and Articulated Object Generation |
|
Xu, Zhuoqun | Extreme Science AgentTech |
Liu, Yang | Extreme Science AgentTech |
Keywords: Data Sets for Robotic Vision, Simulation and Animation, Visual Learning
Abstract: Recent advances in 3D content generation, particularly 3D Gaussian Splatting (3DGS) and diffusion models, have significantly improved the synthesis of static shapes and textures. However, the modeling of dynamic articulations remains a significant challenge. Existing datasets lack physics-aware joint annotations, segmentation methods overlook kinematic constraints, and procedural generation techniques often prioritize space coverage over physical plausibility and visual realism. Motivated by these challenges, we propose Articulation-Gen, a scalable and robust framework for generating physically compliant, multi-joint 3D objects. Our approach comprises three components: (1) a 3D semantic segmentation module that integrates 2D visual models (SAM2 and DINO) to achieve 91.4% part segmentation accuracy by resolving occlusions via multi-view fusion with semantic consistency; (2) a physics-guided joint optimizer that combines spatial sampling with heuristic search to reach 93.7% axis alignment accuracy, representing a 20.6% improvement; and (3) an LLM-augmented URDF synthesis mechanism that automatically produces physically plausible kinematic descriptions with language annotations, thereby improving generation accuracy by 87.5%. Leveraging existing 3D asset datasets and generation techniques, we further construct a large-scale articulation asset dataset comprising 10.6K articulated objects with 45.2K validated joints. This dataset enables faster articulated asset generation while ensuring URDF compliance. By proposing our pipeline and dataset, this work provides foundational tools for physics-based computer graphics and embodied AI, advancing the frontiers of 3D content creation and robotic simulation.
|
|
15:15-15:20, Paper ThCT6.4 | |
Low-Effort Iterative Dataset Generation Pipeline for Unknown Object Instance Segmentation |
|
Jordan, Florian | Fraunhofer IPA |
Lindermayr, Jochen | Fraunhofer IPA |
Bormann, Richard | Fraunhofer IPA |
Huber, Marco F. | University of Stuttgart |
Keywords: Data Sets for Robotic Vision, Object Detection, Segmentation and Categorization, AI-Based Methods
Abstract: Robots operating in everyday environments encounter a wide variety of previously unseen objects. Deep Learning methods simplify unknown object and scene segmentation by structuring inherent real-world complexities, improving visual scene understanding. However, they need vast amounts of labeled high-variance data for training. Acquiring these labels for rich real-world data requires significant manual effort, especially for segmentation masks. Although interactive segmentation accelerates this process, these methods still require substantial manual interaction, and the creation of large datasets remains labor-intensive. Consequently, there is a lack of diverse, high-quality datasets for unknown object instance segmentation in everyday environments. This research proposes a semi-automatic, RGB-only algorithmic pipeline for annotating novel objects, reducing manual effort to iteratively placing objects in the scene. We investigate several change detection-based approaches, including remote sensing change detection methods (TTP model), the DeepBackgroundMattingV2 image matting model, and the Segment Anything Model (SAM1 + SAM2) prompted with automatically extracted change regions. We propose the novel ILIS dataset to evaluate these methods in challenging everyday scenes, displaying reliable automatic mask proposal performance of up to 0.9549 mIoU and 0.9565 boundary F1 score. This highlights the potential of this method to accelerate large-scale dataset creation, saving at least 27.27 hours per 1,000 images by eliminating manual annotations.
|
|
15:20-15:25, Paper ThCT6.5 | |
Indoor FireRescue Radar: 4D Indoor Millimeter Wave Dataset and Analysis for Hazardous Environment Perception |
|
Duan, Kangkang | The University of British Columbia |
Zhu, Zehao | University of British Columbia |
Zou, Zhengbo | Columbia University |
Keywords: Data Sets for Robotic Vision, Data Sets for Robot Learning, Object Detection, Segmentation and Categorization
Abstract: Fire-induced indoor environments, characterized by smoke, glare, and dimness, critically challenge rescue safety. While LiDAR and cameras suffer from signal attenuation, millimeter-wave (mmWave) radar exhibits robust imaging performance. Radar-based building mapping and object detection in indoor environments are thus required to facilitate situational awareness specified by firefighting standards. Prior radar datasets mostly focus on outdoor object detection, and the few existing indoor datasets remain insufficient in several aspects: (1) lacking adverse scenario analysis; (2) lacking raw analog-to-digital converter (ADC) data for dense point cloud generation; and (3) lacking 3D object annotations for building layout understanding. This work introduces the Indoor FireRescue Radar (IFR) dataset, a novel large-scale multimodal benchmark for indoor situational awareness. It includes 27K frames of 4D radar point cloud, co-calibrated with LiDAR, RGB camera, and IMU streams, alongside 3D object annotations across 10 buildings. This dataset also provides raw ADC data and sensor configuration metadata. We applied voxel-based and pillar-based object detectors to 4D radar-based indoor object detection. We also demonstrated the robustness of radar perception in fire-induced indoor environments by real smoke tests at a firefighter training facility. The dataset is available at: https://huggingface.co/datasets/yysd123/indoor_mmwave
|
|
15:25-15:30, Paper ThCT6.6 | |
FIReStereo: Forest InfraRed Stereo Dataset for UAS Depth Perception in Visually Degraded Environments |
|
Dhrafani, Devansh | Carnegie Mellon University |
Liu, Yifei | Carnegie Mellon University |
Jong, Andrew | Carnegie Mellon University |
Shin, Ukcheol | CMU (Carnegie Mellon University) |
He, Yao | Stanford University |
Harp, Tyler | Carnegie Mellon University |
Hu, Yaoyu | Carnegie Mellon University |
Oh, Jean | Carnegie Mellon University |
Scherer, Sebastian | Carnegie Mellon University |
Keywords: Data Sets for Robotic Vision, Deep Learning for Visual Perception, Robotics and Automation in Agriculture and Forestry
Abstract: Robust depth perception in visually-degraded environments is crucial for autonomous aerial systems. Thermal imaging cameras, which capture infrared radiation, are robust to visual degradation. However, due to lack of a large-scale dataset, the use of thermal cameras for unmanned aerial system (UAS) depth perception has remained largely unexplored. This paper presents a stereo thermal depth perception dataset for autonomous aerial perception applications. The dataset consists of stereo thermal images, LiDAR, IMU and ground truth depth maps captured in urban and forest settings under diverse conditions like day, night, rain, and smoke. We benchmark representative stereo depth estimation algorithms, offering insights into their performance in degraded conditions. Models trained on our dataset generalize well to unseen smoky conditions, highlighting the robustness of stereo thermal imaging for depth perception. We aim for this work to enhance robotic perception in disaster scenarios, allowing for exploration and operations in previously unreachable areas. The dataset and source code are available at firestereo.github.io.
|
|
15:30-15:35, Paper ThCT6.7 | |
Automatically Prepare Training Data for YOLO Using Robotic In-Hand Observation and Synthesis (I) |
|
Chen, Hao | Fujian Agriculture and Forestry University |
Wan, Weiwei | Osaka University |
Matsushita, Masaki | H.U. Group Research Inst. G. K., Japan |
Kotaka, Takeyuki | H.U. Group Research Inst. G. K., Japan |
Harada, Kensuke | Osaka University |
Keywords: Data Sets for Robotic Vision
Abstract: Deep learning methods have recently exhibited impressive performance in object detection. However, such methods needed much training data to achieve high recognition accuracy, which was time-consuming and required considerable manual work like labeling images. In this paper, we automatically prepare training data using robots. Considering the low efficiency and high energy consumption in robot motion, we proposed combining robotic in-hand observation and data synthesis to enlarge the limited data set collected by the robot. We first used a robot with a depth sensor to collect images of objects held in the robot’s hands and segment the object pictures. Then, we used a copy-paste method to synthesize the segmented objects with rack backgrounds. The collected and synthetic images are combined to train a deep detection neural network. We conducted experiments to compare YOLOv5x detectors trained with images collected using the proposed method and several other methods. The results showed that combined observation and synthetic images led to comparable performance to manual data preparation. They provided a good guide on optimizing data configurations and parameter settings for training detectors. The proposed method required only a single process and was a low-cost way to produce the combined data. Interested readers may find the data sets and trained models from the following GitHub repository: github.com/wrslab/tubedet.
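As a rough illustration of the copy-paste synthesis step (not the authors' released code from the tubedet repository; the file names and random paste policy are assumptions made only for illustration), a segmented object crop with an alpha mask can be composited onto a rack background as follows:

import random
import cv2
import numpy as np

def paste_object(background, obj_rgba):
    # Composite one segmented object (RGBA crop) at a random location on a background image.
    bg = background.copy()
    h, w = obj_rgba.shape[:2]
    H, W = bg.shape[:2]
    if h >= H or w >= W:
        return bg  # crop does not fit; skip pasting
    y, x = random.randint(0, H - h), random.randint(0, W - w)
    alpha = obj_rgba[:, :, 3:4].astype(np.float32) / 255.0  # segmentation mask used as alpha
    roi = bg[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * obj_rgba[:, :, :3].astype(np.float32) + (1.0 - alpha) * roi
    bg[y:y + h, x:x + w] = blended.astype(np.uint8)
    return bg  # the paste location (x, y, w, h) would also serve as the detector box label

# Hypothetical usage with made-up file names.
background = cv2.imread("rack_background.png")
obj = cv2.imread("in_hand_object_crop.png", cv2.IMREAD_UNCHANGED)  # RGBA crop from in-hand segmentation
cv2.imwrite("synthetic_sample.png", paste_object(background, obj))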
|
|
ThCT7 |
307 |
Human-Aware Motion Planning 3 |
Regular Session |
|
15:00-15:05, Paper ThCT7.1 | |
Human Implicit Preference-Based Policy Fine-Tuning for Multi-Agent Reinforcement Learning in USV Swarm |
|
Kim, Hyeonjun | Korea Military Academy |
Lee, Kanghoon | Korea Advanced Institute of Science and Technology |
Park, Junho | Kaist |
Li, Jiachen | University of California, Riverside |
Park, Jinkyoo | Korea Advanced Institute of Science and Technology |
Keywords: Reinforcement Learning, Human Factors and Human-in-the-Loop, Multi-Robot Systems
Abstract: Multi-Agent Reinforcement Learning (MARL) has shown promise in solving complex problems involving cooperation and competition among agents, such as an Unmanned Surface Vehicle (USV) swarm used in search and rescue, surveillance, and vessel protection. However, aligning system behavior with user preferences is challenging due to the difficulty of encoding expert intuition into reward functions. To address the issue, we propose a Reinforcement Learning with Human Feedback (RLHF) approach for MARL that resolves credit-assignment challenges through an Agent-Level Feedback system categorizing feedback into intra-agent, inter-agent, and intra-team types. To overcome the challenges of direct human feedback, we employ a Large Language Model (LLM) evaluator to validate our approach using feedback scenarios such as region constraints, collision avoidance, and task allocation. Our method effectively refines USV swarm policies, addressing key challenges in multi-agent systems while maintaining fairness and performance consistency.
|
|
15:05-15:10, Paper ThCT7.2 | |
Integrating Offline Pre-Training with Online Fine-Tuning: A Reinforcement Learning Approach for Robot Social Navigation |
|
Su, Run | Wuhan University of Science and Technology |
Fu, Hao | Wuhan University of Science and Technology |
Zhou, Shuai | Wuhan University of Science and Technology |
Fu, Yingao | Wuhan University of Science and Technology |
Keywords: Human-Aware Motion Planning, Reinforcement Learning, Collision Avoidance
Abstract: Offline reinforcement learning (RL) has emerged as a promising framework for addressing robot social navigation challenges. However, inherent uncertainties in pedestrian behavior and limited environmental interaction during training often lead to suboptimal exploration and distributional shifts between offline training and online deployment. To overcome these limitations, this paper proposes a novel offline-to-online fine-tuning RL algorithm for robot social navigation by integrating Return-to-Go (RTG) prediction into a causal Transformer architecture. Our algorithm features a spatiotemporal fusion model designed to precisely estimate RTG values in real-time by jointly encoding temporal pedestrian motion patterns and spatial crowd dynamics. This RTG prediction framework mitigates distribution shift by aligning offline policy training with online environmental interactions. Furthermore, a hybrid offline-online experience sampling mechanism is built to stabilize policy updates during fine-tuning, ensuring balanced integration of pre-trained knowledge and real-time adaptation. Extensive experiments in simulated social navigation environments demonstrate that our method achieves a higher success rate and lower collision rate compared to state-of-the-art baselines. These results underscore the efficacy of our algorithm in enhancing navigation policy robustness and adaptability. This work paves the way for more reliable and adaptive robotic navigation systems in real-world applications.
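For readers unfamiliar with the term, a Return-to-Go target of the kind referenced above is conventionally the (discounted) sum of rewards from the current timestep to the end of the episode; the short sketch below shows only that convention and is not the paper's spatiotemporal RTG predictor.

def returns_to_go(rewards, gamma=1.0):
    # RTG_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every timestep of one episode.
    rtg, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Toy social-navigation episode: small step penalties, terminal goal reward.
print(returns_to_go([-0.1, -0.1, -0.1, 1.0]))  # roughly [0.7, 0.8, 0.9, 1.0] up to rounding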
|
|
15:10-15:15, Paper ThCT7.3 | |
Multi-Agent Inverse Reinforcement Learning in Real World Unstructured Pedestrian Crowds |
|
Chandra, Rohan | University of Virginia |
Karnan, Haresh | The University of Texas at Austin |
Mehr, Negar | University of California Berkeley |
Stone, Peter | The University of Texas at Austin |
Biswas, Joydeep | The University of Texas at Austin |
Keywords: Human-Aware Motion Planning
Abstract: Social robot navigation in crowded public spaces such as university campuses, restaurants, grocery stores, and hospitals, is an increasingly important area of research. One of the core strategies for achieving this goal is to understand humans' intent--underlying psychological factors that govern their motion--by learning how humans assign rewards to their actions, typically via inverse reinforcement learning (IRL). Despite significant progress in IRL, learning reward functions of multiple agents simultaneously in dense unstructured pedestrian crowds has remained intractable due to the nature of the tightly coupled social interactions that occur in these scenarios e.g. passing, intersections, swerving, weaving, etc. In this paper, we present a new multi-agent maximum entropy inverse reinforcement learning algorithm for real world unstructured pedestrian crowds. Key to our approach is a simple, but effective, mathematical trick which we name the so-called "tractability-rationality trade-off" trick that achieves tractability at the cost of a slight reduction in accuracy. We compare our approach to the classical single-agent MaxEnt IRL as well as state-of-the-art trajectory prediction methods on several datasets including the ETH, UCY, SCAND, JRDB, and a new dataset, called Speedway, collected at a busy intersection on a University campus focusing on dense, complex agent interactions. Our key findings show that, on the dense Speedway dataset, our approach ranks 1st among top 7 baselines with > 2× improvement over single-agent IRL, and is competitive with state-of-the-art large transformer-based encoder-decoder models on sparser datasets such as ETH/UCY (ranks 3rd among top 7 baselines).
|
|
15:15-15:20, Paper ThCT7.4 | |
Safe Probabilistic Planning for Human-Robot Interaction Using Conformal Risk Control |
|
Gonzales, Jake | University of Washington |
Mizuta, Kazuki | University of Washington |
Leung, Karen | University of Washington |
Ratliff, Lillian | University of Washington |
Keywords: Robot Safety, Machine Learning for Robot Control, Human-Aware Motion Planning
Abstract: In this paper, we present a novel probabilistic safe control framework for human-robot interaction that combines control barrier functions (CBFs) with conformal risk control to provide formal safety guarantees while considering complex human behavior. The approach uses conformal risk control to quantify and control the prediction errors in CBF safety values and establishes formal guarantees on the probability of constraint satisfaction during interaction. We introduce an algorithm that dynamically adjusts the safety margins produced by conformal risk control based on the current interaction context. Through experiments on human-robot navigation scenarios, we demonstrate that our approach significantly reduces collision rates and safety violations as compared to baseline methods while maintaining high success rates in goal-reaching tasks and efficient control.
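A rough sketch of how a conformal safety margin is typically obtained (a simplified split-conformal quantile, not the paper's algorithm; the constraint form and variable names are assumptions):

import numpy as np

def conformal_margin(calibration_errors, alpha=0.1):
    # Empirical quantile of past CBF prediction errors with the usual finite-sample correction,
    # giving a margin that bounds the next error with probability at least 1 - alpha.
    errors = np.asarray(calibration_errors, dtype=float)
    n = errors.size
    q_level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    return float(np.quantile(errors, q_level))

# Hypothetical usage: require h(x) >= margin instead of h(x) >= 0 in the safety filter.
errors = np.abs(np.random.default_rng(0).normal(0.0, 0.05, size=200))
print(conformal_margin(errors, alpha=0.05))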
|
|
15:20-15:25, Paper ThCT7.5 | |
Development of a Cleaning Robot Capable of Self-Propelled Cleaning for Ducts in Real-World Environments Employing a Planetary Gear Mechanism |
|
Ono, Yuki | Chuo University |
Monma, Yosuke | Chuo University |
Ito, Fumio | Chuo University |
Nakamura, Taro | Chuo University |
Keywords: Human-Centered Automation, Robotics in Hazardous Fields, Biomimetics
Abstract: In this study, we develop an autonomous cleaning robot designed to remove the grease that accumulates in restaurant kitchen ducts, where human access and manual cleaning are not possible. Previous work developed a cleaning mechanism for round ducts that employs a planetary gear system to improve the efficiency of grease removal. However, these systems lacked a propulsion mechanism, and cleaning experiments were limited to short distances in small-diameter pipes (140 mm, 100A). As a result, no system had been developed that can clean grease over long distances in large-diameter ducts in real-world environments. To address this limitation, we developed a self-propelled cleaning robot that integrates a planetary-gear cleaning mechanism with an inchworm-inspired propulsion mechanism. The design of the propulsion mechanism involved modeling of the brush rotation torque, the gripping torque, and the gripping force. Based on this model, a duct inspection and cleaning robot incorporating both the propulsion mechanism and the cleaning mechanism was developed.
|
|
15:25-15:30, Paper ThCT7.6 | |
PrefMMT: Modeling Human Preferences in Preference-Based Reinforcement Learning with Multimodal Transformers |
|
Zhao, Dezhong | Beijing University of Chemical Technology |
Wang, Ruiqi | Purdue University |
Suh, Dayoon | Purdue University |
Kim, Taehyeon | Purdue University |
Yuan, Ziqin | Purdue University |
Min, Byung-Cheol | Purdue University |
Chen, Guohua | Beijing University of Chemical Technology |
Keywords: Human-Centered Robotics, Representation Learning, Reinforcement Learning
Abstract: Preference-based reinforcement learning (PbRL) shows promise in aligning robot behaviors with human preferences, but its success depends heavily on the accurate modeling of human preferences through reward models. Most methods adopt Markovian assumptions for preference modeling (PM), which overlook the temporal dependencies within robot behavior trajectories that impact human evaluations. While recent works have utilized sequence modeling to mitigate this by learning sequential non-Markovian rewards, they ignore the multimodal nature of robot trajectories, which consist of elements from two distinctive modalities: state and action. As a result, they often struggle to capture the complex interplay between these modalities that significantly shapes human preferences. In this paper, we propose a multimodal sequence modeling approach for PM by disentangling state and action modalities. We introduce a multimodal transformer network, named PrefMMT, which hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. Our experimental results demonstrate that PrefMMT consistently outperforms state-of-the-art PM and direct preference-based policy learning baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the MetaWorld benchmark. Source code and supplementary information are available at https://sites.google.com/view/prefmmt.
|
|
ThCT8 |
308 |
Human-Centered Robotics 1 |
Regular Session |
|
15:00-15:05, Paper ThCT8.1 | |
ICCO: Learning an Instruction-Conditioned Coordinator for Language-Guided Task-Aligned Multi-Robot Control |
|
Yano, Yoshiki | Nara Institute of Science and Technology |
Shibata, Kazuki | Nara Institute of Science and Technology |
Kokshoorn, Maarten (Martinus Hendrik Johannes Louis) | Technical University of Delft |
Matsubara, Takamitsu | Nara Institute of Science and Technology |
Keywords: Human-Centered Robotics, Multi-Robot Systems, Reinforcement Learning
Abstract: Recent advances in Large Language Models (LLMs) have enabled language-guided multi-robot systems, allowing robots to execute tasks based on natural language instructions. However, achieving effective coordination in distributed multi-agent environments remains challenging due to (1) misalignment between instructions and task requirements and (2) inconsistency in robot behaviors when interpreting ambiguous instructions independently. To address these challenges, we propose Instruction-Conditioned Coordinator (ICCO), a Multi-Agent Reinforcement Learning (MARL) framework designed to enhance coordination in language-guided multi-robot systems. ICCO consists of a Coordinator agent and multiple Local Agents, where the Coordinator generates Task-Aligned and Consistent Instructions (TACI) by integrating language instructions with environmental states, ensuring task alignment and behavioral consistency. The Coordinator and Local Agents are jointly trained to optimize a reward function that balances task efficiency and instruction following. A Consistency Enhancement Term is added to the learning objective to maximize mutual information between instructions and robot behaviors, further improving coordination. Simulation and real-world experiments validate ICCO's effectiveness in achieving language-guided task-aligned multi-robot control.
|
|
15:05-15:10, Paper ThCT8.2 | |
TR-LLM: Integrating Trajectory Data for Scene-Aware LLM-Based Human Action Prediction |
|
Takeyama, Kojiro | Toyota Motor North America, University of California Santa Barbara |
Liu, Yimeng | University of California, Santa Barbara |
Sra, Misha | University of California Santa Barbara |
Keywords: Human-Centered Robotics, Human-Centered Automation, Human-Aware Motion Planning
Abstract: Accurate prediction of human behavior is crucial for AI systems to effectively support real-world applications, such as autonomous robots anticipating and assisting with human tasks. Real-world scenarios frequently present challenges such as occlusions and incomplete scene observations, which can compromise predictive accuracy. Thus, traditional video-based methods often struggle due to limited temporal and spatial perspectives. Large Language Models (LLMs) offer a promising alternative. Having been trained on a large text corpus describing human behaviors, LLMs likely encode plausible sequences of human actions in a home environment. However, LLMs, trained primarily on text data, lack inherent spatial awareness and real-time environmental perception. They struggle with understanding physical constraints and spatial geometry. Therefore, to be effective in a real-world spatial scenario, we propose a multimodal prediction framework that enhances LLM-based action prediction by integrating physical constraints derived from human trajectories. Our experiments demonstrate that combining LLM predictions with trajectory data significantly improves overall prediction performance. This enhancement is particularly notable in situations where the LLM receives limited scene information, highlighting the complementary nature of linguistic knowledge and physical constraints in understanding and anticipating human behavior.
|
|
15:10-15:15, Paper ThCT8.3 | |
LBAP: Improved Uncertainty Alignment of LLM Planners Using Bayesian Inference |
|
Mullen, James | University of Maryland |
Manocha, Dinesh | University of Maryland |
Keywords: Human-Centered Robotics, Robot Companions, AI-Enabled Robotics
Abstract: Large language models (LLMs) showcase many desirable traits for intelligent and helpful robots. However, they are also known to hallucinate predictions. This issue is exacerbated in robotics where LLM hallucinations may result in robots confidently executing plans that are contrary to user goals or relying more frequently on human assistance. In this work, we present LBAP, a novel approach for utilizing off-the-shelf LLMs, alongside Bayesian inference for uncertainty Alignment in robotic Planners that minimizes hallucinations and human intervention. Our key finding is that we can use Bayesian inference to more accurately calibrate a robot's confidence measure by accounting for both scene grounding and world knowledge. This process allows us to mitigate hallucinations and better align the LLM's confidence measure with the probability of success. Through experiments in both simulation and the real world on tasks with a variety of ambiguities, we show that LBAP significantly increases success rate and decreases the amount of human intervention required relative to prior art. For example, in our real-world testing paradigm, LBAP decreases the human help rate of previous methods by over 33% at a success rate of 70%.
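One plausible reading of the Bayesian step (our simplification with made-up numbers, not the LBAP algorithm itself) is a posterior over candidate plans that fuses the LLM's own likelihoods with a scene-grounding prior:

def calibrated_confidence(llm_likelihoods, scene_priors):
    # Bayes' rule: posterior over candidate plans is proportional to
    # the scene-grounding prior times the LLM likelihood.
    unnormalized = [p * l for p, l in zip(scene_priors, llm_likelihoods)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Hypothetical example: the prior down-weights a plan that refers to an object absent from the scene.
print(calibrated_confidence(llm_likelihoods=[0.5, 0.3, 0.2], scene_priors=[0.6, 0.05, 0.35]))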
|
|
15:15-15:20, Paper ThCT8.4 | |
Autonomous Human-Robot Interaction Via Operator Imitation |
|
Christen, Sammy | Disney Research |
Mueller, David | Disney Research Robotics |
Serifi, Agon | ETH Zurich |
Grandia, Ruben | Disney Research |
Wiedebach, Georg | Disney |
Hopkins, Michael Anthony | Walt Disney Imagineering |
Knoop, Espen | The Walt Disney Company |
Bächer, Moritz | Disney Research |
Keywords: Human-Centered Robotics, Autonomous Agents, Machine Learning for Robot Control
Abstract: Teleoperated robotic characters can perform expressive interactions with humans, relying on the operators' experience and social intuition. In this work, we propose to create autonomous interactive robots, by training a model to imitate operator data. Our model is trained on a dataset of human-robot interactions, where an expert operator is asked to vary the interactions and mood of the robot, while the operator commands as well as the pose of the human and robot are recorded. Our approach learns to predict continuous operator commands through a diffusion process and discrete commands through a classifier, all unified within a single transformer architecture. We evaluate the resulting model in simulation and with a user study on the real system. We show that our method enables simple autonomous human-robot interactions that are comparable to the expert-operator baseline, and that users can recognize the different robot moods as generated by our model. Finally, we demonstrate a zero-shot transfer of our trained model onto a different robotic platform with the same operator interface.
|
|
15:20-15:25, Paper ThCT8.5 | |
Exo-ViHa: A Cross-Platform Exoskeleton System with Visual and Haptic Feedback for Efficient Dexterous Skill Learning |
|
Chao, Xintao | Tsinghua University |
Mu, Shilong | Tsinghua University |
Liu, Yushan | Tsinghua University |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Lyu, Chuqiao | Tsinghua Shenzhen International Graduate School |
Zhang, Xiao-Ping | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Keywords: Human-Centered Robotics, Imitation Learning, Datasets for Human Motion
Abstract: Imitation learning has emerged as a powerful paradigm for robot skills learning. However, traditional data collection systems for dexterous manipulation face challenges, including a lack of balance between acquisition efficiency, consistency, and accuracy. To address these issues, we introduce Exo-ViHa, an innovative 3D-printed exoskeleton system that enables users to collect data from a first-person perspective while providing real-time haptic feedback. This system combines a 3D-printed modular structure with a slam camera, a motion capture glove, and a wrist-mounted camera. Various dexterous hands can be installed at the end, enabling it to simultaneously collect the posture of the end effector, hand movements, and visual data. By leveraging the first-person perspective and direct interaction, the exoskeleton enhances the task realism and haptic feedback, improving the consistency between demonstrations and actual robot deployments. In addition, it has cross-platform compatibility with various robotic arms and dexterous hands. Experiments show that the system can significantly improve the success rate and efficiency of data collection for dexterous manipulation tasks.
|
|
15:25-15:30, Paper ThCT8.6 | |
Task Planning for a Factory Robot Using Large Language Model |
|
Tsushima, Yosuke | Toyota Motor East Japan, Inc., Tohoku University |
Yamamoto, Shu | Tohoku University |
Ravankar, Ankit A. | Tohoku University |
Salazar Luces, Jose Victorio | Tohoku University |
Hirata, Yasuhisa | Tohoku University |
Keywords: Human-Centered Automation, Autonomous Agents, Factory Automation
Abstract: In recent years, automation has significantly advanced the automobile manufacturing industry. However, many tasks still involve human intervention, so there is a demand for the development of robots that support workers. Additionally, as human-centric approaches such as Industry 5.0 gain attention, such support robots are expected to become necessary in the future. This study aims to develop a system that supports workers by utilizing robots that anyone can easily use and that can flexibly respond to various tasks. The system adopts a large language model (LLM) for work planning and generates tasks that robots can execute by making bidirectional, interactive suggestions and modifications through natural language dialogue in response to human demands, aiming to further improve productivity and the working environment in automobile manufacturing factories. The proposed system was tested in a simulated factory environment and its performance was then confirmed in an actual factory setting, verifying that various tasks can be executed by the robot through work planning and dialogue with the LLM.
|
|
15:30-15:35, Paper ThCT8.7 | |
Digitalization and the Future of Employment: A Case Study on the Canadian Offshore Oil and Gas Drilling Occupations (I) |
|
Wanasinghe, Thumeera Ruwansiri | Memorial University of Newfoundland |
Gosine, Raymond G. | Memorial University of Newfoundland |
Petersen, Bui K | Saint Mary's University |
Warrian, Peter | University of Toronto |
Keywords: Human-Centered Automation, Ethics and Philosophy, Acceptability and Trust
Abstract: This paper presents a novel approach to identifying reskilling requirements, job merging pathways, and a tentative timeline for transforming offshore oil and gas drilling occupations amid the fourth industrial revolution (industry 4.0). The proposed algorithm focuses on potential job merging due to technological adoption. It introduces a scaling factor named digital readiness level to incorporate modulation factors (e.g., cost of development and deployment of new technologies, labour market dynamics, economic benefits, regulatory readiness, and social acceptance) that act as catalysts or hindrances for technology adoption. A feature-based approach is developed to assess the similarities between occupations, while a mathematical model is developed to project automation trajectories for each job under investigation. These facilitate the consideration of potential job merging scenarios and the associated timeline. Since technology adoption depends on the industry, region, occupation, and stakeholder’s ability to manage the transformation, the proposed algorithm is presented as a case study on Canadian offshore oil and gas drilling occupations. However, this algorithm and approach can be applied to other industries or occupation structures. The proposed algorithm projects that the total number of personnel on board (POB) in a typical offshore drilling platform will be reduced to six by 2058. A sensitivity analysis was conducted to assess the robustness of the proposed algorithm against variations in the feature values and weighting factors. It was found that when changing feature values and weighting factors up to ±20% of their original values, only one job that remains after 2058 follows three different job merging pathways, while others remain unchanged. Even the job that followed three different pathways was composed of the same source jobs compared to the corresponding job in the baseline results.
|
|
15:35-15:40, Paper ThCT8.8 | |
A Digital Twin Driven Human-Centric Ecosystem for Industry 5.0 (I) |
|
Villani, Valeria | University of Modena and Reggio Emilia |
Picone, Marco | University of Modena and Reggio Emilia |
Mamei, Marco | University of Modena and Reggio Emilia |
Sabattini, Lorenzo | University of Modena and Reggio Emilia |
Keywords: Human-Centered Automation, Modeling and Simulating Humans, Human Factors and Human-in-the-Loop
Abstract: Industry 5.0 embodies the vision for the future of factories, emphasizing the importance of sustainable industrialization and the role of industry in society, through the key concept of placing the well-being of workers at the center of the production process. Building upon this vision, we propose a new paradigm to design human-centric industrial applications. To this end, we exploit Digital Twin (DT) technology to build a digital replica for each entity on the shop floor and support and augment interaction among workers and machines. While so far DTs in automation have been proposed for machine digitalization, the core element of the proposed approach is the Operator Digital Twin (ODT). In this scenario, biometrics allows us to build a reliable model of those operator characteristics that are relevant in working contexts. Biometric traits are measured and processed to detect physical, emotional, and mental conditions, which are used to define the operator’s state. Prospectively, this allows production and processes to be managed and monitored in an operator-in-the-loop manner, where not only is the operator aware of the state of the plant, but also any technological agent in the plant acts and reacts according to the operator’s needs and conditions. In this paper, we define the modeling of the envisioned ecosystem, present the designed DT’s blueprint architecture, discuss its implementation in relevant application scenarios, and report an example of implementation in a collaborative robotics scenario. Note to Practitioners—This paper was motivated by the problem of designing human-cyber-physical systems, where production processes are managed by concurrently taking into account operators, machines, and plant status. This answers the needs of the novel Industry 5.0 paradigm, which aims to enhance social sustainability of modern factories. To this end, we propose an architecture based on digital twins that allows us to develop a digital layer, detached from the physical one, where the plant can be monitored and managed. This allows the creation of a digital ecosystem where machines, operators, and the interactions among them are represented, augmented, and managed. We discuss how the proposed architecture can be applied to three relevant scenarios: remote training and maintenance, line operation, and line supervision. Moreover, the implementation in a collaborative robotics scenario is presented, to provide an example of how the proposed architecture can be implemented in industrial scenarios.
|
|
ThCT9 |
309 |
Vision for Automation |
Regular Session |
|
15:00-15:05, Paper ThCT9.1 | |
Omni-Scan: Creating Visually-Accurate Digital Twin Object Models Using a Bimanual Robot with Handover and Gaussian Splat Merging |
|
Qiu, Tianshuang | University of California, Berkeley |
Ma, Zehan | University of California, Berkeley |
El-Refai, Karim | University of California, Berkeley |
Shah, Hiya | University of California, Berkeley |
Kim, Chung Min | University of California, Berkeley |
Kerr, Justin | University of California, Berkeley |
Goldberg, Ken | UC Berkeley |
Keywords: Computer Vision for Automation, Computer Vision for Manufacturing, Simulation and Animation
Abstract: 3D Gaussian Splats (3DGSs) are 3D object models derived from multi-view images. Such “digital twins” are useful for simulations, virtual reality, marketing, robot policy fine-tuning, and part inspection. 3D object scanning usually requires multi-camera arrays, precise laser scanners, or robot wrist-mounted cameras, which have restricted workspaces. We propose Omni-Scan, a pipeline for producing high-quality 3D Gaussian Splat models using a bi-manual robot that grasps an object with one gripper and rotates the object with respect to a stationary camera. The object is then re-grasped by a second gripper to expose surfaces that were occluded by the first gripper. We present the Omni-Scan robot pipeline using DepthAnything, Segment Anything, as well as RAFT optical flow models to identify and isolate objects held by a robot gripper while removing the gripper and the background. We then modify the 3DGS training pipeline to support concatenated datasets with gripper occlusion, producing an omni-directional (360°) model of the object. We apply Omni-Scan to part defect inspection, finding that it can identify visual or geometric defects in 12 different industrial and household objects with an average accuracy of 83%. Interactive videos of Omni-Scan 3DGS models can be found at https://berkeleyautomation.github.io/omni-scan/
|
|
15:05-15:10, Paper ThCT9.2 | |
Layer Decomposition and Morphological Reconstruction for Task-Oriented Infrared Image Enhancement |
|
Chai, Si Yuan | Beijing Institute of Technology |
Guo, Xiaodong | Beijing Institute of Technology |
Liu, Tong | Beijing Institute of Technology |
Keywords: Computer Vision for Automation, Autonomous Vehicle Navigation, Object Detection, Segmentation and Categorization
Abstract: Infrared image helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared image often suffers from low contrast, especially in non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying noise and losing important information remains a challenge. To address these challenges, we propose a task-oriented infrared image enhancement method. Our approach consists of two key components: layer decomposition and saliency information extraction. First, we design an l0-l1 layer decomposition method for infrared images, which enhances scene details while preserving dark region features, providing more features for subsequent saliency information extraction. Then, we propose a morphological reconstruction-based saliency extraction method that effectively extracts and enhances target information without amplifying noise. Our method improves the image quality for object detection and semantic segmentation tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods.
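The abstract does not state the exact objective, but an l0-l1 layer decomposition of an infrared image I into a base layer B and a detail layer D = I - B is commonly written along the lines of the following hedged reconstruction (lambda is a trade-off weight; this is not necessarily the authors' formulation):

\min_{B} \; \|\nabla B\|_{0} + \lambda \|I - B\|_{1}, \qquad D = I - B,

where the l0 gradient term flattens the base layer while keeping strong edges, and the l1 term keeps the detail layer sparse so that dark-region features survive the enhancement.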
|
|
15:15-15:20, Paper ThCT9.4 | |
Unidirectional Point-Voxel Fusion for Enhanced 3D Single Object Tracking |
|
Jiang, Yuyu | Nanjing University of Posts and Telecommunications |
Fan, Baojie | Nanjing University of Posts and Telecommunications |
Du, Jinrong | Nanjing University of Posts and Telecommunications |
Yao, Ying | Nanjing University of Posts and Telecommunications |
Yang, Yu Shi | Southeast University |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Autonomous Vehicle Navigation
Abstract: Sparse point-based trackers struggle with textureless and incomplete point clouds. Conversely, dense voxel-based trackers have richer spatial and semantic information, but filtering out interference from complex backgrounds remains a challenge. Additionally, there is still a gap between point and voxel-based trackers in exploiting their complementary strengths. To address these issues, we propose UTracker, which uses unidirectional point-voxel fusion to construct a bridge between point and voxel tracking features, enabling them to complement and enhance each other. Specifically, we design template-enhanced unidirectional attention (TEUA) and historical template fusion (HTF), which enable unidirectional interaction from historical templates to the search area in the point branch, retaining the pure template features. Then, a point-guided adaptive feature transformer (PGAFT) is developed to unidirectionally enhance the interaction between point and voxel features. Extensive experiments demonstrate that UTracker achieves superior performance, reaching an average accuracy of 89.5%, 72.58%, and 63.4% on the KITTI, NuScenes, and Waymo Open Dataset, respectively.
|
|
15:20-15:25, Paper ThCT9.5 | |
ETO+: Revisit the Refinement Stage in Efficient Feature Matching |
|
Ni, Junjie | Zhejiang University |
Shen, Yichen | Zhejiang University |
Li, Yijin | Zhejiang University |
Zhai, Hongjia | Zhejiang University |
Bao, Hujun | Zhejiang University |
Zhang, Guofeng | Zhejiang University |
Keywords: Computer Vision for Automation, Vision-Based Navigation, Computer Vision for Transportation
Abstract: Recent feature matching approaches like ETO have focused on developing lightweight matching algorithms for real-time applications. However, their lack of cross-image feature interaction and sufficient refinement often leads to a decline in matching accuracy. To address these challenges, we propose ETO+, a novel and accurate feature matching algorithm that incorporates a lightweight yet efficient bidirectional interaction module and multi-stage refinement. Specifically, we introduce Trans-CNN, a bidirectional feature interaction module that integrates CNN- and transformer-based techniques to enhance both intra-image feature refinement and inter-image feature fusion, all while maintaining a comparable computational cost. Furthermore, by leveraging the inherent sparsity of local feature matching, we propose an efficient strategy to adaptively reallocate computational resources within the network. Additionally, we design an adaptive loss function that mitigates the impact of large matching errors, thereby improving overall robustness. Extensive experiments on widely used datasets demonstrate that our approach achieves a strong balance between accuracy and computational efficiency. It outperforms ETO by 7.9 in AUC@5 on MegaDepth, while being about 40% faster than E-LoFTR.
|
|
15:25-15:30, Paper ThCT9.6 | |
Rapid and Simultaneous Visual-Based Estimation of Kinematic and Hand-Eye Parameters of Industrial Mobile Manipulators |
|
Mutti, Stefano | SUPSI ISTePS |
Pedrocchi, Nicola | National Research Council of Italy (CNR) |
Renò, Vito | Stiima - Cnr |
Valente, Anna | SUPSI-ISTePS |
Keywords: Calibration and Identification, Computer Vision for Automation, Mobile Manipulation
Abstract: Manufacturing applications increasingly integrate visually aided robotic systems. Such systems must rely on excellent kinematic parameter calibration and a hand-eye matrix estimation to perform according to standards. The latter is as precise as the camera pose estimation capability and the robotic forward kinematic precision. To enhance the overall system's precision, one must simultaneously act on and improve the robot's kinematic parameters and hand-eye transformation due to mutual inference. This work exploits standard 2D camera systems to simultaneously estimate the kinematic parameters and the hand-eye transformation matrix through a method based on the Unscented Kalman Filter (UKF) and the transportation of parameter uncertainty through the robot's kinematics. The method employs data gathered during the robot movements and camera readings and iteratively improves the system parameters' estimate. The method is applied to industrial mobile manipulators and tested on both synthetic data and real experiment data, showing a great improvement in the kinematic parameter estimation.
|
|
15:30-15:35, Paper ThCT9.7 | |
Edge-Guided Lighting Adaptation: Real-Time Detection of Transparent Objects for Cell Culture Robot |
|
Qingze, Huang | Beijing University of Posts and Telecommunications |
Wang, Peng | Beijing University of Posts and Telecommunications |
Zhang, Xiangyan | Beijing Information Science and Technology University |
Wang, Sikai | Beijing University of Posts and Telecommunications |
Li, Jian | Beihang University & National Research Center for Rehabilitation |
Wei, Shimin | Beijing University of Posts and Telecommunications |
Keywords: Computer Vision for Automation, Data Sets for Robotic Vision, Biological Cell Manipulation
Abstract: In robot-assisted cell culture tasks, fluctuations in lighting conditions can result in blurred boundaries, intensified reflections, and pronounced refractions of transparent objects. These optical phenomena collectively escalate the complexity of image processing and target recognition. To address these challenges, this paper takes a dual-strategy approach. Firstly, it utilizes the Unity platform to construct a synthetic dataset (STTO-9k) containing 9,000 images of six types of transparent objects, providing abundant training samples for the detection and recognition of transparent objects. Secondly, it proposes an improved YOLOv8 visual detection algorithm (YOLO-Edge-Guided Lighting Adaptation, YL-EGLA). The algorithm realizes feature fusion by dynamically extracting the high-dimensional features of the input through the self-attention mechanism combined with the enhanced edge features extracted by the edge detection operator, and is equipped with adaptive image enhancement module to ensure stable detection under different lighting conditions. Algorithm comparison results demonstrate that the YL-EGLA can be fully trained on the synthetic dataset and directly applied to real-world scenarios without additional fine-tuning. Furthermore, physical experiments further validate the efficiency and practicality of this algorithm in transparent object manipulation, fully showcasing its significant value in practical applications.
|
|
15:35-15:40, Paper ThCT9.8 | |
SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation |
|
Li, Xin | Shanghai Jiao Tong University |
Huang, Siyuan | Shanghai Jiao Tong University |
Yu, Qiaojun | Shanghai Jiao Tong University |
Jiang, Zhengkai | Tencent |
Hao, Ce | University of California, Berkeley |
Zhu, Yimeng | Yuandao AI |
Li, Hongsheng | Chinese University of Hong Kong |
Gao, Peng | Shanghai AI Lab |
Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Computer Vision for Automation, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: Automating garment manipulation poses a significant challenge for assistive robotics due to the diverse and deformable nature of garments. Traditional approaches typically require separate models for each garment type, which limits scalability and adaptability. In contrast, this paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. By interpreting both visual and semantic information, our model enables robots to manage different garment states with a single model. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation. In addition, this research also underscores the potential of VLMs to unify various garment manipulation tasks within a single framework, paving the way for broader applications in home automation and assistive robotics in the future.
|
|
ThCT10 |
310 |
Visual Tracking |
Regular Session |
Chair: Yu, Jingjin | Rutgers University |
|
15:00-15:05, Paper ThCT10.1 | |
Tracking Any Point with Frame-Event Fusion Network at High Frame Rate |
|
Liu, Jiaxiong | National University of Defense Technology |
Wang, Bo | National University of Defense Technology |
Tan, Zhen | National University of Defense Technology |
Zhang, Jinpu | National University of Defense Technology |
Shen, Hui | National University of Defense Technology |
Hu, Dewen | National University of Defense Technology |
Keywords: Visual Tracking, Localization, Deep Learning Methods
Abstract: Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities, operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the points' trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24% on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device.
|
|
15:05-15:10, Paper ThCT10.2 | |
SEI3D: CPU-Only 3D Object Tracking Fusing Sparse-Flow-Filtered Edge and Interior Alignment |
|
Chen, Jixiang | Beijing Institute of Technology |
Chen, Jing | Beijing Institutue of Technology |
Liu, Kai | Beijing Institute of Technology |
Lei, Ting | Beijing Institute of Technology |
Wang, Leshan | Beijing Institute of Technology |
Keywords: Visual Tracking, Computer Vision for Automation, Computer Vision for Manufacturing
Abstract: Monocular 3D object tracking methods are widely employed in robotic applications; however, they often struggle with low-contrast image sequences. In this paper, we introduce a novel approach to filtering redundant edges in images by leveraging sparse interior correspondences. Our method features a sparse-flow-based probability segmentation model that comprises both coarse and fine components. The coarse model evaluates the ratio of interior correspondences within a circular region centered on each pixel, while the fine model employs a binary Gaussian kernel based on the nearest interior correspondences. This probability framework facilitates the identification of control points for object edges. Additionally, we implement a robust gradient consistency-based edge connection algorithm to generate refined object edges. Utilizing these filtered edges, we formulate an edge-based energy function that accounts for object contour shape and noise uncertainty, seamlessly integrating into a multi-feature pose optimization framework. Our multi-feature fusion strategy achieves state-of-the-art performance on both public datasets and real-world applications, operating at 60 Hz using only a CPU.
|
|
15:10-15:15, Paper ThCT10.3 | |
Multiple Object Tracking with Dynamic Adaptive Object Motion Estimation |
|
Cheng, Borui | Northeastern University |
Zhang, Yunzhou | Northeastern University |
Keywords: Visual Tracking, Sensor Fusion
Abstract: One indicator for evaluating autonomous vehicles’ capability is the accuracy of perceiving the surrounding environment. As an essential part of perception, multi-object tracking (MOT) algorithms provide vital guarantees for safe driving. However, many MOT algorithms based on the motion model only consider the information from the previous frame when predicting the motion state of objects, without taking into account the long-term motion state. Moreover, their motion model generally uses constant speed or acceleration models, which may cause tracking loss when the object suddenly changes its motion or is occluded by other objects. In this paper, we propose DA-MOT (Multiple Object Tracking with Dynamic Adaptive Object Motion Estimation), which utilizes information from lidar and camera sensors to calculate objects’ dynamic and static states under different sensor information. We modify the KF motion model parameters based on the object’s motion for better tracking performance. Furthermore, we design a re-association mechanism to re-assign IDs for inaccurate associations. We conducted experiments on the KITTI dataset, and the results show a significant improvement in accuracy. The DA-MOT algorithm achieves about a 1.5% improvement in MOTA metrics compared to other MOT algorithms in scenes with large changes in object state and can run at about 1000 fps on an Intel Xeon Gold 5217 CPU.
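As a generic illustration of the dynamic adaptive idea (not the DA-MOT implementation; the threshold and scaling rule are assumptions), a constant-velocity Kalman filter can loosen its process noise whenever the innovation indicates a sudden motion change:

import numpy as np

def adapt_process_noise(Q_base, innovation, threshold=2.0, boost=5.0):
    # Inflate the process-noise covariance when the measurement disagrees strongly with the
    # constant-velocity prediction, so the filter re-trusts observations during maneuvers.
    if np.linalg.norm(innovation) > threshold:
        return Q_base * boost
    return Q_base

# Hypothetical usage inside a per-object update loop (innovation = measurement - predicted state).
Q = adapt_process_noise(np.eye(4) * 0.01, np.array([0.3, 2.5, 0.0, 0.0]))
print(Q[0, 0])  # 0.05: noise was boosted because the object changed motion abruptly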
|
|
15:15-15:20, Paper ThCT10.4 | |
Multi-Target Association and Localization with Distributed Drone Following: A Factor Graph Approach |
|
Ye, Kaixiao | Northwestern Polytechnical University |
Shao, Weiyu | Northwestern Polytechnical University |
Zheng, Yuhang | Northwestern Polytechnical University |
Fang, Bohui | Northwestern Polytechnical University |
Yang, Tao | Northwestern Polytechnical University |
Keywords: Visual Tracking, Distributed Robot Systems, Aerial Systems: Perception and Autonomy
Abstract: Vision-based multi-drone multi-object tracking technology enables autonomous target situational awareness for unmanned aerial systems. Distributed observer drones dynamically estimate the spatio-temporal states of multiple targets through collaborative sensor fusion, enabling simultaneous localization and persistent following of the target of interest in cluttered airspaces. The challenge lies in distinguishing targets in different drones’ views and keeping the target of interest within the field of view. This paper proposes a factor graph method for joint multi-target association and localization with distributed drone following. Sensor measurements and control constraints are integrated into a probabilistic factor graph to solve the bundle adjustment and model predictive control, respectively. Both simulation and real-world experiments prove the effectiveness and robustness of our proposed approach. The source code will be available at: https://github.com/npu-ius-lab/MLMF.
|
|
15:20-15:25, Paper ThCT10.5 | |
MambaNUT: Nighttime UAV Tracking Via Mamba-Based Adaptive Curriculum Learning |
|
Wu, You | Guilin University of Technology |
Yang, Xiangyang | Guilin University of Technology |
Wang, Xucheng | Fudan University |
Ye, Hengzhou | Guilin University of Technology |
Zeng, Dan | Sun Yat-Sen University |
Li, Shuiwang | Guilin University of Technology |
Keywords: Visual Tracking, Deep Learning Methods, Performance Evaluation and Benchmarking
Abstract: Harnessing low-light enhancement and domain adaptation, nighttime UAV tracking has made substantial strides. However, over-reliance on image enhancement, limited high-quality nighttime data, and a lack of integration between daytime and nighttime trackers hinder the development of an end-to-end trainable framework. Additionally, current ViT-based trackers demand heavy computational resources due to their reliance on the self-attention mechanism. In this paper, we propose a novel pure Mamba-based tracking framework (MambaNUT) that employs a state space model with linear complexity as its backbone, incorporating a single-stream architecture that integrates feature learning and template-search coupling within Vision Mamba. We introduce an adaptive curriculum learning (ACL) approach that dynamically adjusts sampling strategies and loss weights, thereby improving the model's ability of generalization. Our ACL is composed of two levels of curriculum schedulers: (1) sampling scheduler that transforms the data distribution from imbalanced to balanced, as well as from easier (daytime) to harder (nighttime) samples; (2) loss scheduler that dynamically assigns weights based on the size of the training set and IoU of individual instances. Exhaustive experiments on multiple nighttime UAV tracking benchmarks demonstrate that the proposed MambaNUT achieves state-of-the-art performance while requiring lower computational costs. The code will be available at: https://github.com/wuyou3474/MambaNUT.
|
|
15:25-15:30, Paper ThCT10.6 | |
RGBTrack: Fast, Robust Depth-Free 6D Pose Estimation and Tracking |
|
Guo, Teng | Rutgers University |
Yu, Jingjin | Rutgers University |
Keywords: Visual Tracking, RGB-D Perception, Computer Vision for Automation
Abstract: We introduce a robust framework, RGBTRACK, for real-time 6D pose estimation and tracking that operates solely on RGB data, thereby eliminating the need for depth input for such dynamic and precise object pose tracking tasks. Building on the FoundationPose architecture, we devise a novel binary search strategy combined with a render-and-compare mechanism to efficiently infer depth and generate robust pose hypotheses from true-scale CAD models. To maintain stable tracking in dynamic scenarios, including rapid movements and occlusions, RGBTRACK integrates state-of-the-art 2D object tracking (XMem) with a Kalman filter and a state machine for proactive object pose recovery. In addition, RGBTRACK’s scale recovery module dynamically adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques. Extensive evaluations on benchmark datasets demonstrate that RGBTRACK’s novel depth-free approach achieves competitive accuracy and real-time performance, making it a promising practical solution candidate for application areas including robotics, augmented reality, and computer vision. The source code for our implementation will be made publicly available at https://github.com/GreatenAnoymous/RGBTrack.git. A demonstration video can be found at https://youtu.be/uCd3zJl0b6A.
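The binary-search-over-depth idea can be conveyed with a pinhole-camera stand-in for the render-and-compare step (our toy illustration; the real pipeline renders the CAD model and compares masks, and all names and bounds below are assumptions):

def projected_height_px(object_height_m, depth_m, focal_px=600.0):
    # Pinhole stand-in for rendering: apparent object height in pixels at a candidate depth.
    return focal_px * object_height_m / depth_m

def binary_search_depth(observed_height_px, object_height_m, z_min=0.1, z_max=5.0, iters=30):
    # Halve the depth interval until the rendered size matches the observed 2D mask size.
    for _ in range(iters):
        z_mid = 0.5 * (z_min + z_max)
        if projected_height_px(object_height_m, z_mid) > observed_height_px:
            z_min = z_mid  # rendered too large -> object must be farther away
        else:
            z_max = z_mid  # rendered too small -> object must be closer
    return 0.5 * (z_min + z_max)

# A 0.2 m object observed as 120 px should resolve to roughly 1.0 m with f = 600 px.
print(binary_search_depth(observed_height_px=120.0, object_height_m=0.2))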
|
|
15:30-15:35, Paper ThCT10.7 | |
6-DoF Object Tracking with Event-Based Optical Flow and Frames |
|
Li, Zhichao | Istituto Italiano Di Tecnologia |
Glover, Arren | Istituto Italiano Di Tecnologia |
Bartolozzi, Chiara | Istituto Italiano Di Tecnologia |
Natale, Lorenzo | Istituto Italiano Di Tecnologia |
Keywords: Visual Tracking
Abstract: Tracking the position and orientation of objects in space (i.e., in 6-DoF) in real time is a fundamental problem in robotics for environment interaction. It becomes more challenging when objects move at high-speed due to frame rate limitations in conventional cameras and motion blur. Event cameras are characterized by high temporal resolution, low latency and high dynamic range, that can potentially overcome the impacts of motion blur. Traditional RGB cameras provide rich visual information that is more suitable for the challenging task of single-shot object pose estimation. In this work, we propose using event-based optical flow combined with an RGB based global object pose estimator for 6-DoF pose tracking of objects at high-speed, exploiting the core advantages of both types of vision sensors. Specifically, we propose an event-based optical flow algorithm for object motion measurement to implement an object 6-DoF velocity tracker. By integrating the tracked object 6-DoF velocity with low frequency estimated pose from the global pose estimator, the method can track pose when objects move at high-speed. The proposed algorithm is tested and validated on both synthetic and real world data, demonstrating its effectiveness, especially in high-speed motion scenarios.
|
|
ThCT11 |
311A |
Reinforcement Learning 11 |
Regular Session |
|
15:00-15:05, Paper ThCT11.1 | |
Energy-Efficient Motion Planner for Legged Robots |
|
Schperberg, Alexander | Mitsubishi Electric Research Laboratories |
Menner, Marcel | Mitsubishi Electric Research Laboratories (MERL) |
Di Cairano, Stefano | Mitsubishi Electric Research Laboratories |
Keywords: Legged Robots, Motion and Path Planning, Reinforcement Learning
Abstract: We propose an online motion planner for legged robot locomotion with the primary objective of achieving energy efficiency. The conceptual idea is to leverage a placement set of footstep positions based on the robot's body position to determine when and how to execute steps. In particular, the proposed planner uses virtual placement sets beneath the hip joints of the legs and executes a step when the foot is outside of such a placement set. Furthermore, we propose a parameter design framework that considers both energy-efficiency and robustness measures to optimize the gait by changing the shape of the placement set along with other parameters, such as step height and swing time, as a function of walking speed. We show that the planner produces trajectories that have a low Cost of Transport (CoT) and a high robustness measure, and evaluate our approach against model-free Reinforcement Learning (RL) and motion imitation using biological dog motion priors as the reference. Overall, within the low to medium velocity range, we show a 50.4% improvement in CoT and improved robustness over model-free RL, our best-performing baseline. Finally, we show the ability to handle slippery surfaces, gait transitions, and disturbances in simulation and hardware with the Unitree A1 robot.
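The step-trigger rule described above is easy to sketch. The snippet below assumes an elliptical placement set centred beneath the hip and checks whether the foot has left it; the elliptical shape, axis lengths, and function name are illustrative assumptions, not the paper's optimized parameters.

```python
import numpy as np

def should_step(foot_xy, hip_xy, half_axes):
    """Hypothetical step trigger: take a step when the foot leaves an
    elliptical virtual placement set centred beneath the hip joint.

    foot_xy, hip_xy : 2D positions in the ground plane (metres)
    half_axes       : (a, b) semi-axes of the placement ellipse; in the paper
                      the set's shape is optimized as a function of speed.
    """
    dx, dy = np.asarray(foot_xy, dtype=float) - np.asarray(hip_xy, dtype=float)
    a, b = half_axes
    return (dx / a) ** 2 + (dy / b) ** 2 > 1.0

# The foot has drifted 12 cm behind a hip with a 10 cm x 6 cm placement set.
print(should_step(foot_xy=(-0.12, 0.0), hip_xy=(0.0, 0.0), half_axes=(0.10, 0.06)))
```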
|
|
15:05-15:10, Paper ThCT11.2 | |
Learning Symmetric Legged Locomotion Via State Distribution Symmetrization |
|
Zhu, Chengrui | Zhejiang University |
Zhang, Zhen | Zhejiang University |
Li, Siqi | Zhejiang University |
Li, Qingpeng | Zhejiang University |
Liu, Yong | Zhejiang University |
Keywords: Legged Robots, Humanoid and Bipedal Locomotion, Reinforcement Learning
Abstract: Morphological symmetry is a fundamental characteristic of legged animals and robots. Most existing Deep Reinforcement Learning approaches for legged locomotion neglect to exploit this inherent symmetry, often producing unnatural and suboptimal behaviors such as dominant legs or non-periodic gaits. To address this limitation, we propose a novel learning-based framework to systematically optimize symmetry by state distribution symmetrization. First, we introduce the Degree of Asymmetry (DoA), a quantitative metric that measures the discrepancy between original and mirrored state distributions. Second, we develop an efficient computation method for DoA using gradient ascent with a trained discriminator network. This metric is then incorporated into a reinforcement learning framework by introducing it to the reward function, explicitly encouraging symmetry during policy training. We validate our framework with extensive experiments on quadrupedal and humanoid robots in simulated and real-world environments. Results demonstrate the efficacy of our approach for improving policy symmetry and overall locomotion performance.
|
|
15:10-15:15, Paper ThCT11.3 | |
Explosive Jumping with Rigid and Articulated Soft Quadrupeds Via Example Guided Reinforcement Learning |
|
Apostolides, Georgios | TU Delft |
Pan, Wei | The University of Manchester |
Kober, Jens | TU Delft |
Della Santina, Cosimo | TU Delft |
Ding, Jiatao | University of Trento |
Keywords: Legged Robots, Motion Control, Reinforcement Learning
Abstract: Achieving controlled jumping behaviour for a quadruped robot is a challenging task, especially when introducing passive compliance in mechanical design. This study addresses this challenge via imitation-based deep reinforcement learning with a progressive training process. To start, we learn the jumping skill by mimicking a coarse jumping example generated by model-based trajectory optimization. Subsequently, we generalize the learned policy to broader situations, including various distances in both forward and lateral directions, and then pursue robust jumping in unknown ground unevenness. In addition, without tuning the reward much, we learn the jumping policy for a quadruped with parallel elasticity. Results show that using the proposed method, i) the robot learns versatile jumps by learning only from a single demonstration, ii) the robot with parallel compliance reduces the landing error by 11.1%, saves energy cost by 15.2%, and reduces the peak torque by 15.8%, compared to the rigid robot without parallel elasticity, iii) the robot can perform jumps of variable distances with robustness against ground unevenness (maximal ±4 cm height perturbations) using only proprioceptive perception.
|
|
15:15-15:20, Paper ThCT11.4 | |
Transferable Latent-To-Latent Locomotion Policy for Efficient and Versatile Motion Control of Diverse Legged Robots |
|
Zheng, Ziang | Tsinghua University |
Zhan, Guojian | Tsinghua University |
Shuai, Bin | The School of Vehicle and Mobility, Tsinghua University |
Qin, Shengtao | Tsinghua University |
Li, Jiangtao | SunRising AI Ltd |
Zhang, Tao | Autonavi |
Li, Shengbo Eben | Tsinghua University |
Keywords: Legged Robots, Reinforcement Learning, Learning from Experience
Abstract: Reinforcement learning (RL) has demonstrated remarkable capability in acquiring robot skills, but learning each new skill still requires substantial data collection for training. The pretrain-and-finetune paradigm offers a promising approach for efficiently adapting to new robot entities and tasks. Inspired by the idea that acquired knowledge can accelerate learning new tasks with the same robot and help a new robot master a trained task, we propose a latent training framework where a transferable latent-to-latent locomotion policy is pretrained alongside diverse task-specific observation encoders and action decoders. This policy in latent space processes encoded latent observations to generate latent actions to be decoded, with the potential to learn general abstract motion skills. To retain essential information for decision-making and control, we introduce a diffusion recovery module that minimizes the information reconstruction loss during the pretraining stage. During the fine-tuning stage, the pretrained latent-to-latent locomotion policy remains fixed, while only the lightweight task-specific encoder and decoder are optimized for efficient adaptation. Our method allows a robot to leverage its own prior experience across different tasks as well as the experience of other morphologically diverse robots to accelerate adaptation. We validate our approach through extensive simulations and real-world experiments, demonstrating that the pretrained latent-to-latent locomotion policy effectively generalizes to new robot entities and tasks with improved efficiency.
|
|
15:20-15:25, Paper ThCT11.5 | |
Energy-Efficient Omnidirectional Locomotion for Wheeled Quadrupeds Via Predictive Energy-Aware Nominal Gait Selection |
|
Yang, Xu | Tsinghua University |
Yang, Wei | Tsinghua University |
He, Kaibo | Tsinghua University |
Yang, Bo | Tsinghua University |
Sui, Yanan | Tsinghua University |
Mo, Yilin | Tsinghua University |
Keywords: Legged Robots, Motion Control, Reinforcement Learning
Abstract: Wheeled-legged robots combine the efficiency of wheels with the versatility of legs, but face significant energy optimization challenges when navigating diverse environments. In this work, we present a hierarchical control framework that integrates predictive power modeling with residual reinforcement learning to optimize omnidirectional locomotion efficiency for wheeled quadrupedal robots. Our approach employs a novel power prediction network that forecasts energy consumption across different gait patterns over a 1-second horizon, enabling automatic selection of the most energy-efficient augmented nominal gait. A reinforcement learning policy then generates residual adjustments to this nominal gait, fine-tuning the robot's actions for task objectives. Comparative analysis shows our method reduces energy consumption by up to 34.1% compared to existing fixed-gait approaches while maintaining velocity tracking performance. We validate our framework through extensive simulations and real-world experiments on a modified Unitree Go1 platform, demonstrating robust performance even under external disturbances. Videos and implementation details are available at https://sites.google.com/view/switching-wpg.
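The energy-aware gait selection step admits a short sketch. The code below assumes a learned callable that forecasts average power over the 1-second horizon and simply picks the gait with the lowest predicted energy; the interface, gait names, and toy power values are assumptions, and the residual RL correction is omitted.

```python
def select_nominal_gait(power_model, state, command, gait_library):
    """Hypothetical energy-aware gait selection.

    power_model(state, command, gait) is assumed to forecast average electrical
    power (W) over a 1 s horizon; the controller picks the nominal gait with
    the lowest predicted energy, and an RL policy then adds residual
    corrections on top of it.
    """
    horizon_s = 1.0
    energies = {g: power_model(state, command, g) * horizon_s for g in gait_library}
    return min(energies, key=energies.get), energies

# Toy power model: pure rolling is cheapest at low commanded speed.
fake_model = lambda s, c, g: {"roll": 25.0, "trot": 60.0, "hybrid": 40.0}[g]
best, table = select_nominal_gait(fake_model, state=None, command=0.5,
                                  gait_library=("roll", "trot", "hybrid"))
print(best, table)
```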
|
|
15:25-15:30, Paper ThCT11.6 | |
Integrating Trajectory Optimization and Reinforcement Learning for Quadrupedal Jumping with Terrain-Adaptive Landing |
|
Wang, Renjie | Westlake University |
Lyu, Shangke | Westlake University |
Lang, Xin | Westlake University |
Xiao, Wei | Westlake University |
Wang, Donglin | Westlake University |
Keywords: Legged Robots, Optimization and Optimal Control, Reinforcement Learning
Abstract: Jumping constitutes an essential component of quadruped robots' locomotion capabilities, which includes dynamic take-off and adaptive landing. Existing quadrupedal jumping studies have mainly focused on the stance and flight phases by assuming a flat landing ground, which is impractical in many real-world cases. This work proposes a safe landing framework that achieves adaptive landing on rough terrains by combining Trajectory Optimization (TO) and Reinforcement Learning (RL) together. The RL agent learns to track the reference motion generated by TO in environments with rough terrains. To enable the learning of compliant landing skills on challenging terrains, a reward relaxation strategy is synthesized to encourage exploration during the landing recovery period. Extensive experiments validate the accurate tracking and safe landing skills benefiting from our proposed method in various scenarios.
|
|
ThCT12 |
311B |
Vision-Based Navigation 3 |
Regular Session |
Co-Chair: Sun, Rongchuan | Soochow University |
|
15:00-15:05, Paper ThCT12.1 | |
CTSG: Integrating Context and Way Topology into Scene Graph for Zero-Shot Navigation |
|
Ma, Ruifei | University of Science and Technology Beijing |
Xu, Yifan | Beihang University |
Li, Yuze | Beijing University of Posts and Telecommunications |
Fang, YanPing | Beijing Shuyuan Digital City Research Center |
Yang, Zhifei | Peking University |
Qi, Jiaxing | Beihang University |
Zhao, Xinyu | Beihang University |
Zhang, Chao | Beijing Digital Native Digital City Research Center |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Task Planning
Abstract: A robust environment representation is critical for enabling robot systems to accomplish embodied navigation tasks. While offering efficient and sparse representations of environments compared to dense semantic maps, traditional 3D Scene Graphs often rely on multi-level semantic hierarchies that risk semantic discrepancies between high-level nodes and objects. Furthermore, separating semantic context from way topological relationships creates a disconnect between scene interpretation and actionable navigation strategies. To address these challenges, we propose CTSG, a hierarchical 3D scene graph mapping framework for zero-shot object navigation that supports both visual and textual queries. CTSG features a dual-layer structure: an object layer and a novel conway layer (contextual information + way topology). The conway layer integrates topological waypoints with rich multi-modal context, enhancing the continuity of environmental semantics. By aligning observations with navigation-centric perspectives, CTSG bridges the gap between scene understanding and task execution. We validate our method through simulation and real-world experiments across diverse environments, demonstrating robust performance in both visual target and language-guided navigation scenarios.
|
|
15:05-15:10, Paper ThCT12.2 | |
A Recursive Total Least Squares Solution for Bearing-Only Target Motion Analysis and Circumnavigation |
|
Li, Lin | Sun Yat-Sen University |
Liu, Xueming | Sun Yat-Sen University |
Qiu, Zhoujingzi | University of Electronic Science and Technology of China, Shenzhen |
Hu, Tianjiang | Sun Yat-Sen University |
Zhang, Qingrui | Sun Yat-Sen University |
Keywords: Surveillance Robotic Systems, Localization, Aerial Systems: Perception and Autonomy
Abstract: Bearing-only Target Motion Analysis (TMA) is a promising technique for passive tracking in various applications as a bearing angle is easy to measure. Despite its advantages, bearing-only TMA is challenging due to the nonlinearity of the bearing measurement model and the lack of range information, which impairs observability and estimator convergence. This paper addresses these issues by proposing a Recursive Total Least Squares (RTLS) method for online target localization and tracking using mobile observers. The RTLS approach, inspired by previous results on Total Least Squares (TLS), mitigates biases in position estimation and improves computational efficiency compared to pseudo-linear Kalman filter (PLKF) methods. Additionally, we propose a circumnavigation controller to enhance system observability and estimator convergence by guiding the mobile observer in orbit around the target. Extensive simulations and experiments are performed to demonstrate the effectiveness and robustness of the proposed method. The proposed algorithm is also compared with the state-of-the-art approaches, which confirms its superior performance in terms of both accuracy and stability.
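For context on the pseudo-linear formulation the RTLS estimator builds on, the sketch below shows the classic batch pseudo-linear least-squares fix for a stationary 2D target from bearing measurements; it is only the baseline formulation, not the paper's recursive, bias-mitigated TLS estimator, and the example geometry is invented.

```python
import numpy as np

def pseudo_linear_fix(observer_xy, bearings_rad):
    """Simplified batch pseudo-linear estimate of a stationary 2D target from
    bearing-only measurements. Each bearing theta_i measured from observer p_i
    gives the linear constraint
        [sin(theta_i), -cos(theta_i)] @ x = [sin(theta_i), -cos(theta_i)] @ p_i,
    so the target position x can be found by least squares.
    """
    A = np.column_stack([np.sin(bearings_rad), -np.cos(bearings_rad)])
    b = np.sum(A * observer_xy, axis=1)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# An observer circumnavigating a target at (5, 5); bearings point towards it.
obs = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
brg = np.arctan2(5.0 - obs[:, 1], 5.0 - obs[:, 0])
print(pseudo_linear_fix(obs, brg))   # close to [5, 5]
```

With noisy bearings this formulation is biased, which is exactly what total-least-squares variants such as the proposed RTLS are designed to address.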
|
|
15:10-15:15, Paper ThCT12.3 | |
OpenNav: Open-World Navigation with Multimodal Large Language Models |
|
Yuan, Mingfeng | University of Toronto |
Wang, Letian | University of Toronto |
Waslander, Steven Lake | University of Toronto |
Keywords: Vision-Based Navigation, Multi-Modal Perception for HRI, Task and Motion Planning
Abstract: Pre-trained large language models (LLMs) have demonstrated strong common-sense reasoning abilities, making them promising for robotic navigation and planning tasks. However, despite recent progress, bridging the gap between language descriptions and actual robot actions in the open world, beyond merely invoking limited predefined motion primitives, remains an open challenge. In this work, we aim to enable robots to interpret and decompose complex language instructions, ultimately synthesizing a sequence of trajectory points to complete diverse navigation tasks given open-set instructions and open-set objects. We observe that multi-modal large language models (MLLMs) exhibit strong cross-modal understanding when processing free-form language instructions, demonstrating robust scene comprehension. More importantly, leveraging their code-generation capability, MLLMs can interact with vision-language perception models to generate compositional 2D bird's-eye-view value maps, effectively integrating semantic knowledge from MLLMs with spatial information from maps to reinforce the robot’s spatial understanding. We leverage large-scale autonomous vehicle datasets (AVDs) to validate our proposed zero-shot vision-language navigation framework in outdoor navigation tasks, demonstrating its capability to execute a diverse range of free-form natural language navigation instructions while maintaining robustness against object detection errors and linguistic ambiguities. Furthermore, we validate our system on a Husky robot in both indoor and outdoor scenes, demonstrating its real-world robustness and applicability. Supplementary videos are available at https://trailab.github.io/OpenNav-website/.
|
|
15:15-15:20, Paper ThCT12.4 | |
Multimodal Point Cloud Registration Method Based on Centerline-Guided Expansion and Contraction: An Optimization Strategy Applied in Bronchial Lumen Map Building |
|
Ren, Le | Soochow University |
Yu, Tingyu | Soochow University |
Sun, Rongchuan | Soochow University |
Li, Peng | Harbin Institute of Technology ShenZhen |
Yu, Shumei | Soochow University |
Sun, Lining | Soochow University |
Keywords: Vision-Based Navigation
Abstract: In this work, a multimodal point cloud registration method using CT and video frames is proposed to optimize the modeling of the bronchial cavity environment. Preoperative CT data improve the quality of point clouds acquired from intraoperative video frames. Initially, preoperative CT scans are used to obtain bronchial point clouds and airway centerlines, while intraoperative bronchial point clouds and endoscope trajectories are captured in real time using SLAM. Given that intraoperative frame-by-frame mapping cannot be directly globally registered, multi-modal point clouds undergo local segmentation. Subsequently, the preoperative bronchial airway centerlines guide iterative scaling and adjustment of the preoperative CT point clouds, achieving precise registration between the CT and the video frame point clouds. Experimental results demonstrate a rapid and accurate enhancement in the quality of the intraoperative bronchial point cloud, providing more precise maps of the cavity environment for surgical robots. The method is validated and evaluated using CT and video frame data collected from ex vivo pig lungs, achieving an intraoperative mapping accuracy of 0.5 mm. These results surpass those of methods relying solely on SLAM for intraoperative mapping.
|
|
15:20-15:25, Paper ThCT12.5 | |
PEACE: Prompt Engineering Automation for CLIPSeg Enhancement for Safe-Landing Zone Segmentation |
|
Bong, Haechan Mark | Polytechnique Montreal |
Zhang, Rongge | Polytechnique Montreal |
Robillard, Antoine | Polytechnique Montreal |
Beltrame, Giovanni | Ecole Polytechnique De Montreal |
Keywords: Vision-Based Navigation, Visual Servoing, Aerial Systems: Perception and Autonomy
Abstract: Safe landing is essential in robotics applications, from industrial settings to space exploration. As artificial intelligence advances, we have developed PEACE (Prompt Engineering Automation for CLIPSeg Enhancement), a system that automatically generates and refines prompts for identifying landing zones in changing environments. Traditional approaches using fixed prompts for open-vocabulary models struggle with environmental changes and can lead to dangerous outcomes when conditions are not represented in the predefined prompts. PEACE addresses this limitation by dynamically adapting to shifting data distributions. Our key innovation is the dual segmentation of safe and unsafe landing zones, allowing the system to refine the results by removing unsafe areas from potential landing sites. Using only monocular cameras and image segmentation, PEACE can safely guide descent operations from 100 meters to altitudes as low as 20 meters. Testing shows that PEACE significantly outperforms the standard CLIP and CLIPSeg prompting methods, improving the successful identification of safe landing zones from 57% to 92%. We have also demonstrated enhanced performance when replacing CLIPSeg with FastSAM. The complete source code is available as open-source software.
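The dual-segmentation refinement is essentially a mask subtraction. The sketch below assumes boolean safe/unsafe masks (for example, thresholded CLIPSeg or FastSAM outputs) and a hypothetical minimum-area post-filter; names and the threshold are not from the paper.

```python
import numpy as np

def safe_landing_mask(safe_mask, unsafe_mask, min_area_px=500):
    """Minimal sketch of the dual-segmentation idea: candidate landing zones
    are the pixels segmented as "safe" minus anything also segmented as
    "unsafe" (people, water, vehicles, ...). Masks are boolean HxW arrays;
    min_area_px is an assumed post-filter, not a value from the paper.
    """
    candidate = safe_mask & ~unsafe_mask
    return candidate if candidate.sum() >= min_area_px else np.zeros_like(candidate)

safe = np.zeros((64, 64), dtype=bool); safe[8:56, 8:56] = True
unsafe = np.zeros((64, 64), dtype=bool); unsafe[20:40, 20:40] = True
print(safe_landing_mask(safe, unsafe).sum(), "candidate pixels")
```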
|
|
15:25-15:30, Paper ThCT12.6 | |
Autonomous UAV Control for Maritime Applications Using Deep Reinforcement Learning-Based Image Optimisation |
|
Yang, Yuanqing | King's College London |
Spyrakos-Papastavridis, Emmanouil | King's College London |
Wang, Mingfeng | Brunel University London |
Deng, Yansha | King's College London |
Keywords: Surveillance Robotic Systems, Deep Learning Methods, Aerial Systems: Perception and Autonomy
Abstract: In this paper, we present an autonomous control system for Unmanned Aerial Vehicles (UAVs), specifically designed to inspect a detected suspicious vessel and capture information-rich images in a maritime environment. The maritime environment is ever-changing and uncertain, making it challenging to perform maritime monitoring tasks efficiently and reliably. The proposed UAV control system consists of multiple modules, including path planning, vessel searching, image processing, and image optimization. A novel image optimization approach utilizing deep reinforcement learning (DRL) is proposed to enhance the quality of the captured images by jointly controlling the movement of the UAV and camera orientation. The effectiveness and efficiency of the proposed system were validated and evaluated by searching the vessel and optimizing the captured images in the self-developed simulation environment in Gazebo.
|
|
15:30-15:35, Paper ThCT12.7 | |
Self-Supervised Monocular Visual Drone Model Identification through Improved Occlusion Handling |
|
Bahnam, Stavrow Abdulmasih | Delft University of Technology |
De Wagter, Christophe | Delft University of Technology |
de Croon, Guido | TU Delft |
Keywords: Visual Learning, Dynamics, Deep Learning for Visual Perception
Abstract: Ego-motion estimation is vital for drones when flying in GPS-denied environments. Vision-based methods struggle when flight speed increases and close-by objects lead to difficult visual conditions with considerable motion blur and large occlusions. To tackle this, vision is typically complemented by state estimation filters that combine a drone model with inertial measurements. However, these drone models are currently learned in a supervised manner with ground-truth data from external motion capture systems, limiting scalability to different environments and drones. In this work, we propose a self-supervised learning scheme to train a neural-network-based drone model using only onboard monocular video and flight controller data (IMU and motor feedback). We achieve this by first training a self-supervised relative pose estimation model, which then serves as a teacher for the drone model. To allow this to work at high speed close to obstacles, we propose an improved occlusion handling method for training self-supervised pose estimation models. Due to this method, the root mean squared error of resulting odometry estimates is reduced by an average of 15%. Moreover, the student neural drone model can be successfully obtained from the onboard data. It even becomes more accurate at higher speeds compared to its teacher, the self-supervised vision-based model. We demonstrate the value of the neural drone model by integrating it into a traditional filter-based VIO system (ROVIO), resulting in superior odometry accuracy on aggressive 3D racing trajectories near obstacles. Self-supervised learning of ego-motion estimation represents a significant step toward bridging the gap between flying in controlled, expensive lab environments and real-world drone applications. The fusion of vision and drone models will enable higher-speed flight and improve state estimation on any drone in any environment.
|
|
15:35-15:40, Paper ThCT12.8 | |
FEG-VON: Frontier Embedding Graph for Efficient Visual Object Navigation |
|
Dai, Yingru | Tsinghua University |
Xie, Pengwei | Tsinghua University |
Liu, Yikai | Tsinghua University |
Chen, Siang | Tsinghua University |
Yang, Wenming | Tsinghua University |
Wang, Guijin | Tsinghua University |
Keywords: Vision-Based Navigation, RGB-D Perception, Semantic Scene Understanding
Abstract: Visual object navigation, requiring agents to locate target objects in novel environments through egocentric visual observation, remains a critical challenge in Embodied AI. We propose FEG-VON, a training-free framework that constructs and maintains a Frontier Embedding Graph for efficient Visual Object Navigation. The graph initializes frontier embeddings using Vision Language Models (VLMs), where visual observations are encoded into spatially anchored semantic embeddings through cross-modal alignment with target text descriptors. We then update the graph by aggregating spatio-temporal semantic relations across frontiers, enabling online adaptation to new targets via similarity scoring without remapping. The evaluation results in public benchmarks demonstrate the superior performance of FEG-VON in both single- and multi-object navigation tasks compared with state-of-the-art methods. Crucially, FEG-VON eliminates dependency on task-specific training for exploration and advances the feasibility of zero-shot navigation in open-world environments.
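The similarity-scoring step that lets the graph be reused for new targets without remapping can be illustrated compactly. The sketch below assumes each frontier node stores a VLM image embedding and ranks frontiers by cosine similarity with a target text embedding; the function name and toy vectors are illustrative assumptions.

```python
import numpy as np

def rank_frontiers(frontier_embeddings, text_embedding):
    """Hypothetical scoring step: each frontier node stores a VLM image
    embedding; the next frontier to visit is the one whose embedding has the
    highest cosine similarity with the target's text embedding. Because the
    graph persists across the episode, a new target only requires re-scoring,
    not re-mapping.
    """
    F = np.asarray(frontier_embeddings, dtype=float)
    t = np.asarray(text_embedding, dtype=float)
    sims = (F @ t) / (np.linalg.norm(F, axis=1) * np.linalg.norm(t) + 1e-8)
    return np.argsort(-sims), sims

order, scores = rank_frontiers([[0.9, 0.1], [0.2, 0.8], [0.6, 0.6]], [0.0, 1.0])
print(order, scores)   # frontier 1 is the most promising for this target
```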
|
|
ThCT13 |
311C |
Deep Learning for Visual Perception 11 |
Regular Session |
|
15:00-15:05, Paper ThCT13.1 | |
Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation |
|
Kim, Jaeyeul | DGIST |
Woo, Jungwan | DGIST |
Shin, Ukcheol | CMU(Carnegie Mellon University) |
Oh, Jean | Carnegie Mellon University |
Im, Sunghoon | DGIST |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Recognition
Abstract: Understanding the motion states of the surrounding environment is critical for safe autonomous driving. These motion states can be accurately derived from scene flow, which captures the three-dimensional motion field of points. Existing LiDAR scene flow methods extract spatial features from each point cloud and then fuse them channel-wise, resulting in the implicit extraction of spatio-temporal features. Furthermore, they utilize 2D Bird's Eye View and process only two frames, missing crucial spatial information along the Z-axis and the broader temporal context, leading to suboptimal performance. To address these limitations, we propose Flow4D, which temporally fuses multiple point clouds after the 3D intra-voxel feature encoder, enabling more explicit extraction of spatio-temporal features through a 4D voxel network. However, while using 4D convolution improves performance, it significantly increases the computational load. For further efficiency, we introduce the Spatio-Temporal Decomposition Block (STDB), which combines 3D and 1D convolutions instead of using heavy 4D convolution. In addition, Flow4D further improves performance by using five frames to take advantage of richer temporal information. As a result, the proposed method achieves a 45.9% higher performance compared to the state-of-the-art while running in real-time, and won 1st place in the 2024 Argoverse 2 Scene Flow Challenge.
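The spatio-temporal decomposition idea can be sketched in PyTorch. The block below replaces a dense 4D convolution over (x, y, z, t) with a 3D convolution over space followed by a 1D convolution over time; channel counts, kernel sizes, and the class name are illustrative assumptions rather than the paper's configuration, and the sparse voxel machinery is omitted.

```python
import torch
import torch.nn as nn

class STDBlock(nn.Module):
    """Rough sketch of a spatio-temporal decomposition block: instead of a
    heavy 4D convolution, apply a 3D convolution over the spatial dims and a
    1D convolution over the temporal dim."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, C, T, X, Y, Z)
        b, c, t, X, Y, Z = x.shape
        s = x.permute(0, 2, 1, 3, 4, 5).reshape(b * t, c, X, Y, Z)
        s = self.act(self.spatial(s)).reshape(b, t, c, X, Y, Z)   # 3D over space
        s = s.permute(0, 3, 4, 5, 2, 1).reshape(b * X * Y * Z, c, t)
        s = self.act(self.temporal(s))                            # 1D over time
        return s.reshape(b, X, Y, Z, c, t).permute(0, 4, 5, 1, 2, 3)

print(STDBlock(8)(torch.randn(1, 8, 5, 6, 6, 4)).shape)  # (1, 8, 5, 6, 6, 4)
```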
|
|
15:05-15:10, Paper ThCT13.2 | |
Towards Robust Keypoint Detection and Tracking: A Fusion Approach with Event-Aligned Image Features |
|
Wang, Xiangyuan | Wuhan University |
Yu, Huai | Wuhan University |
Yu, Lei | Wuhan University |
Yang, Wen | Wuhan University |
Xia, Gui-Song | Wuhan University |
Keywords: Deep Learning for Visual Perception, Visual Tracking
Abstract: Robust keypoint detection and tracking are crucial for various robotic tasks. However, conventional cameras struggle under rapid motion and lighting changes, hindering local and edge feature extraction essential for keypoint detection and tracking. Event cameras offer advantages in such scenarios due to their high dynamic range and low latency. Yet, their inherent noise and motion dependence can lead to feature instability. This paper presents a novel image-event fusion approach for robust keypoint detection and tracking under challenging conditions. We leverage the complementary strengths of image and event data by introducing: (i) the Implicit Compensation Module and Temporal Alignment Module for high-frequency feature fusion and keypoint detection; and (ii) a temporal neighborhood matching strategy for robust keypoint tracking within a sliding window. Furthermore, a self-supervised temporal response consistency constraint ensures keypoint continuity and stability. Extensive experiments demonstrate the effectiveness of our method against state-of-the-art approaches under diverse challenging scenarios. In particular, our method exhibits the longest tracking lifetime and strong generalization ability on real-world data. The codes and pre-trained models are available at https://github.com/yuyangpoi/FF-KDT.
|
|
15:10-15:15, Paper ThCT13.3 | |
Efficient and Hardware-Friendly Online Adaptation for Deep Stereo Depth Estimation on Embedded Robots |
|
Xu, Yuanfan | Tsinghua University |
Chen, Shuaiwen | Tsinghua University |
Yang, Xinting | Tsinghua University |
Xiang, Yunfei | Tsinghua University |
Yu, Jincheng | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Wang, Jian | Tsinghua University |
Wang, Yu | Tsinghua University |
Keywords: Deep Learning for Visual Perception, Embedded Systems for Robotic and Automation, Continual Learning
Abstract: Accurate and real-time stereo depth estimation is important for autonomous robots, such as autonomous aerial vehicles (AAVs). Due to the computation constraints of these miniaturized robots, current state-of-the-art algorithms deploy light-weight neural networks while using self-supervised online adaptation to compensate for the lack of generalization. However, the traditional online training approach introduces 2x extra computation overhead, resulting in the failure to meet real-time requirements. Existing efficient training algorithms are primarily designed for train-from-scratch scenarios rather than online training, and involve complicated data quantization methods and non-standard operations, making them highly unfriendly to deployment on robots equipped with embedded GPUs or neural processing units (NPUs). Therefore, this paper aims to improve the online adaptation for deep stereo at the system level from both hardware and software aspects, and proposes a novel online adaptation method, which is robust, computationally efficient, and hardware-friendly. First, we adopt an 8-bit quantized training strategy to maximize the performance of typical embedded computing platforms. Considering the streaming input of data during deployment, we design an online calibration method for quantized self-adaptive deep stereo. Then we only update the bias of the convolutional layers and design a plug-in layer with negligible computational cost to enhance the adaptation effect. Meanwhile, this layer is inherently compatible with existing GPUs and NPUs. Our final deep stereo system speeds up the inference and adaptation by 2.11x, which can process 640x360 resolution images at 11.1 FPS on the NVIDIA Jetson Orin NX, and obtains estimation accuracy comparable to current adaptation methods. When deployed on the Horizon Journey-5 Chip, it can further achieve a 10x speedup over the Orin NX.
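The bias-only update idea is simple to illustrate in PyTorch. The sketch below freezes all weights and leaves only convolutional biases trainable, so each online adaptation step back-propagates into a tiny fraction of the parameters; the 8-bit quantized training, online calibration, and plug-in layer from the paper are not reproduced, and the toy network is an assumption.

```python
import torch
import torch.nn as nn

def enable_bias_only_adaptation(model: nn.Module):
    """Sketch: freeze all parameters, then re-enable gradients only for the
    biases of convolutional layers, which is the subset updated online."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.Conv2d) and m.bias is not None:
            m.bias.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1))
trainable = enable_bias_only_adaptation(net)
print(sum(p.numel() for p in trainable), "of",
      sum(p.numel() for p in net.parameters()), "parameters adapted online")
```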
|
|
15:15-15:20, Paper ThCT13.4 | |
USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways |
|
Yao, Shanliang | YCIT |
Guan, Runwei | University of Liverpool |
Ni, Yi | Xi'an Jiaotong-Liverpool University |
Sen, Xu | Yancheng Institute of Technology |
Yue, Yong | Xi'an Jiaotong-Liverpool University |
Zhu, Xiaohui | Xi'an Jiaotong-Liverpool University |
Liu, Ryan Wen | Wuhan University of Technology |
Keywords: Data Sets for Robotic Vision, Sensor Fusion, Marine Robotics
Abstract: Object tracking in inland waterways plays a crucial role in safe and cost-effective applications, including waterborne transportation, sightseeing tours, environmental monitoring and surface rescue. Our Unmanned Surface Vehicle (USV), equipped with a 4D radar, a monocular camera, a GPS, and an IMU, delivers robust tracking capabilities in complex waterborne environments. By leveraging these sensors, our USV collected comprehensive object tracking data, which we present as USVTrack, the first 4D radar-camera tracking dataset tailored for autonomous driving in new-generation waterborne transportation systems. Our USVTrack dataset presents rich scenarios, featuring diverse waterways, varying times of day, and multiple weather and lighting conditions. Moreover, we present a simple but effective radar-camera matching method, termed RCM, which can be plugged into popular two-stage association trackers. Experimental results utilizing RCM demonstrate the effectiveness of the radar-camera matching in improving object tracking accuracy and reliability for autonomous driving in waterborne environments. The USVTrack dataset is publicly available at https://usvtrack.github.io
|
|
15:20-15:25, Paper ThCT13.5 | |
The Impact of VR and 2D Interfaces on Human Feedback in Preference-Based Robot Learning |
|
de Heuvel, Jorge | University of Bonn |
Marta, Daniel | KTH Royal Institute of Technology |
Holk, Simon | The University of Tokyo |
Leite, Iolanda | KTH Royal Institute of Technology |
Bennewitz, Maren | University of Bonn |
Keywords: Virtual Reality and Interfaces, Data Sets for Robot Learning
Abstract: Aligning robot navigation with human preferences is essential for ensuring comfortable and predictable robot movement in shared spaces. While preference-based learning methods, such as reinforcement learning from human feedback (RLHF), enable this alignment, the choice of the preference collection interface may influence the process. Traditional 2D interfaces provide structured views but lack spatial depth, whereas immersive VR offers richer perception, potentially affecting preference articulation. This study systematically examines how the interface modality impacts human preference collection and navigation policy alignment. We introduce a novel dataset of 2,325 human preference queries collected through both VR and 2D interfaces, revealing significant differences in user experience, preference consistency, and policy outcomes. Our findings highlight the trade-offs between immersion, perception, and preference reliability, emphasizing the importance of interface selection in preference-based robot learning. The dataset is available to support future research.
|
|
15:25-15:30, Paper ThCT13.6 | |
Adaptive Manipulation Using Behavior Trees |
|
Cloete, Jacques | University of Oxford |
Merkt, Wolfgang Xaver | University of Oxford |
Havoutis, Ioannis | University of Oxford |
Keywords: Behavior-Based Systems, Reactive and Sensor-Based Planning, Robot Safety
Abstract: Many manipulation tasks pose a challenge since they depend on non-visual environmental information that can only be determined after sustained physical interaction has already begun. This is particularly relevant for effort-sensitive, dynamics-dependent tasks such as tightening a valve. To perform these tasks safely and reliably, robots must be able to quickly adapt in response to unexpected changes during task execution, and should also learn from past experience to better inform future decisions. Humans can intuitively respond and adapt their manipulation strategy to suit such problems, but representing and implementing such behaviors for robots remains a challenge. In this work we show how this can be achieved within the framework of behavior trees. We present the adaptive behavior tree, a scalable and generalizable behavior tree design that enables a robot to quickly adapt to and learn from both visual and non-visual observations during task execution, preempting task failure or switching to a different manipulation strategy. The adaptive behavior tree selects the manipulation strategy that is predicted to optimize task performance, and learns from past experience to improve these predictions for future attempts. We test our approach on a variety of tasks commonly found in industry; the adaptive behavior tree demonstrates safety, robustness (100% success rate) and efficiency in task completion (up to 36% task speedup from the baseline).
|
|
15:30-15:35, Paper ThCT13.7 | |
Human-In-The-Loop Learning for Adaptive Robot Manipulation Using Large Language Models and Behavior Trees |
|
Zhou, Haotian | Wuhan University of Science and Technology |
Lin, Yunhan | Wuhan University of Science and Technology |
Yan, Longwu | Wuhan University of Science and Technology |
Min, Huasong | Robotics Institute of Beihang University of China |
Keywords: Behavior-Based Systems, AI-Enabled Robotics, Autonomous Agents
Abstract: Large Language Models (LLMs) are now transforming the way robots learn to work in unpredictable environments, such as homes or small enterprises. A growing number of approaches are combining LLMs with Behavior Trees (BTs). Not only do user commands need to be interpreted into BTs that contain the task's goal, but external disturbances also need to be handled during the process when BT planners dynamically expand BTs based on action databases. However, in these approaches, the action database is manually pre-built and requires the capability for incremental learning and expansion. To address this issue, we propose a human-in-the-loop learning mechanism. First, we design a context for the LLM and then use it to generate action knowledge through in-context learning. In addition, we introduce human-in-the-loop. User feedback is utilized to guide the LLM to correct and refine the action knowledge, ensuring its accuracy and safety. Finally, the generated action knowledge can be directly used for adaptive manipulation without the need for knowledge transfer effort, enabling the robot to complete tasks and handle external disturbances. Experiments across various tasks are conducted and the experimental results validate our method.
|
|
ThCT14 |
311D |
Medical Robots and Systems 7 |
Regular Session |
|
15:00-15:05, Paper ThCT14.1 | |
Development of a Miniature 5-DOF Modularized Flexible Instrument with Distal Rotation Capability for Dual-Armed Upper Gastrointestinal Endoscopic Robots (I) |
|
Song, Dezhi | Tianjin University |
Yu, Xiangyang | Tianjin Hospital of ITCWM/Tianjin Nankai Hospital |
Leng, Bohan | Tianjin University |
Zhang, Bo | Waseda University |
Shi, Chaoyang | Tianjin University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Soft Robot Applications
Abstract: Clinical endoluminal flexible instruments exhibit limited degrees of freedom (DOFs) and operational dexterity, making reorientation or repositioning challenging during complex flexible endoscopic procedures, particularly in endoscopic submucosal dissection (ESD). This work presents a novel 5-DOF miniature flexible instrument for dual-armed upper gastrointestinal endoscopic robots with a diameter of only 2.6 mm. It employs a continuum bending section featuring interlocking discrete joints and superelastic NiTi driving rods to achieve high dexterity, accurate positioning with inconspicuous motion hysteresis, and sufficient loading/clamping capacities. The design enables 360° unrestricted distal independent rotation of the forceps, unlike the entire rotation of typical flexible instruments. The central forceps movement slightly impacts the overall positioning accuracy of the instrument with a crosstalk error of less than 1 mm. Kinematic parameter calibration improves the instrument’s motion accuracy, achieving an average distal positioning error of 1.56 mm within ±100° in the 2D plane and 2.29 mm in 3D space. The distal loading stiffness is 0.29 N/mm and the clamping force exceeds 0.8 N, providing adequate tissue interaction forces during ESD procedures. The modularized instrument facilitates the seamless integration of diverse surgical tools, ultimately establishing a dual-armed endoscopic robot. Ex-vivo experiments have been performed to verify its effectiveness in ESD procedures.
|
|
15:05-15:10, Paper ThCT14.2 | |
Sensor-Free Strategy for Estimating Guidewire/Catheter Shape and Contact Force in Endovascular Interventions |
|
Li, Naner | Huazhong University of Science and Technology |
Wang, Yiwei | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: Accurate assessment of guidewire shape and contact forces is critical for autonomous robotic endovascular procedures. However, existing sensor-based approaches often require modifications to standard guidewires or the use of custom-made alternatives, which can hinder integration into conventional surgical workflows and increase costs. Moreover, the sensor-based method can only obtain partial force information. This letter aimed to develop a novel sensor-free two-step computational method for estimating overall guidewire shape and forces using only routinely obtainable information. The vascular space is discretized into multiple mesh points along the centerline to form a graph. Based on an energy equation, the lowest energy path between the start point (the insertion position) and the endpoint (the guidewire tip position) is searched as the initial shape of the guidewire. This initial shape, along with the known insertion length, is then input into a finite element model to compute the final guidewire configuration and contact forces. The method was validated through 3 rounds of testing in 3 phantom models at 4 different insertion lengths. In 11 successful experimental scenarios, the estimated guidewire shapes closely matched the actual shapes, with an average root mean square error of 0.50 ± 0.12 mm. The contact force estimation achieved an average accuracy of 91.9 ± 2.9%, with an average angular deviation of 1.94 ± 1.03° from measured values.
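The first step, searching the lowest-energy path on a graph of points inside the vessel, can be sketched with a standard shortest-path routine. In the snippet below the edge "energy" is simply Euclidean length, whereas the paper's energy equation also accounts for bending; the graph, node coordinates, and function name are illustrative assumptions.

```python
import heapq

def lowest_energy_path(nodes, edges, start, goal):
    """Minimal sketch: treat sampled points inside the vessel as graph nodes
    and search for the minimum-energy path between the insertion point and the
    guidewire tip. Here the edge energy is just its length; a bending penalty
    would be added to `c` below."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(nodes[a], nodes[b])) ** 0.5

    best, queue = {start: 0.0}, [(0.0, start, [start])]
    while queue:
        cost, u, path = heapq.heappop(queue)
        if u == goal:
            return cost, path
        for v in edges.get(u, ()):
            c = cost + dist(u, v)
            if c < best.get(v, float("inf")):
                best[v] = c
                heapq.heappush(queue, (c, v, path + [v]))
    return float("inf"), []

nodes = {0: (0, 0, 0), 1: (5, 1, 0), 2: (5, -4, 0), 3: (10, 0, 0)}
edges = {0: [1, 2], 1: [3], 2: [3]}
print(lowest_energy_path(nodes, edges, 0, 3))   # prefers the straighter branch
```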
|
|
15:10-15:15, Paper ThCT14.3 | |
Towards Design and Development of a Concentric Tube Steerable Drilling Robot for Creating S-Shape Tunnels for Pelvic Fixation Procedures |
|
Kulkarni, Yash | The University of Texas at Austin |
Sharma, Susheela | Vanderbilt University |
Go, Sarah | University of Texas at Austin |
Amadio, Jordan P. | University of Texas Dell Medical School |
Khadem, Mohsen | University of Edinburgh |
Alambeigi, Farshid | University of Texas at Austin |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: Current pelvic fixation techniques rely on rigid drilling tools, which inherently constrain the placement of rigid medical screws in the complex anatomy of the pelvis. These constraints prevent medical screws from following anatomically optimal pathways and force clinicians to fixate screws in linear trajectories. This suboptimal approach, combined with the unnatural placement of the excessively long screws, leads to complications such as screw misplacement, extended surgery times, and increased radiation exposure due to repeated X-ray images taken to ensure the safety of the procedure. To address these challenges, in this paper, we present the design and development of a unique 4-degree-of-freedom (DoF) pelvic concentric tube steerable drilling robot (pelvic CT-SDR). The pelvic CT-SDR is capable of creating long S-shaped drilling trajectories that follow the natural curvatures of the pelvic anatomy. The performance of the pelvic CT-SDR was thoroughly evaluated through several S-shape drilling experiments in simulated bone phantoms.
|
|
15:15-15:20, Paper ThCT14.4 | |
Towards Fully Autonomous Robotic Ultrasound-Guided Biopsy for Superficial Organs |
|
Wang, Chenwei | The Chinese University of Hong Kong, Shenzhen |
Qian, Cheng | The Chinese University of Hong Kong, Shenzhen |
Gao, Qian | The Chinese University of Hong Kong, Shenzhen |
Ji, Xiaoqiang | The Chinese University of Hong Kong, Shenzhen |
Sun, Zhenglong | Chinese University of Hong Kong, Shenzhen |
Keywords: Medical Robots and Systems
Abstract: Ultrasound-guided therapeutic procedures rely heavily on operator skill, leading to variability and high training costs. The shortage of trained ultrasonographers further exacerbates the issue, increasing workloads and associated health risks. Robotic technology has the potential to effectively tackle these issues, yet there has been limited research on fully autonomous robotic ultrasound-guided biopsy systems based on the entire workflow. To address this challenge, this paper presents an autonomous robotic operative framework for superficial organ biopsy. The system integrates real-time slice-to-volume registration and navigation, along with a needle insertion mechanism, following operational protocols to autonomously perform the entire biopsy procedure. The feasibility, robustness, and generalizability of the system are validated through experimental studies.
|
|
15:20-15:25, Paper ThCT14.5 | |
Autonomous Surface Selection for Manipulator-Based UV Disinfection in Hospitals Using Foundation Models |
|
Oh, Xueyan | Singapore University of Technology and Design |
Her, Jonathan | Singapore University of Technology and Design |
Ong, Zhi Xiang | Singapore University of Technology and Design |
Koh, Brandon | Centre for Healthcare Assistive & Robotics Technology (CHART) |
Tan, Yun Hann | National Center for Infectious Diseases, Singapore |
Tan, U-Xuan | Singapore University of Technology and Design |
Keywords: Medical Robots and Systems
Abstract: Ultraviolet (UV) germicidal radiation is an established non-contact method for surface disinfection in medical environments. Traditional approaches require substantial human intervention to define disinfection areas, complicating automation, while deep learning-based methods often need extensive fine-tuning and large datasets, which can be impractical for large-scale deployment. Additionally, these methods often do not address scene understanding for partial surface disinfection, which is crucial for avoiding unintended UV exposure. We propose a solution that leverages foundation models to simplify surface selection for manipulator-based UV disinfection, reducing human involvement and removing the need for model training. Additionally, we propose a VLM-assisted segmentation refinement to detect and exclude thin and small non-target objects, showing that this reduces mis-segmentation errors. Our approach achieves over 92% success rate in correctly segmenting target and non-target surfaces, and real-world experiments with a manipulator and simulated UV light demonstrate its practical potential for real-world applications.
|
|
15:25-15:30, Paper ThCT14.6 | |
A Novel Integrated Mechanism with Dual RCM-Constraints Toward Robotic Transperineal Prostate Biopsy (I) |
|
Luo, Xiao | The Chinese University of Hong Kong |
Lei, Man Cheong | The Chinese University of Hong Kong |
Xian, Yitian | The Chinese University of Hong Kong |
Hu, Yingbai | Technische Universität München |
Zou, Limin | The Chinese University of Hong Kong |
Xie, Ke | The Chinese University of Hong Kong |
Chiu, Peter Ka Fung | The Chinese University of Hong Kong |
Li, Zheng | The Chinese University of Hong Kong |
Keywords: Mechanism Design, Medical Robots and Systems, Task and Motion Planning
Abstract: Transperineal prostate biopsy (TPPB), a common procedure for clinically detecting prostate cancer, requires constant observation of the needle in the sagittal image plane of the transrectal ultrasound (TRUS) probe, shared needle puncture site and avoiding the obstruction of the pubic bone. In this article, we develop a novel TPPB robot with 9 DoFs that can place the needle guide and the TRUS probe to the desired position with dual integrated remote center of motion (RCM) constraints simultaneously before manual needle insertion. Based on a unique compact serial-parallel hybrid mechanism (dimensions of the core mechanism: 33 cm × 20 cm × 33 cm), the needle can access interest areas via two needle-RCM sites, and the probe moves with the probe-RCM constraint while keeping its sagittal plane coplanar with the needle. For the new mechanism, kinematic analysis is derived and full prostate coverage ability with dual RCM constraints is demonstrated. To demonstrate the RCM locations’ programmability and system flexibility, a three-stage path-planning method considering avoiding obstacles is proposed. Experiments have shown that the trajectory tracking error with the probe-RCM is within 0.58◦ and 0.65 mm, and the needle-RCM accuracy remains within 0.69 mm. A phantom study validates the feasibility of the proposed system and its path-planning method.
|
|
15:30-15:35, Paper ThCT14.7 | |
Vibration-Based Energy Metric for Restoring Needle Alignment in Autonomous Robotic Ultrasound |
|
Chen, Zhongyu | Multi-Scale Medical Robotics Center |
Li, Chenyang | National Center for Tumor Diseases |
Li, Xuesong | Technical University of Munich |
Huang, Dianye | Technical University of Munich |
Jiang, Zhongliang | Technical University of Munich |
Speidel, Stefanie | National Center for Tumor Diseases |
Chu, Xiangyu | The Chinese University of Hong Kong |
Au, K. W. Samuel | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems
Abstract: Precise needle alignment is essential for percutaneous needle insertion in robotic ultrasound-guided procedures. However, inherent challenges such as speckle noise, needle-like artifacts, and low image resolution complicate robust needle detection, which is essential for alignment in ultrasound images. These issues become particularly problematic when visibility is reduced or lost, diminishing the effectiveness of visual-based needle alignment methods. In this paper, we propose a method to effectively restore alignment when the ultrasound imaging plane and the needle insertion plane are misaligned. Unlike many existing approaches that rely heavily on needle visibility in ultrasound images, our method uses a more robust feature by periodically vibrating the needle using a mechanical system. Specifically, we propose a new vibration-based energy metric that remains effective even when the needle is fully out of plane. Using this metric, we develop an elegant control strategy to reposition the ultrasound probe in response to misalignments between the imaging plane and the needle insertion plane in both translation and rotation. Experiments conducted on ex-vivo porcine tissue samples using a dual-arm robotic ultrasound-guided needle insertion system demonstrate the effectiveness of the proposed approach. The experimental results show a translational error of 0.41±0.27 mm and a rotational error of 0.51±0.19 degrees.
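One generic way to turn a known vibration frequency into an energy metric is to measure spectral power in a narrow band around that frequency. The sketch below assumes a per-frame scalar signal extracted from the ultrasound stream (for example, mean intensity in a region of interest) and an assumed frame rate and vibration frequency; how the paper actually extracts and normalises its metric is not reproduced here.

```python
import numpy as np

def vibration_band_energy(signal, fs, f_vib, half_band=1.0):
    """Illustrative band-energy metric: fraction of the signal's spectral
    power concentrated in a narrow band around the vibration frequency."""
    x = np.asarray(signal, dtype=float) - np.mean(signal)
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_vib - half_band) & (freqs <= f_vib + half_band)
    return spec[band].sum() / max(spec.sum(), 1e-12)

fs, f_vib = 60.0, 8.0                      # assumed frame rate and vibration freq
t = np.arange(0, 2.0, 1.0 / fs)
aligned = 0.8 * np.sin(2 * np.pi * f_vib * t) + 0.1 * np.random.randn(t.size)
misaligned = 0.05 * np.sin(2 * np.pi * f_vib * t) + 0.1 * np.random.randn(t.size)
print(vibration_band_energy(aligned, fs, f_vib), ">",
      vibration_band_energy(misaligned, fs, f_vib))
```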
|
|
15:35-15:40, Paper ThCT14.8 | |
Inchworm-Like Biomimetic Magnetic-Driven Robotic Shell for Capsule Endoscope in a Tubular Environment (I) |
|
Yu, Xinkai | Harbin Institute of Technology (Shenzhen) |
Wang, Jiaole | Harbin Institute of Technology, Shenzhen |
Su, Jingran | Department of Gastroenterology, Qilu Hospital of Shandong University |
Song, Shuang | Harbin Institute of Technology (Shenzhen) |
Keywords: Medical Robots and Systems
Abstract: As a vital tool for diagnosing intestinal diseases, wireless capsule endoscopy traditionally relies on passive movement through intestinal peristalsis. To address the limitations of passive locomotion, we introduce a novel inchworm-like biomimetic magnetic-driven robotic shell that enhances active locomotion capabilities within complex tubular environments. This innovative design employs a magnetic torsion spring for dynamic extension and contraction movements, which can be controlled via an external magnetic field. In addition, the shell is equipped with flexible bristles, facilitating a differential friction effect that significantly augments its inchworm-like crawling motion. Force and torque analyses have been conducted in detail to optimize the design and functionality. A prototype with a diameter of 16 mm and a length of 31.3 mm was developed and integrated with a commercial capsule endoscope for rigorous testing. Experimental evaluations were conducted across various setups, including phantoms, in vitro porcine intestines, and an in vivo intestinal experiment. The results demonstrated that the shell is capable of effectively passing through tubular environments. The capsule robot achieved an average speed of 2.63 mm/s in the in vitro intestinal experiment and 2.7 mm/s in the in vivo intestinal experiment.
|
|
ThCT15 |
206 |
Telerobotics and Teleoperation 3 |
Regular Session |
Chair: Aoyama, Tadayoshi | Nagoya University |
|
15:00-15:05, Paper ThCT15.1 | |
A Two-Stage Dynamic Parameters Identification Approach Based on Kalman Filter for Haptic Rendering in Telerobotics |
|
Wang, Ruize | Zhejiang University |
Cheng, Peng | Zhejiang University |
Ye, Qi | Zhejiang University |
Chen, Jiming | Zhejiang University |
Li, Gaofeng | Zhejiang University |
Keywords: Telerobotics and Teleoperation, Physical Human-Robot Interaction, Calibration and Identification
Abstract: In teleoperated physical Human-Robot Interaction (pHRI), accurate contact force feedback is essential for operators to experience intuitive and responsive interaction. However, the presence of end-effectors on the manipulator often interferes with the force feedback, misrepresenting the real contact forces. To render a precise force to operators, it is critical to eliminate the non-contact forces arising from the end-effector tool’s inertia. To achieve this, fast and accurate identification of the end-effector's dynamic parameters is essential. Currently, existing dynamic identification algorithms process all collected data in one stage, in which the sensor noises for static and dynamic parameters are coupled, resulting in poor performance. In this letter, we propose a two-stage method for identifying the end-effector's inertial parameters. Our proposed method breaks the identification process into two steps: the static and the dynamic processing stages, which can make full use of data features and significantly improve the accuracy of mass and Center-of-Mass (COM) identification. Additionally, we introduce the Kalman Filter (KF) into the linear fitting to improve the identification speed. Comparative and ablation experiments are designed to validate the effectiveness of the proposed method. The results demonstrate that our method outperforms the RTLS method in reducing identification errors by 96% for mass and 81.2%–87.6% for COM. As for the haptic rendering, the errors are also reduced by 36.3% in the direction of interaction and 99.5% in the direction of gravity after the elimination of the end-effector’s inertia.
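Because the inertial parameters enter the wrench model linearly, the "Kalman filter in the linear fitting" step can be illustrated with a standard recursive linear estimator. The sketch below is a generic recursive update of that kind with a one-parameter toy example; the regressor construction, the static/dynamic split, and the tuning values are not taken from the paper.

```python
import numpy as np

class KalmanLinearFit:
    """Sketch of recursive estimation of linearly parameterised quantities:
    with y = A(q) theta + noise, the parameters theta are refined one sample
    at a time instead of via batch least squares."""
    def __init__(self, n_params, p0=1e3, r=1e-2):
        self.theta = np.zeros(n_params)
        self.P = np.eye(n_params) * p0     # parameter covariance
        self.r = r                         # assumed measurement noise variance

    def update(self, a_row, y):
        a = np.asarray(a_row, dtype=float)
        s = a @ self.P @ a + self.r        # innovation variance
        k = self.P @ a / s                 # gain
        self.theta = self.theta + k * (y - a @ self.theta)
        self.P = self.P - np.outer(k, a @ self.P)
        return self.theta

# Toy 1-parameter example: estimate a mass from noisy force = m * g samples.
kf, g, m_true = KalmanLinearFit(1), 9.81, 0.75
for _ in range(200):
    kf.update([g], m_true * g + 0.05 * np.random.randn())
print(kf.theta)   # close to 0.75
```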
|
|
15:05-15:10, Paper ThCT15.2 | |
Feasibility Checking and Constraint Refinement for Shared Control in Assistive Robotics |
|
Bustamante, Samuel | German Aerospace Center (DLR), Robotics and Mechatronics Center |
Rodriguez Brena, Ismael Valentin | German Aerospace Center (DLR) |
Quere, Gabriel | DLR |
Lehner, Peter | German Aerospace Center (DLR) |
Iskandar, Maged | German Aerospace Center - DLR |
Leidner, Daniel | German Aerospace Center (DLR) |
Dömel, Andreas | German Aerospace Center (DLR) |
Albu-Schäffer, Alin | DLR - German Aerospace Center |
Vogel, Jörn | German Aerospace Center (DLR) |
Stulp, Freek | DLR - Deutsches Zentrum Für Luft Und Raumfahrt E.V |
Keywords: Telerobotics and Teleoperation, Physically Assistive Devices
Abstract: Shared control enables users with motor impairments to control high-dimensional assistive robots with low-dimensional user interfaces. The challenge is to simultaneously 1) provide support for completing daily living tasks, 2) enable sufficient freedom of movement to foster user empowerment, and 3) ensure that shared control support precludes the robot from running into kinematic limitations such as obstacles, unreachable areas, or loss of manipulability due to joint limits. In this letter, we propose a framework that performs feasibility checks before executing a shared control task. We activate shared control only if a task is deemed feasible, and refine the task regions by excluding paths that are infeasible, e.g. due to obstacles or kinematic limitations. This reduces task failures, whilst still ensuring freedom of movement. We evaluate our framework on a set of daily living tasks with our wheelchair-based mobile manipulator EDAN.
|
|
15:10-15:15, Paper ThCT15.3 | |
Linearized Virtual Energy Tank for Passivity-Based Bilateral Teleoperation Using Linear MPC |
|
Piccinelli, Nicola | University of Verona |
Muradore, Riccardo | University of Verona |
Keywords: Telerobotics and Teleoperation, Optimization and Optimal Control, Physical Human-Robot Interaction, Motion Control of Manipulators
Abstract: Bilateral teleoperation systems are often used in safety-critical scenarios where human operators may interact with the environment remotely, as in robotic-assisted surgery or nuclear plant maintenance. Teleoperation's stability and transparency are the two most important properties to be satisfied, but they cannot be optimized independently since they are in contrast. This paper presents a passive linear MPC control scheme to implement bilateral teleoperation that optimizes the trade-off between stability and transparency (a.k.a. performance). First, we introduce a linear virtual energy tank with a novel energy-sharing policy, allowing us to define a passive linear MPC. Second, we provide conditions to guarantee the stability of the non-linear closed-loop system. We validate the proposed approach in a teleoperation scheme using two 7-DOF manipulators while performing an assembly task. This novel passivity-based bilateral teleoperation using linear MPC and linearized energy tank reduces the computational effort of existing passive non-linear MPC controllers.
|
|
15:15-15:20, Paper ThCT15.4 | |
RNN-Based Visual Guidance for Enhanced Sense of Agency in Teleoperation with Time-Varying Delays |
|
Morita, Tomoya | Nagoya University |
Armleder, Simon | Technische Universität München |
Zhu, Yaonan | University of Tokyo |
Iino, Hiroto | Waseda University |
Aoyama, Tadayoshi | Nagoya University |
Cheng, Gordon | Technical University of Munich |
Hasegawa, Yasuhisa | Nagoya University |
Keywords: Telerobotics and Teleoperation, Embodied Cognitive Science, Learning from Demonstration
Abstract: Intuitive teleoperation enables operators to embody remote robots, providing the sensation that the robot is part of their own body during control. The sense of agency (SoA), i.e., the feeling of controlling the robot, contributes to enhanced motivation and embodiment during teleoperation. However, the SoA can be diminished by time-varying communication delays associated with teleoperation. We propose a visual guidance system to assist operations while maintaining a high SoA when teleoperating robots with time-varying delays, thereby improving positioning accuracy. In the proposed system, a recurrent neural network (RNN) model, trained on the pouring tasks of skilled operators, predicts the input position 500 ms ahead of the input from the novice operator and visually presents it in real-time as the end-effector target position. Experiments with time-varying delays confirmed that the proposed method provides a visual representation of the target position interpolated in time and space from the real-time input of the operator, guiding the operator to align with the trajectory of the skilled operator. The proposed method significantly improves task performance even under time-varying delays while maintaining a high SoA compared to other conditions. Applying the prediction system developed in this study to human–robot collaborative control may enable interventions that maintain the SoA.
|
|
15:20-15:25, Paper ThCT15.5 | |
Unified Contact Model and Hybrid Motion/force Control for Teleoperated Manipulation in Unknown Environments (I) |
|
Huang, Fanghao | Zhejiang University |
Yang, Xiao | Zhejiang University |
Mei, Deqing | Zhejiang University |
Chen, Zheng | Zhejiang University |
Keywords: Telerobotics and Teleoperation, Force Control, Contact Modeling
Abstract: Teleoperated manipulation guided by human intelligence is an effective solution for tackling complicated tasks in unknown environments. However, uncertain contact with the environment is the main challenge to achieving good teleoperation, and some inevitable issues such as nonlinearities, various uncertainties, constraints, and communication delays in the local and remote robots should also be taken into account. In this article, a unified contact model is proposed as the targeted environment interaction for the remote robot, which can cover various conditions such as free motion and rigid contact by setting different model parameters. Subsequently, a hybrid motion/force controller is developed to cope with nonlinearities and various uncertainties via an adaptive robust technique, thus guaranteeing system stability and good transient convergence to the unified contact model. In particular, to handle mis-teleoperation or sudden changes of the conditions described in the unified contact model, a model predictive control-based contact optimization method is developed as the outer loop of the hybrid motion/force controller, which plans the desired motion and force trajectories that meet the state and targeted interaction constraints. Through the estimation and transmission of the environment parameters, the environment dynamics is reconstructed on the local side, which provides effective force feedback of the remote contact environment for the human operator. Since the transmitted signal is replaced by the estimated environment parameters, the power cycle in the communication channel is eliminated. This design avoids the passivity stability problem caused by communication delays, so stability and good transparency under delays can be guaranteed by the separate design of local and remote controllers. Comparative experiments verify the effectiveness of the proposed framework for teleoperated manipulation.
|
|
15:25-15:30, Paper ThCT15.6 | |
Whole-Body Teleoperation for Mobile Manipulation at Zero Added Cost |
|
Honerkamp, Daniel | Albert Ludwigs Universität Freiburg |
Mahesheka, Harsh | Indian Institute of Technology, Varanasi |
von Hartz, Jan Ole | University of Freiburg |
Welschehold, Tim | Albert-Ludwigs-Universität Freiburg |
Valada, Abhinav | University of Freiburg |
Keywords: Telerobotics and Teleoperation, Mobile Manipulation, Human-Robot Collaboration
Abstract: Demonstration data plays a key role in learning complex behaviors and training robotic foundation models. While effective control interfaces exist for static manipulators, data collection remains cumbersome and time intensive for mobile manipulators due to their large number of degrees of freedom. While specialized hardware, avatars, or motion tracking can enable whole-body control, these approaches are either expensive, robot-specific, or suffer from the embodiment mismatch between robot and human demonstrator. In this work, we present MoMa-Teleop, a novel teleoperation method that infers end-effector motions from existing interfaces and delegates the base motions to a previously developed reinforcement learning agent, leaving the operator to focus fully on the task-relevant end-effector motions. This enables whole-body teleoperation of mobile manipulators with no additional hardware or setup costs via standard interfaces such as joysticks or hand guidance. Moreover, the operator is not bound to a tracked workspace and can move freely with the robot over spatially extended tasks. We demonstrate that our approach results in a significant reduction in task completion time across a variety of robots and tasks. As the generated data covers diverse whole-body motions without embodiment mismatch, it enables efficient imitation learning. By focusing on task-specific end-effector motions, our approach learns skills that transfer to unseen settings, such as new obstacles or changed object positions, from as little as five demonstrations. We make code and videos available at https://moma-teleop.cs.uni-freiburg.de.
|
|
15:30-15:35, Paper ThCT15.7 | |
Design, Integration, and Field Testing of a Digital Twin-Based Teleoperated Rock Scaling Robot (I) |
|
Le, Dinh Tung | University of Technology Sydney |
Sutjipto, Sheila | University of Technology, Sydney |
Nguyen, Dac Dang Khoa | University of Technology Sydney |
Paul, Gavin | University of Technology Sydney |
Keywords: Virtual Reality and Interfaces, Robotics in Hazardous Fields, Telerobotics and Teleoperation
Abstract: This paper presents the design, integration, and field testing of a digital twin-based teleoperated rock scaling robot aimed at improving safety in mining operations. Traditional rock scaling, which involves the removal of loose rocks to prevent rockfall, poses significant risks to mine site workers. The proposed solution is a teleoperated custom mobile manipulator capable of rope-based abseiling locomotion, equipped with an air chipper end-effector. Teleoperation is facilitated by live digital twins of the robot and environment, with a virtual reality (VR) interface that allows operators to perform rock scaling tasks within an immersive virtual reconstruction of the remote scene. The robot's hardware design and sensing capabilities are detailed, along with the system's teleoperation architecture. Key components include the integration of an optimised, hardware-accelerated, image-based point cloud streaming implementation; a markerless depth-camera extrinsic calibration process suitable for field settings; and the system’s teleoperation interfaces featuring a cyber-physical VR interface with affordance feedback. Field tests at a sandstone quarry and an open-pit mine demonstrate significant improvements in operator safety, and highlight the system’s ability to withstand harsh mining environments while performing teleoperated rock scaling at its current scaled-down size and power. We collected and analysed user data from rope access technicians with no prior experience in robot teleoperation or VR. The results suggest the system's intuitiveness with learning effects over time. Lessons from these site trials, including hardware and software limitations, are discussed, providing directions for further robot design improvements and enhancements to the digital twin teleoperation architecture.
|
|
15:35-15:40, Paper ThCT15.8 | |
Improved Availability of Mobile Network Teleoperation by Employing a Video Conferencing Application with Audio-Embedded Commands |
|
Hatano, Sho | Nagoya University |
Zhu, Yaonan | University of Tokyo |
Aoyama, Tadayoshi | Nagoya University |
Hasegawa, Yasuhisa | Nagoya University |
Keywords: Virtual Reality and Interfaces, Telerobotics and Teleoperation
Abstract: This study proposed an innovative teleoperation system utilizing video conferencing applications to enhance availability and operability in mobile networks. The proposed system accomplished this by transmitting video through screen sharing and embedding control signals into audio signals by converting them into sine waves whose amplitudes correspond to control values. Compared with conventional VPN-based approaches, our approach reduced communication delay over mobile data networks by more than 1 s and, in evaluation experiments using Toyota’s Human Support Robot, demonstrated improved operational accuracy. The proposed method expanded the operational range beyond Wi-Fi or cable limitations, enabling outdoor remote control via mobile data communication. This advancement opens new possibilities in hazardous environment operations, elderly care, and data collection for AI model training. Our approach improved the flexibility of teleoperation systems, paving the way for widespread adoption across various industries and research domains. Future improvements in video conferencing applications and next-generation mobile networks could further enhance system performance.
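To make the audio-embedding idea concrete, here is a rough sketch of encoding a normalized control value as the amplitude of a fixed-frequency sine tone and recovering it on the receiving side; the sample rate, tone frequency, and chunk length are assumptions, not values from the paper:

```python
import numpy as np

FS = 16000          # assumed audio sample rate of the conferencing channel
TONE_HZ = 1000      # assumed carrier frequency for one command channel

def encode_command(value, duration=0.05, fs=FS, tone=TONE_HZ):
    """Encode a normalized control value (0..1) as the amplitude of a sine tone."""
    t = np.arange(int(duration * fs)) / fs
    return value * np.sin(2 * np.pi * tone * t)

def decode_command(audio, fs=FS, tone=TONE_HZ):
    """Recover the control value as the tone amplitude via sine/cosine correlation
    (robust to an unknown phase shift introduced by the channel)."""
    t = np.arange(len(audio)) / fs
    a = 2 * np.dot(audio, np.sin(2 * np.pi * tone * t)) / len(audio)
    b = 2 * np.dot(audio, np.cos(2 * np.pi * tone * t)) / len(audio)
    return float(np.hypot(a, b))

chunk = encode_command(0.42)
print(round(decode_command(chunk), 3))   # ~0.42
```

Several commands could share the stream by assigning each its own carrier frequency, at the cost of audio-codec distortion that a real system would have to calibrate for.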
|
|
ThCT16 |
207 |
Task and Motion Planning 3 |
Regular Session |
Co-Chair: Street, Charlie | University of Birmingham |
|
15:00-15:05, Paper ThCT16.1 | |
Signal Temporal Logic Compliant Co-Design of Planning and Control |
|
Juvvi, Manas Sashank | Indian Institute of Science, Bengaluru |
Kurne, Tushar | Indian Institute of Science |
J, Vaishnavi | Indian Institute of Science |
Kolathaya, Shishir | Indian Institute of Science |
Jagtap, Pushpak | Indian Institute of Science |
Keywords: Task and Motion Planning, Formal Methods in Robotics and Automation, Motion and Path Planning
Abstract: This work presents a novel co-design strategy that integrates trajectory planning and control to handle STL-based tasks in autonomous robots. The method consists of two phases: (i) learning spatio-temporal motion primitives to encapsulate the inherent robot-specific constraints and (ii) constructing an STL-compliant motion plan from these primitives. Initially, we employ reinforcement learning to construct a library of control policies that perform trajectories described by the motion primitives. Then, we map motion primitives to spatio-temporal characteristics. Subsequently, we present a sampling-based STL-compliant motion planning strategy to meet the STL specification. The proposed model-free approach, which generates feasible STL-compliant motion plans across various environments, is validated on differential-drive and quadruped robots across various STL specifications. Demonstration videos are available at https://youtu.be/xo2cXRYdDPQ and detailed version at https://doi.org/10.48550/arXiv.2507.13225.
|
|
15:05-15:10, Paper ThCT16.2 | |
Enhancing Object Search in Indoor Spaces Via Personalized Object-Factored Ontologies |
|
Chikhalikar, Akash | Tohoku University |
Ravankar, Ankit A. | Tohoku University |
Salazar Luces, Jose Victorio | Tohoku University |
Hirata, Yasuhisa | Tohoku University |
Keywords: Task Planning, Task and Motion Planning, Autonomous Agents
Abstract: Personalization is critical for the advancement of service robots. Robots need to develop tailored understandings of the environments they are put in. Moreover, they need to be aware of changes in the environment to facilitate long-term deployment. Long-term understanding as well as personalization is necessary to execute complex tasks like Prepare Dinner Table or Tidy My Room. A precursor to such tasks is that of ‘Object Search’. Consequently, this paper focuses on locating and searching multiple objects in indoor environments. In this paper, we propose two crucial novelties. Firstly, we propose a novel framework that can enable robots to deduce Personalized Ontologies of indoor environments. Our framework consists of a personalization schema that enables the robot to tune its understanding of ontologies. Secondly, we propose an Adaptive Inferencing strategy. We integrate ‘Dynamic Belief Updates’ into our approach which improves performance in multi-object search tasks. The cumulative effect of Personalization and Adaptive Inferencing is an improved capability in long-term Object Search. This framework is implemented on top of a multi-layered semantic map. We conduct experiments in real environments and compare our results against various state-of-the-art (SOTA) methods to demonstrate the effectiveness of our approach. Additionally, we show that Personalization can act as a catalyst to enhance the performance of SOTAs. Video Link: https://bit.ly/3WHk9i9
|
|
15:10-15:15, Paper ThCT16.3 | |
LERa: Replanning with Visual Feedback in Instruction Following |
|
Pchelintsev, Svyatoslav | Moscow Institute of Physics and Technology (National Research Un |
Patratskiy, Maxim | Moscow Institute of Physics and Technology (National Research Un |
Onishenko, Anatoly | Moscow Institute of Physics and Technology |
Korchemnyi, Alexandr | Sber |
Medvedev, Aleksandr | Sberbank of Russia, Robotics Center |
Vinogradova, Uliana | Sberbank of Russia, Robotics Center |
Galuzinsky, Ilya | Sberbank of Russia, Robotics Center |
Postnikov, Aleksey | Skoltech, Sber |
Kovalev, Alexey | AIRI |
Panov, Aleksandr | AIRI |
Keywords: Task Planning, Failure Detection and Recovery
Abstract: Large Language Models are increasingly used in robotics for task planning, but their reliance on textual inputs limits their adaptability to real-world changes and failures. To address these challenges, we propose LERa - Look, Explain, Replan - a Visual Language Model-based replanning approach that utilizes visual feedback. Unlike existing methods, LERa requires only a raw RGB image, a natural language instruction, an initial task plan, and failure detection - without additional information such as object detection or predefined conditions that may be unavailable in a given scenario. The replanning process consists of three steps: (i) Look - where LERa generates a scene description and identifies errors; (ii) Explain - where it provides corrective guidance; and (iii) Replan - where it modifies the plan accordingly. LERa is adaptable to various agent architectures and can handle errors from both dynamic scene changes and task execution failures. We evaluate LERa on the newly introduced ALFRED-ChaOS and VirtualHome-ChaOS datasets, achieving a 40% improvement over baselines in dynamic environments. In tabletop manipulation tasks with a predefined probability of task failure within the PyBullet simulator, LERa improves success rates by up to 67%. Further experiments, including real-world trials with a tabletop manipulator robot, confirm LERa’s effectiveness in replanning. We demonstrate that LERa is a robust and adaptable solution for error-aware task execution in robotics. The project page is available at https://lera-robo.github.io.
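The Look-Explain-Replan loop can be sketched as three chained model queries. The `vlm` callable below is a hypothetical stand-in for whatever vision-language-model client is used; the prompts and the dummy model in the usage are illustrative only and do not reproduce the paper's prompting:

```python
def lera_replan(vlm, image, instruction, plan, failed_step):
    """Look -> Explain -> Replan, mirroring the three stages named in the abstract.
    `vlm(prompt, image) -> str` is a user-supplied vision-language-model client."""
    scene = vlm("Describe the scene and any errors relevant to the task: "
                + instruction, image)                                   # (i) Look
    advice = vlm(f"The plan {plan} failed at step {failed_step}. "
                 f"Scene: {scene}. Explain what went wrong and how to fix it.",
                 image)                                                 # (ii) Explain
    new_plan = vlm(f"Rewrite the plan {plan} following this guidance: {advice}. "
                   f"Return one action per line.", image)               # (iii) Replan
    return new_plan.splitlines()

# toy usage with a dummy model that always suggests placing the cup on the table
dummy = lambda prompt, image: "pick up cup\nplace cup on table"
print(lera_replan(dummy, image=None, instruction="set the table",
                  plan=["pick up cup", "place cup on shelf"], failed_step=1))
```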
|
|
15:15-15:20, Paper ThCT16.4 | |
Planning under Uncertainty from Behaviour Trees |
|
Street, Charlie | University of Birmingham |
Grubb, Oliver Ian | University of Birmingham |
Mansouri, Masoumeh | Birmingham University |
Keywords: Task Planning, Probability and Statistical Methods, Autonomous Agents
Abstract: Behaviour trees (BTs) are popular within robotics due to their reactivity, reusability, and modularity. BTs are often designed by hand using expert domain knowledge. However, robot environments contain sources of uncertainty which affect robot behaviour. It is challenging for human designers to reason over the effects of uncertainty up to the task horizon, limiting robot performance. For example, the chance of an unexpected blockage late along a robot’s route should encourage the robot to take an alternate path. Therefore, in this paper we refine the task-level behaviour encoded in a BT through planning under uncertainty. The refinement process modifies when action nodes are executed by reasoning over the effects of uncertainty, improving task performance. We first extract a state space from the BT and learn a set of Bayesian networks (BNs) which model the stochastic dynamics of robot actions. We then use the extracted state space and BNs to construct and solve a Markov decision process which captures robot execution. This produces a policy which describes the refined behaviour. We empirically demonstrate how our approach reduces the completion time for robot navigation and search tasks.
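As a generic illustration of the final step only (solving the constructed MDP; the BT state-space extraction and Bayesian-network learning are omitted), a small value-iteration solver over tabulated transitions might look like this, with invented toy arrays:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Solve a finite MDP with |A| actions and |S| states.
    P[a, s, s'] : transition probabilities (e.g. estimated from learned BNs)
    R[a, s]     : expected immediate reward for action a in state s
    Returns a greedy policy (one action index per state) and the state values."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * (P @ V)          # (A, S) action values
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=0), V_new
        V = V_new

# toy 2-state, 2-action problem: action 1 reaches the rewarding state faster
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
print(value_iteration(P, R))             # policy prefers action 1 in state 0
```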
|
|
15:20-15:25, Paper ThCT16.5 | |
Towards Zero-Knowledge Task Planning Via a Language-Based Approach |
|
Merz Hoffmeister, Liam | Yale University |
Scassellati, Brian | Yale |
Rakita, Daniel | Yale University |
Keywords: Task Planning
Abstract: In this work, we introduce and formalize the Zero-Knowledge Task Planning (ZKTP) problem, i.e., formulating a sequence of actions to achieve some goal without task-specific knowledge. Additionally, we present a first investigation and approach for ZKTP that leverages a large language model (LLM) to decompose natural language instructions into subtasks and generate behavior trees (BTs) for execution. If errors arise during task execution, the approach also uses an LLM to adjust the BTs on-the-fly in a refinement loop. Experimental validation in the AI2-THOR simulator demonstrates our approach's effectiveness in improving overall task performance compared to alternative approaches that leverage task-specific knowledge. Our work demonstrates the potential of LLMs to effectively address several aspects of the ZKTP problem, providing a robust framework for automated behavior generation with no task-specific setup.
|
|
15:25-15:30, Paper ThCT16.6 | |
LTLCodeGen: Code Generation of Syntactically Correct Temporal Logic for Robot Task Planning |
|
Rabiei, Behrad | University of California San Diego |
Ananthakrishnan Rameshkumar, Mahesh Kumar | University of California San Diego |
Dai, Zhirui | UC San Diego |
Pilla, Surya Lakshmi Subba Rao | University of California San Diego |
Dong, Qiyue | University of California, San Diego |
Atanasov, Nikolay | University of California, San Diego |
Keywords: Task and Motion Planning, Formal Methods in Robotics and Automation, Semantic Scene Understanding
Abstract: This paper focuses on planning robot navigation tasks from natural language specifications. We develop a modular approach, where a large language model (LLM) translates the natural language instructions into a linear temporal logic (LTL) formula with propositions defined by object classes in a semantic occupancy map. The LTL formula and the semantic occupancy map are provided to a motion planning algorithm to generate a collision-free robot path that satisfies the natural language instructions. Our main contribution is LTLCodeGen, a method to translate natural language to syntactically correct LTL using code generation. We demonstrate the complete task planning method in real-world experiments involving human speech to provide navigation instructions to a mobile robot. We also thoroughly evaluate our approach in simulated and real-world experiments in comparison to end-to-end LLM task planning and state-of-the-art LLM-to-LTL translation methods.
|
|
15:30-15:35, Paper ThCT16.7 | |
Code-As-Symbolic-Planner: Foundation Model-Based Robot Planning Via Symbolic Code Generation |
|
Chen, Yongchao | Harvard University |
Hao, Yilun | Stanford University |
Zhang, Yang | IBM |
Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Task and Motion Planning, Task Planning, Machine Learning for Robot Control
Abstract: Recent works have shown the great potential of Large Language Models (LLMs) in robot task and motion planning (TAMP). Current LLM approaches generate text- or code-based reasoning chains with sub-goals and action plans. However, they do not fully leverage LLMs' symbolic computing and code generation capabilities. Many robot TAMP tasks involve complex optimization under multiple constraints, where pure textual reasoning is insufficient. While augmenting LLMs with predefined solvers and planners improves performance, it lacks generalization across tasks. Given LLMs' growing coding proficiency, we enhance their TAMP capabilities by steering them to generate code as symbolic planners for optimization and constraint verification. Unlike prior work that uses code to interface with robot action modules or pre-designed planners, we steer LLMs to generate code as solvers, planners, and checkers for TAMP tasks requiring symbolic computing, while still leveraging textual reasoning to incorporate common sense. With a multi-round guidance and answer evolution framework, the proposed Code-as-Symbolic-Planner improves success rates by an average of 24.1% over the best baseline methods across seven typical TAMP tasks and three popular LLMs. Code-as-Symbolic-Planner shows strong effectiveness and generalizability across discrete and continuous environments, 2D/3D simulations and real-world settings, as well as single- and multi-robot tasks with diverse requirements. See our project website https://yongchao98.github.io/Code-Symbol-Planner/ for prompts, videos, and code.
|
|
15:35-15:40, Paper ThCT16.8 | |
Explicit-Implicit Subgoal Planning for Long-Horizon Tasks with Sparse Rewards (I) |
|
Wang, Fangyuan | The Hong Kong Polytechnic University |
Duan, Anqing | Mohamed Bin Zayed University of Artificial Intelligence |
Zhou, Peng | Great Bay University |
Huo, Shengzeng | The Hong Kong Polytechnic University |
Guo, Guodong | Eastern Institute of Technology |
Yang, Chenguang | University of Liverpool |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Task and Motion Planning, Manipulation Planning, Task Planning
Abstract: The challenges inherent in long-horizon tasks in robotics persist due to the typically inefficient exploration and sparse rewards in traditional reinforcement learning approaches. To address these challenges, we have developed a novel algorithm, termed explicit-implicit subgoal planning (EISP), designed to tackle long-horizon tasks through a divide-and-conquer approach. We utilize two primary criteria, feasibility and optimality, to ensure the quality of the generated subgoals. EISP consists of three components: a hybrid subgoal generator, a hindsight sampler, and a value selector. The hybrid subgoal generator uses an explicit model to infer subgoals and an implicit model to predict the final goal, inspired by the way humans infer subgoals from the current state and final goal, and reason about the final goal conditioned on the current state and the given subgoals. Additionally, the hindsight sampler selects valid subgoals from an offline dataset to enhance the feasibility of the generated subgoals. The value selector utilizes the value function in reinforcement learning to filter the optimal subgoals from the subgoal candidates. To validate our method, we conduct experiments on four long-horizon tasks in both simulation and the real world. The obtained quantitative and qualitative data indicate that our approach achieves promising performance compared to other baseline methods. These experimental results can be seen on the website https://sites.google.com/view/vaesi.
|
|
ThCT17 |
210A |
Field Robots 3 |
Regular Session |
|
15:00-15:05, Paper ThCT17.1 | |
Learning-Based On-Track System Identification for Scaled Autonomous Racing in under a Minute |
|
Dikici, Onur | Politecnico Di Milano |
Ghignone, Edoardo | ETH |
Hu, Cheng | Zhejiang University |
Baumann, Nicolas | ETH |
Xie, Lei | Zhejiang University |
Carron, Andrea | ETH Zurich |
Magno, Michele | ETH Zurich |
Corno, Matteo | Politecnico Di Milano |
Keywords: Field Robots, Wheeled Robots, Machine Learning for Robot Control
Abstract: Accurate tire modeling is crucial for optimizing autonomous racing vehicles, as State-of-the-Art (SotA) model-based techniques rely on precise knowledge of the vehicle’s parameters, yet system identification in dynamic racing conditions is challenging due to varying track and tire conditions. Traditional methods require extensive operational ranges, often impractical in racing scenarios. Machine Learning (ML)-based methods, while improving performance, struggle with generalization and depend on accurate initialization. This paper introduces a novel on-track system identification algorithm, incorporating an NN for error correction, which is then employed for traditional system identification with virtually generated data. Crucially, the process is iteratively reapplied, with tire parameters updated at each cycle, leading to notable improvements in accuracy in tests on a scaled vehicle. Experiments show that it is possible to learn a tire model without prior knowledge with only 30 seconds of driving data, and 3 seconds of training time. This method demonstrates greater one-step prediction accuracy than the baseline Nonlinear Least Squares (NLS) method under noisy conditions, achieving a 3.3x lower Root Mean Square Error (RMSE), and yields tire models with comparable accuracy to traditional steady-state system identification. Furthermore, unlike steady-state methods requiring large spaces and specific experimental setups, the proposed approach identifies tire parameters directly on a race track in dynamic racing environments.
|
|
15:05-15:10, Paper ThCT17.2 | |
UTC-RS: An Underwater Tracked Cleaning Robot System for Hydraulic Structures |
|
Mao, Juzheng | Southeast University |
Zhou, Jun | Southeast University |
Xie, Feng | Southeast University |
Liu, Yue | Southeast University |
Song, Guangming | Southeast University |
Song, Aiguo | Southeast University |
Keywords: Field Robots, Marine Robotics, Robotics and Automation in Construction
Abstract: During the inspection and maintenance of the underwater parts of hydraulic structures, it is often necessary to clean the surface of a certain area for subsequent operations. At present, there are few robots capable of fine underwater cleaning. Therefore, this article presents the design of a novel tracked robot system that can be used for delicate underwater surface cleaning operations. The robot uses a crawling track chassis and is equipped with a lightweight underwater manipulator, force-controlled grippers, and underwater brushing tools. Considering the complexity of the underwater environment, an underwater force-position hybrid control algorithm for the robot that accounts for water flow resistance is proposed. The effectiveness of this algorithm is verified in an experimental pool, and the proposed robot system is applied to an actual engineering site. The robot achieves a brushing efficiency of approximately 72 m^2 per hour, with a force control accuracy of ±0.5 N, and the cleaning effect significantly improves the condition of the hydraulic structure surface.
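A stripped-down sketch of the hybrid force-position idea described above: position control in the plane tangent to the surface and force control along the surface normal, using a selection matrix. The gains are placeholders and the paper's water-flow-resistance compensation is omitted:

```python
import numpy as np

def hybrid_force_position(pos, pos_ref, f_meas, f_ref, n_hat,
                          kp_pos=80.0, kp_f=0.02, ki_f=0.05, f_int=0.0, dt=0.01):
    """One step of a simple hybrid controller: position control in the plane
    tangent to the wall (selection matrix I - n n^T) and PI force control
    along the surface normal n_hat. Returns a velocity command and the
    updated force-error integral."""
    n = n_hat / np.linalg.norm(n_hat)
    S_pos = np.eye(3) - np.outer(n, n)            # tangential selection matrix
    e_f = f_ref - f_meas @ n                      # normal force error (scalar)
    f_int = f_int + e_f * dt                      # force-error integral
    v_cmd = S_pos @ (kp_pos * (pos_ref - pos)) + n * (kp_f * e_f + ki_f * f_int)
    return v_cmd, f_int

# toy step: press against a wall (normal +z) with 5 N while tracking x
v, f_int = hybrid_force_position(pos=np.zeros(3), pos_ref=np.array([0.1, 0.0, 0.0]),
                                 f_meas=np.array([0.0, 0.0, 2.0]), f_ref=5.0,
                                 n_hat=np.array([0.0, 0.0, 1.0]))
print(v)
```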
|
|
15:10-15:15, Paper ThCT17.3 | |
A Multi-Robot Exploration Planner for Space Applications |
|
Varadharajan, Vivek shankar | Polytechnique Montréal |
Beltrame, Giovanni | Ecole Polytechnique De Montreal |
Keywords: Field Robots, Space Robotics and Automation, Motion and Path Planning
Abstract: We propose a distributed multi-robot exploration planning method designed for complex, unconstrained environments featuring steep elevation changes. The method employs a two-tiered approach: a local exploration planner that constructs a grid graph to maximize exploration gain and a global planner that maintains a sparse navigational graph to track visited locations and frontier information. The global graphs are periodically synchronized among robots within communication range to maintain an updated representation of the environment. Our approach integrates localization loop closure estimates to correct global graph drift. In simulation and field tests, the proposed method achieves 50% lower computational runtime compared to state-of-the-art methods while demonstrating superior exploration coverage. We evaluate its performance in two simulated subterranean environments and in field experiments at a Mars-analog terrain.
|
|
15:15-15:20, Paper ThCT17.4 | |
Magnetic Wall-Climbing Wheels with Controllable Adhesion Reduction Via Soft Magnetic Material |
|
Tian, Yang | Shinshu University |
Jitsukawa, Hayato | Ritsumeikan University |
Ma, Shugen | Hong Kong University of Science and Technology (Guangzhou) |
Zhang, Guoteng | Shandong University |
Keywords: Field Robots, Robotics and Automation in Construction, Robot Safety
Abstract: With the aging of critical infrastructure like bridges and plant facilities, the development of innovative wall-climbing robots using permanent magnets has become increasingly important. Traditional designs of such robots rely on controlling the position of lifters or permanent magnets to adjust the adhesion condition, which introduces significant safety concerns, including inconsistent adhesion during surface transitions and the risk of falls when control is lost. To overcome these issues, this paper introduces a novel magnet wheel design that utilizes Soft Magnetic Material (SMM) to control the reduction of adhesive force in a specific direction. The effect on adhesion is demonstrated through a comparative analysis of various magnet and SMM configurations. Based on these analyses, a wheel design is presented, along with an investigation of the effect of the SMM cover region. To verify the effectiveness of the adhesion reduction, a robot with the proposed wheel is presented and analyzed for wall-climbing tasks. In experiments, a prototype robot equipped with the proposed wheel design demonstrates enhanced safety for wall-climbing tasks under controlled conditions.
|
|
15:20-15:25, Paper ThCT17.5 | |
From Concept to Field Trials: Design, Analysis, and Evaluation of a Novel Quadruped Robot with Deformable Wheel-Foot Structure |
|
Ju, Zhongjin | Yanshan University |
Wei, Ke | Yanshan University |
Hu, Kaidong | Yanshan University |
Xu, Yundou | Parallel Robot and Mechatronic System Laboratory of Hebei Provin |
Keywords: Field Robots, Legged Robots, Mechanism Design, Wheeled Robots
Abstract: This study introduces a novel quadruped robot, the TerraAdapt, furnished with an innovative deformable wheel-foot integrated structure. This unique design grants the robot the flexibility to alternate between wheeled and footed modes of locomotion, making it efficient in traversing diverse terrains, from smooth indoor floors to challenging outdoor landscapes laden with obstacles. The study delineates an in-depth design and analysis of the deformable wheel and its integrated wheel-foot structure using screw theory. We engineer a 2RRR-RP wheel-foot mode-switching mechanism by modifying a 2RRR spatial six-bar mechanism with an additional RP branch. This mechanism aids in seamless transitioning between different movement modes. Moreover, a 2RRR parallel structure is employed to construct the footed mode structure. To substantiate the viability and efficacy of the proposed design, we carry out extensive motion simulations and construct an experimental prototype for field testing. The field trials reveal the robot’s adeptness in adapting to varied terrains, highlighting the possible advantages of incorporating the proposed deformable wheel into micro mobile robot designs.
|
|
15:25-15:30, Paper ThCT17.6 | |
Online Triangular Constraint Calibration for LiDAR and Cameras in Open-Pit Mines (I) |
|
Li, Yuchen | Hong Kong Baptist University; UIC; Zongmu Tech; |
Li, Luxi | Hong Kong Baptist University |
Teng, Siyu | HKBU; UIC |
Bing, Zhenshan | Technical University of Munich |
Knoll, Alois | Technical University of Munich |
Xuanyuan, Zhe | Beijing Normal University-Hong Kong Baptist University United In |
Chen, Long | Chinese Academy of Sciences |
Keywords: Field Robots, Mining Robotics, Calibration and Identification
Abstract: Extrinsic calibration between a LiDAR and a camera is a crucial task for intelligent vehicles and mobile robots. However, most of the current extrinsic calibration methods are offline and target-based, making it hard to achieve real-time calibration depending on the characteristics of the environments. Although a few online calibration strategies can solve the above drawbacks, they still rely on a large amount of labeled data and struggle to achieve satisfactory performance in extrinsic calibration in unstructured scenarios where valid targets are scarce. In this work, we propose an effective online calibration strategy for mining scenarios, one of the typical unstructured scenarios. Based on the analysis of mining area data, we adopt an approach that integrates image and point cloud segmentation to accomplish the entire calibration process. An initialization method supporting a single rigid target is proposed, effectively addressing the initialization failure caused by the scarcity of targets in mining areas. Incorporating the segmentation results from the right camera to establish a left-right camera-LiDAR triangular constraint effectively reduces translation errors. Temporal data with confidence scores are then used to update and optimize the extrinsic parameters. Our online extrinsic calibration method has been validated on mining scenario data, achieving translation errors of less than 0.2 m on data collected from an SUV and 0.3 m on data from trucks within 10 frames. In addition, we provide a detailed discussion on the challenges of calibration in mining areas and the limitations of the proposed method.
|
|
15:30-15:35, Paper ThCT17.7 | |
A System for Multi-View Mapping of Dynamic Scenes Using Time-Synchronized UAVs |
|
Gupta, Aniket | Northeastern University |
Giaya, Dennis | Northeastern University |
Annadanam, Vishnu Rohit | Northeastern University |
Diddi, Mithun | Northeastern University |
Jiang, Huaizu | Northeastern University |
Singh, Hanumant | Northeastern University |
Keywords: Data Sets for Robotic Vision, Field Robots, Aerial Systems: Applications
Abstract: Recent advances in 3D scene reconstruction, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, have demonstrated remarkable results in novel view synthesis and dynamic scene representation. Despite these successes, existing approaches rely on time-synchronized multi-view imagery captured using specialized camera rigs in controlled environments. This reliance limits their applicability in uncontrolled, unbounded dynamic scenes. In this work, we propose a novel Unmanned Aerial Vehicle (UAV) based multi-view capture system that leverages GNSS Pulse Per Second (PPS) signals for precise frame synchronization across multiple cameras. Our system eliminates the need for fixed infrastructure, enabling flexible and scalable data collection for dynamic scene reconstruction in diverse environments. In addition to the system architecture, we also introduce a dataset of synchronized multi-view images captured in unbounded outdoor scenes from four synchronized UAVs, each carrying a stereo camera rig. We benchmark several 3D and 4D representation methods on our dataset and highlight the challenges associated with data collection in unstructured outdoor settings such as sparse views, varied lighting conditions, visual degradation, etc. Our hardware configuration details, software details, and dataset are available at https://github.com/neufieldrobotics/Dynamic_Mapping.
|
|
15:35-15:40, Paper ThCT17.8 | |
Blind-Wayfarer: A Minimalist, Probing-Driven Framework for Resilient Navigation in Perception-Degraded Environments |
|
Xu, Yanran | University of Southampton |
Zauner, Klaus-Peter | University of Southampton |
Tarapore, Danesh | University of Southampton |
Keywords: Behavior-Based Systems, Robotics and Automation in Agriculture and Forestry, Field Robots
Abstract: Navigating autonomous robots through dense forests and rugged terrains is especially daunting when exteroceptive sensors, such as cameras and LiDAR sensors, fail under occlusions, low-light conditions, or sensor noise. We present Blind-Wayfarer, a probing-driven navigation framework inspired by maze-solving algorithms that relies primarily on a compass to robustly traverse complex, unstructured environments. In 1,000 simulated forest experiments, Blind-Wayfarer achieved a 99.7% success rate. In real-world tests in two distinct scenarios, with rover platforms of different sizes, our approach successfully escaped forest entrapments in all 20 trials. Remarkably, our framework also enabled a robot to escape a dense woodland, traveling from 45 m inside the forest to a paved pathway at its edge. These findings highlight the potential of probing-based methods for reliable navigation in challenging perception-degraded field conditions. Videos and code are available on our website https://sites.google.com/view/blind-wayfarer.
|
|
ThCT18 |
210B |
Mapping 3 |
Regular Session |
Chair: Wang, Yue | Zhejiang University |
|
15:00-15:05, Paper ThCT18.1 | |
PRISM-TopoMap: Online Topological Mapping with Place Recognition and Scan Matching |
|
Muravyev, Kirill | Federal Research Center for Computer Science and Control of Russ |
Melekhin, Alexander | Moscow Institute of Physics and Technology |
Yudin, Dmitry | Moscow Institute of Physics and Technology |
Yakovlev, Konstantin | Federal Research Center "Computer Science and Control" of the Ru |
Keywords: Mapping, Localization, SLAM
Abstract: Mapping is one of the crucial tasks enabling autonomous navigation of a mobile robot. Conventional mapping methods output a dense geometric map representation, e.g. an occupancy grid, which is not trivial to keep consistent for prolonged runs covering large environments. Meanwhile, capturing the topological structure of the workspace enables fast path planning, is typically less prone to odometry error accumulation, and does not consume much memory. Following this idea, this paper introduces PRISM-TopoMap -- a topological mapping method that maintains a graph of locally aligned locations without relying on global metric coordinates. The proposed method involves original learnable multimodal place recognition paired with a scan matching pipeline for localization and loop closure in the graph of locations. The latter is updated online, and the robot is localized in a proper node at each time step. We conduct a broad experimental evaluation of the suggested approach in a range of photo-realistic environments and on a real robot, and compare it to the state of the art. The results of the empirical evaluation confirm that PRISM-TopoMap consistently outperforms competitors computation-wise, achieves high mapping quality and performs well on a real robot. The code of PRISM-TopoMap is open-sourced and is available at: https://github.com/kirillMouraviev/prism-topomap.
|
|
15:05-15:10, Paper ThCT18.2 | |
Physics-Informed Neural Mapping and Motion Planning in Unknown Environments |
|
Liu, Yuchen | Purdue University |
Ni, Ruiqi | Purdue University |
Qureshi, Ahmed H. | Purdue University |
Keywords: Mapping, Motion and Path Planning, Deep Learning in Robotics and Automation, Physics-Informed Neural Networks
Abstract: Mapping and motion planning are two essential elements of robot intelligence that are interdependent in generating environment maps and navigating around obstacles. The existing mapping methods create maps that require computationally expensive motion planning tools to find a path solution. In this paper, we propose a new mapping feature called arrival time fields, which is a solution to the Eikonal equation. The arrival time fields can directly guide the robot in navigating the given environments. Therefore, this paper introduces a new approach called Active Neural Time Fields (Active NTFields), which is a physics-informed neural framework that actively explores the unknown environment and maps its arrival time field on the fly for robot motion planning. Our method does not require any expert data for learning and uses neural networks to directly solve the Eikonal equation for arrival time field mapping and motion planning. We benchmark our approach against state-of-the-art mapping and motion planning methods and demonstrate its superior performance in both simulated and real-world environments with a differential drive robot and a 6 degrees-of-freedom (DOF) robot
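The arrival time field T(x) satisfies the Eikonal equation ||∇T(x)|| · v(x) = 1 for travel speed v. Below is a minimal numpy check of that residual, which is the quantity a physics-informed network of this kind would be trained to drive to zero; finite differences stand in for automatic differentiation, and this is not the Active NTFields code:

```python
import numpy as np

def eikonal_residual(time_field, pts, speed=1.0, eps=1e-4):
    """Residual of the Eikonal equation ||grad T(x)|| * speed - 1 at sample
    points, using central finite differences. `time_field` maps an (N, d)
    array of positions to an (N,) array of arrival times."""
    grads = np.zeros_like(pts)
    for d in range(pts.shape[1]):
        dx = np.zeros(pts.shape[1]); dx[d] = eps
        grads[:, d] = (time_field(pts + dx) - time_field(pts - dx)) / (2 * eps)
    return np.linalg.norm(grads, axis=1) * speed - 1.0

# sanity check: distance-to-origin solves the Eikonal equation for unit speed
T = lambda x: np.linalg.norm(x, axis=1)
pts = np.random.default_rng(1).uniform(0.5, 2.0, size=(5, 2))
print(np.abs(eikonal_residual(T, pts)).max())   # close to zero
```

In training, this residual would be penalized at sampled points so that the learned field both fits observed free space and remains a valid arrival-time map usable directly for planning.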
|
|
15:10-15:15, Paper ThCT18.3 | |
SMR-GA: Semantic Map Registration under Large Perspective Differences through Genetic Algorithm |
|
Qiao, XiaoNing | National Space Science Center Chinese Academy of Sciences |
Tang, Jie | National Space Science Center Chinese Academy of Sciences |
Li, Guangyun | Ordnance Science and Research of China |
Yang, Weiqing | Ordnance Science and Research of China |
Shen, Bingke | National Space Science Center Chinese Academy of Sciences |
Niu, Wenlong | National Space Science Center Chinese Academy of Sciences |
Xie, Wenming | National Space Science Center Chinese Academy of Sciences |
Peng, Xiaodong | National Space Science Center, Chinese Academy of Sciences |
Keywords: Mapping, Multi-Robot SLAM
Abstract: Registering multiple local maps is fundamental to multi-robot collaboration. However, existing methods are insufficient to handle the complexity associated with sparse features and elevated outlier rates, which arise mainly from large perspective differences. This paper proposes a semantic map registration method that uses a genetic algorithm to handle large perspective differences. We introduce a potential correspondence point selection strategy to eliminate outliers. A fitness computation model is developed that tightly couples semantic and geometric features. To help avoid local minima and accelerate convergence, an adaptive mutation step-size adjustment strategy is proposed. Experiments on the KITTI dataset show that our algorithm significantly improves accuracy and success rate, even when the perspective difference exceeds 90°. Ablation experiments confirm the effectiveness of each component.
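A toy illustration of genetic-algorithm-based 2D map registration with a fitness that couples semantic labels and geometric distance, including a shrinking (adaptive) mutation step; this generic sketch omits the paper's correspondence pre-selection and is not SMR-GA itself:

```python
import numpy as np

def register_ga(src, dst, labels_src, labels_dst, pop=80, gens=60, seed=0):
    """Tiny genetic algorithm over (tx, ty, yaw). Fitness rewards geometric
    proximity only between points with matching semantic labels."""
    rng = np.random.default_rng(seed)

    def fitness(p):
        tx, ty, th = p
        R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
        moved = src @ R.T + np.array([tx, ty])
        score = 0.0
        for m, lab in zip(moved, labels_src):
            same = dst[labels_dst == lab]          # only compare same-label points
            if len(same):
                score -= np.min(np.linalg.norm(same - m, axis=1))
        return score

    popu = rng.uniform([-5, -5, -np.pi], [5, 5, np.pi], size=(pop, 3))
    for g in range(gens):
        fit = np.array([fitness(p) for p in popu])
        elite = popu[np.argsort(fit)[-pop // 4:]]              # keep the best quarter
        step = 0.5 * (1.0 - g / gens)                          # adaptive mutation step
        children = elite[rng.integers(len(elite), size=pop - len(elite))] \
                   + rng.normal(scale=step, size=(pop - len(elite), 3))
        popu = np.vstack([elite, children])
    return popu[np.argmax([fitness(p) for p in popu])]

# toy maps: dst is a labelled 2D point set, src is the same set seen from another pose
rng = np.random.default_rng(1)
dst = rng.uniform(0, 10, size=(40, 2)); labels = rng.integers(0, 3, size=40)
th = 0.4; R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
src = (dst - np.array([1.0, 2.0])) @ R          # so that dst = src @ R.T + [1, 2]
print(register_ga(src, dst, labels, labels))    # should roughly recover [1, 2, 0.4]
```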
|
|
15:15-15:20, Paper ThCT18.4 | |
Mag-Match: Magnetic Vector Field Features for Map Matching and Registration |
|
McDonald, William | University of Technology, Sydney |
Le Gentil, Cedric | University of Toronto |
Vidal-Calleja, Teresa A. | University of Technology Sydney |
Wakulicz, Jennifer | University of Technology Sydney, Robotics Institute |
Keywords: Mapping, Probabilistic Inference, Localization
Abstract: Map matching and registration are essential tasks in robotics for localisation and integration of multi-session or multi-robot data. Traditional methods rely on cameras or LiDARs to capture visual or geometric information but struggle in challenging conditions like smoke or dust. Magnetometers, on the other hand, detect magnetic fields, revealing features invisible to other sensors and remaining robust in such environments. In this paper, we introduce Mag-Match, a novel method for extracting and describing features in 3D magnetic vector field maps to register different maps of the same area. Our feature descriptor, based on higher-order derivatives of magnetic field maps, is invariant to global orientation, eliminating the need for gravity-aligned mapping. To obtain these higher-order derivatives map-wide given point-wise magnetometer data, we leverage a physics-informed Gaussian process to perform efficient and recursive probabilistic inference of both the magnetic field and its derivatives. We evaluate Mag-Match in simulated and real-world experiments against a SIFT-based approach, demonstrating accurate map-to-map, robot-to-map, and robot-to-robot transformations—even without initial gravitational alignment.
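For orientation, plain Gaussian-process regression over positions is shown below for a single scalar field component; the physics-informed kernel and the recursive derivative inference used in Mag-Match are not reproduced, and the hyperparameters and toy data are assumptions:

```python
import numpy as np

def gp_predict(X_tr, y_tr, X_te, length=1.0, sig_f=1.0, sig_n=0.05):
    """Standard GP regression with an RBF kernel: posterior mean and covariance."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sig_f ** 2 * np.exp(-0.5 * d2 / length ** 2)
    K = k(X_tr, X_tr) + sig_n ** 2 * np.eye(len(X_tr))
    Ks = k(X_te, X_tr)
    mean = Ks @ np.linalg.solve(K, y_tr)
    cov = k(X_te, X_te) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

# toy scalar field sampled along a short trajectory
rng = np.random.default_rng(0)
X_tr = rng.uniform(0, 5, size=(30, 2))
y_tr = np.sin(X_tr[:, 0]) + 0.1 * X_tr[:, 1] + rng.normal(scale=0.05, size=30)
mean, cov = gp_predict(X_tr, y_tr, np.array([[2.5, 1.0]]))
print(mean, np.sqrt(np.diag(cov)))
```

A derivative-aware kernel (as the paper uses) would replace the RBF kernel so that gradients and higher-order derivatives of the field come out of the same posterior, rather than being estimated numerically.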
|
|
15:20-15:25, Paper ThCT18.5 | |
LI-GS: Gaussian Splatting with LiDAR Incorporated for Accurate Large-Scale Reconstruction |
|
Jiang, Changjian | Zhejiang University |
Gao, Ruilan | Zhejiang University |
Shao, Kele | Zhejiang University |
Wang, Yue | Zhejiang University |
Xiong, Rong | Zhejiang University |
Zhang, Yu | Zhejiang University |
Keywords: Mapping, Range Sensing, Computer Vision for Automation
Abstract: Large-scale 3D reconstruction is critical in the field of robotics, and the potential of 3D Gaussian Splatting (3DGS) for achieving accurate object-level reconstruction has been demonstrated. However, ensuring geometric accuracy in outdoor and unbounded scenes remains a significant challenge. This study introduces LI-GS, a reconstruction system that incorporates LiDAR and Gaussian Splatting to enhance geometric accuracy in large-scale scenes. 2D Gaussian surfels are employed as the map representation to enhance surface alignment. Additionally, a novel modeling method is proposed to convert LiDAR point clouds to plane-constrained multimodal Gaussian Mixture Models (GMMs). The GMMs are utilized during both initialization and optimization stages to ensure sufficient and continuous supervision over the entire scene while mitigating the risk of over-fitting. Furthermore, GMMs are employed in mesh extraction to eliminate artifacts and improve the overall geometric quality. Experiments demonstrate that our method outperforms state-of-the-art methods in large-scale 3D reconstruction, achieving higher accuracy compared to both LiDAR-based methods and Gaussian-based methods with improvements of 52.6% and 68.7%, respectively.
|
|
15:25-15:30, Paper ThCT18.6 | |
GS-SDF: LiDAR-Augmented Gaussian Splatting and Neural SDF for Geometrically Consistent Rendering and Reconstruction |
|
Liu, Jianheng | The University of Hong Kong |
Wan, YunFei | The University of Hong Kong |
Wang, Bowen | University of Hong Kong |
Zheng, Chunran | The University of Hong Kong |
Lin, Jiarong | The University of Hong Kong |
Zhang, Fu | University of Hong Kong |
Keywords: Mapping, Range Sensing, RGB-D Perception
Abstract: Digital twins are fundamental to the development of autonomous driving and embodied artificial intelligence. However, achieving high-granularity surface reconstruction and high-fidelity rendering remains a challenge. Gaussian splatting offers efficient photorealistic rendering but struggles with geometric inconsistencies due to fragmented primitives and sparse observational data in robotics applications. Existing regularization methods, which rely on render-derived constraints, often fail in complex environments. Moreover, effectively integrating sparse LiDAR data with Gaussian splatting remains challenging. We propose a unified LiDAR-visual system that synergizes Gaussian splatting with a neural signed distance field. The accurate LiDAR point clouds enable a trained neural signed distance field to offer a manifold geometry field. This motivates us to offer an SDF-based Gaussian initialization for physically grounded primitive placement and a comprehensive geometric regularization for geometrically consistent rendering and reconstruction. Experiments demonstrate superior reconstruction accuracy and rendering quality across diverse trajectories. To benefit the community, the codes are released at https://github.com/hku-mars/GS-SDF.
|
|
15:30-15:35, Paper ThCT18.7 | |
Mesh-Learner: Texturing Mesh with Spherical Harmonics |
|
Wan, YunFei | The University of Hong Kong |
Liu, Jianheng | The University of Hong Kong |
Zheng, Chunran | The University of Hong Kong |
Lin, Jiarong | The University of Hong Kong |
Zhang, Fu | University of Hong Kong |
Keywords: Mapping, Range Sensing
Abstract: In this work, we present a 3D reconstruction and rendering framework, termed Mesh-Learner, that is natively compatible with a traditional rasterization pipeline. This framework integrates mesh and spherical harmonic (SH) texture (i.e., texture filled with SH coefficients) into the learning process to learn the view-dependent radiance of each mesh in an end-to-end manner. In the inference process of Mesh-Learner, images are created by interpolating the surrounding SH texels at each pixel’s sampling point using a novel interpolation method. During the backward process, gradients from each pixel are back-propagated to the related SH texels in SH textures. The whole framework is implemented on OpenGL, exploiting its graphic features (texture sampling, deferred rendering) to render. This makes Mesh-Learner naturally compatible with modern tools (e.g., Blender) and tasks (e.g., 3D reconstruction, scene rendering, reinforcement learning for robotics) that are based on the rasterization pipeline. Our system can train vast, unlimited scenes because we transfer only the SH textures within the frustum to the GPU for training. At other times, the SH textures are stored in CPU RAM, which results in moderate GPU memory usage. The rendering results on interpolation and extrapolation sequences in the Replica and FAST-LIVO2 datasets achieve state-of-the-art performance compared to existing state-of-the-art methods (e.g., 3DGS and M2-Mapping). To benefit the community, the code will be available at https://github.com/hku-mars/Mesh-Learner.
|
|
15:35-15:40, Paper ThCT18.8 | |
2.5D Object Mapping Using Gaussian Processes for Robot Navigation |
|
Toraman, Erdem | Middle East Technical University |
Kumru, Murat | Volvo Group Trucks Technology |
Özkan, Emre | Middle East Technical University |
Keywords: Mapping, Reactive and Sensor-Based Planning, Probabilistic Inference
Abstract: Mapping and planning are fundamental to robotic navigation in unknown environments. This work introduces a probabilistic framework that combines Gaussian processes (GPs) for 2.5D object modeling with an informative motion planner, using LiDAR-based measurements. The mapping approach employs a flexible, nonparametric representation to process 3D point cloud data to create compact volumetric representations through contours and heights, enabling robust shape estimation even from sparse data. Building on this, the GP representation-based informative motion planner incorporates information gain into the dynamic window approach (DWA) to enhance navigation performance. Simulations validate the framework by comparing its mapping accuracy with OctoMap and elevation map, and its planning efficiency with a baseline DWA.
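A rough sketch of folding an information-gain term into dynamic-window-style velocity scoring, as the abstract describes; the unicycle rollout, weights, and dummy gain field are illustrative assumptions rather than the paper's formulation:

```python
import numpy as np

def score_velocity(v, w, pose, goal, info_gain, dt=0.2, horizon=10,
                   w_head=1.0, w_speed=0.5, w_info=0.8):
    """Score one (v, w) candidate as in the dynamic window approach, with an
    added information-gain term evaluated at the rollout endpoint."""
    x, y, th = pose
    for _ in range(horizon):                       # forward-simulate a unicycle model
        x += v * np.cos(th) * dt
        y += v * np.sin(th) * dt
        th += w * dt
    heading = -np.hypot(goal[0] - x, goal[1] - y)  # closer to the goal scores higher
    return w_head * heading + w_speed * v + w_info * info_gain(x, y)

# dummy information-gain field peaking near an unexplored object at (2, 1)
info = lambda x, y: np.exp(-((x - 2.0) ** 2 + (y - 1.0) ** 2))
cands = [(v, w) for v in np.linspace(0.1, 1.0, 5) for w in np.linspace(-1.0, 1.0, 9)]
best_v, best_w = max(cands, key=lambda c: score_velocity(*c, (0.0, 0.0, 0.0), (3.0, 0.0), info))
print(best_v, best_w)
```

In the paper the gain term would come from the GP object model (e.g., expected reduction of shape uncertainty), and obstacle clearance terms from standard DWA would also be included.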
|
|
ThCT19 |
210C |
Aerial Systems: Applications 2 |
Regular Session |
Co-Chair: Yu, Yushu | Beijing Institute of Technology |
|
15:00-15:05, Paper ThCT19.1 | |
Efficiently Kinematic-Constraint-Coupled State Estimation for Integrated Aerial Platforms in GPS-Denied Environments |
|
Lai, Ganghua | Beijing Institute of Technology |
Yu, Yushu | Beijing Institute of Technology |
Sun, Fuchun | Tsinghua University |
Qi, Jing | Hebei University |
Lippiello, Vincenzo | University of Naples FEDERICO II |
Keywords: Aerial Systems: Applications, Localization, Multi-Robot SLAM
Abstract: Small-scale unmanned aerial vehicles (UAVs) are widely used in various fields. However, their underactuated design limits their ability to perform complex tasks that require physical interaction with environments. The fully-actuated Integrated Aerial Platforms (IAPs), where multiple UAVs are connected to a central platform via passive joints, offer a promising solution. However, achieving accurate state estimation for IAPs in GPS-denied environments remains a significant hurdle. In this paper, we introduce a centralized localization and state estimation framework for IAPs with a fusion of odometry and kinematics, using only onboard cameras and inertial measurement units (IMUs). We develop a forward-kinematic-based formulation to fully leverage localization information from kinematic constraints. An online calibration method for kinematic parameters is proposed to enhance state estimation accuracy with forward kinematics. Additionally, we perform an observability analysis, theoretically proving that these kinematic parameters are fully observable under conditions of fully excited motion. Our approach is validated on our collected dataset and real-world experiments using a three-agent IAP prototype. Results demonstrate that the proposed method significantly improves relative localization accuracy and reduces global localization drift compared to the baseline.
|
|
15:05-15:10, Paper ThCT19.2 | |
Active Iterative Optimization for Aerial Visual Reconstruction of Wide-Area Natural Environment |
|
Wang, Hongpeng | Nankai University |
Cao, Zhongzhi | Nankai University |
Zhang, Wenhao | Nankai University |
Fei, Yue | Nankai University |
Wang, Peizhao | Nankai University |
Li, Yaojing | Nankai University |
Sun, Chuanyu | Nankai University |
He, Ming | Nankai University |
Han, Jianda | Nankai University |
Keywords: Aerial Systems: Applications, Computer Vision for Other Robotic Applications, Environment Monitoring and Management, 3D Reconstruction Based on Aerial Vision
Abstract: Autonomous, accurate, and dynamic 3-D reconstruction of wide-area environments is crucial for unmanned aerial vehicle (UAV) monitoring and rescue tasks; however, when conducted over unknown complex terrain, the reconstruction obtained from a single flight suffers from poor quality. In this paper, we present an Active Iterative Optimization (AIO) framework for trajectory planning and visual reconstruction. First, the trajectory is planned under photogrammetric constraints based on the rough terrain. To account for the visual field deviation caused by pose error during actual flight, a view loss evaluation is established and keyframes are selected to conduct 3-D reconstruction. A comprehensive metric is designed to quantitatively evaluate the reconstruction quality without ground truth. The point cloud is then rasterized and divided into normal and low-scoring regions according to the evaluation metric. In the next iteration, the trajectory is replanned in the low-scoring regions to purposefully optimize the point cloud of the local area. Thus, the reconstruction result can be iteratively optimized. We validate the effectiveness of the proposed framework in simulation and physical experiments.
|
|
15:10-15:15, Paper ThCT19.3 | |
Versatile Tasks on Integrated Aerial Platforms Using Only Onboard Sensors: Control Framework, Visual Odometry Kinematics Fusion, and Experimental Validation |
|
Wang, Kaidi | Beijing Institute of Technology |
Lai, Ganghua | Beijing Institute of Technology |
Yu, Yushu | Beijing Institute of Technology |
Du, Jianrui | Beijing Institute of Technology |
Sun, Jiali | Beijing Institute of Technology |
Xu, Bin | Beijing Institute of Technology |
Franchi, Antonio | University of Twente / Sapienza University of Rome |
Sun, Fuchun | Tsinghua University |
Keywords: Aerial Systems: Applications, Aerial Manipulation, Localization, Aerial Systems: Perception and Autonomy
Abstract: Connecting multiple aerial vehicles to a rigid central platform through passive spherical joints holds the potential to construct a fully-actuated aerial platform. The integration of multiple vehicles enhances efficiency in tasks like mapping and object reconnaissance. This paper proposes a control and state estimation framework for the Integrated Aerial Platform (IAP), enabling it to perform versatile tasks like object reconnaissance and physical interactive tasks with only onboard sensors. In the framework, the 6D motion control serves as the low-level controller, while the high-level controller comprises a 6D admittance filter and a perception-aware attitude correction module. The 6D admittance filter, serving as the interaction controller, is adaptable for aerial interaction tasks. The perception-aware attitude correction algorithm is carefully designed by adopting a geometric Model Predictive Controller (MPC). This algorithm, incorporating both offline and online calculations, proves to be well-suited for the intricate dynamics of an IAP. A 6D direct wrench controller is also developed for the IAP. Notably, both the interaction controller and the direct wrench controller opera
|
|
15:15-15:20, Paper ThCT19.4 | |
Learnable Cost Metric-Based Multi-View Stereo for Point Cloud Reconstruction (I) |
|
Yang, Guidong | The Chinese University of Hong Kong |
Zhou, Xunkuai | Tongji University |
Gao, Chuanxiang | The Chinese University of Hong Kong |
Chen, Xi | The Chinese University of Hong Kong |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy
Abstract: 3D reconstruction is essential to defect localization. This article proposes LCM-MVSNet, a novel multi-view stereo (MVS) network with a learnable cost metric (LCM) for more accurate and complete dense point cloud reconstruction. To adapt to the scene variation and improve the reconstruction quality in non-Lambertian low-textured scenes, we propose LCM to adaptively aggregate multi-view matching similarity into the 3D cost volume by leveraging sparse point hints. The proposed LCM benefits the MVS approaches in four folds, including depth estimation enhancement, reconstruction quality improvement, memory footprint reduction, and computational burden alleviation, allowing the depth inference for high-resolution images to achieve more accurate and complete reconstruction. In addition, we improve the depth estimation by enhancing the shallow feature propagation via a bottom–up pathway and strengthen the end-to-end supervision by adapting the focal loss to reduce ambiguity caused by sample imbalance. Extensive experiments on three benchmark datasets show that our method achieves state-of-the-art performance on the DTU and BlendedMVS datasets, and exhibits strong generalization ability with a competitive performance on the Tanks and Temples benchmark. Furthermore, we deploy our LCM-MVSNet into our UAV-based infrastructure defect inspection framework for infrastructure reconstruction and defect localization, demonstrating the effectiveness and efficiency of our method.
|
|
15:20-15:25, Paper ThCT19.5 | |
Manipulating Magnetic Field of the Magnetic Gripper with Charging Feature for Drones on Energized Power Lines (I) |
|
Duong Hoang, Viet | University of Southern Denmark |
Ebeid, Emad | University of Southern Denmark |
Keywords: Aerial Systems: Applications, Grippers and Other End-Effectors, Robotics in Hazardous Fields
Abstract: Using drones to inspect overhead transmission lines has progressively gained popularity due to safety reasons, ease of deployment, and reasonable costs compared to the traditional approach of using helicopters. However, the short flight time is a challenging problem that deters drones from being used in long-range inspection missions. Recharging the battery from the magnetic field around the cable is a promising solution that enables drones to operate automatically and endlessly without human intervention. In this article, a magnetic gripper was proposed to take advantage of the magnetic field from the line to hold the drone on the power line and recharge the battery at the same time. A magnetic manipulating circuit with different operating modes was developed to maintain the grip regardless of the power line’s current level. The system was tested in the lab and on a drone with an energized transmission line.
|
|
15:25-15:30, Paper ThCT19.6 | |
WuKong: Design, Modeling and Control of a Compact Flexible Hybrid Aerial-Aquatic Vehicle |
|
Liu, Yufan | Guangdong University of Technology |
Li, Cheng | Guangdong University of Technology |
Li, Junjie | Guangdong University of Technology |
Lin, Zemin | Guangdong University of Technology |
Meng, Wei | Guangdong University of Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Aerial Systems: Mechanics and Control, Marine Robotics, Aerial Systems: Applications
Abstract: The significant differences in the physical properties of air and water pose a substantial challenge for the development of hybrid aerial-aquatic vehicles (HAAVs), leading to increased prototype size, heavier thrusters, and reduced efficiency or under-actuation in one of the media. This letter introduces "WuKong," a HAAV with a simple structure, compact size, and high maneuverability. WuKong has a minimum width of 185 mm and a thrust-to-weight ratio of 2.48. Its operational range spans 20 meters in the air and 3 meters underwater, enabling seamless cross-domain maneuvers. These capabilities are attributed to the selection and arrangement of the propulsion system and the proposed cross-domain hybrid control framework, along with the incremental nonlinear dynamic inversion (INDI)-based underwater attitude control law. Through simulations and field experiments, WuKong's cross-domain transition capability and underwater maneuverability are demonstrated. Our development provides a practical solution for HAAV applications and offers a platform for subsequent HAAV collaborative tasks. The corresponding demonstration video can be found at https://youtu.be/Zei6QCtwNn8.
|
|
15:30-15:35, Paper ThCT19.7 | |
Automated Layout and Control Co-Design of Robust Multi-UAV Transportation Systems |
|
Bosio, Carlo | University of California, Berkeley |
Mueller, Mark Wilfried | University of California, Berkeley |
Keywords: Aerial Systems: Mechanics and Control, Optimization and Optimal Control, Robust/Adaptive Control
Abstract: The joint optimization of physical parameters and controllers in robotic systems is challenging. This is due to the difficulties of predicting the effect that changes in physical parameters have on final performance. At the same time, physical and morphological modifications can improve robot capabilities, perhaps completely unlocking new skills and tasks. We present a novel approach to co-optimize the physical layout and the control of a cooperative aerial transportation system. The goal is to achieve the most precise and robust flight when carrying a payload. We assume the agents are connected to the payload through rigid attachments, essentially transforming the whole system into a larger flying object with "thrust modules" at the attachment locations of the quadcopters. We investigate the optimal arrangement of the thrust modules around the payload, so that the resulting system achieves the best disturbance rejection capabilities. We propose a novel robustness metric inspired by H_2 control and an algorithm to jointly optimize the layout of the vehicles around the object and their controller. We experimentally validate the effectiveness of our approach using fleets of three and four quadcopters and payloads of diverse shapes.
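For readers unfamiliar with the robustness metric mentioned above, the sketch below (an illustration under our own assumptions, not the authors' implementation) shows how the classical H_2 norm of a stable linear system dx = Ax + Bw, z = Cx is computed from the controllability Gramian; a layout optimizer could score candidate thrust-module placements with a metric of this kind.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    def h2_norm(A, B, C):
        # Controllability Gramian P solves A P + P A^T + B B^T = 0.
        P = solve_continuous_lyapunov(A, -B @ B.T)
        # Squared H2 norm is trace(C P C^T) for a stable A.
        return float(np.sqrt(np.trace(C @ P @ C.T)))

Here A and B would depend on the candidate layout and C selects the performance outputs; a smaller value indicates better rejection of the disturbance w.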
|
|
15:35-15:40, Paper ThCT19.8 | |
Flexible Affine Formation Control Based on Dynamic Hierarchical Reorganization |
|
Li, Yuzhu | Shanghai Jiao Tong University |
Dong, Wei | Shanghai Jiao Tong University |
Keywords: Aerial Systems: Mechanics and Control, Multi-Robot Systems, Swarm Robotics
Abstract: Current formations commonly rely on invariant hierarchical structures, such as predetermined leaders or enumerated formation shapes. These structures could be unidirectional and sluggish, constraining their flexibility and agility when encountering cluttered environments. To surmount these constraints, this work proposes a dynamic hierarchical reorganization approach with affine formation. Central to our approach is the fluid leadership and authority redistribution, for which we develop a minimum time-driven leadership evaluation algorithm and a power transition control algorithm. These algorithms facilitate autonomous leader selection and ensure smooth power transitions, enabling the swarm to adapt hierarchically in alignment with the external environment. Extensive simulations and real-world experiments validate the effectiveness of the proposed method. The formation of five aerial robots successfully performs dynamic hierarchical reorganizations, enabling the execution of complex tasks such as swerving maneuvers and navigating through hoops at velocities of up to 1.05 m/s. Comparative experimental results further demonstrate the significant advantages of hierarchical reorganization in enhancing formation flexibility and agility, particularly during complex maneuvers such as U-turns. Notably, in the aforementioned real-world experiments, the proposed method reduces the flight path length by at least 33.8% compared to formations without hierarchical reorganization.
|
|
ThCT20 |
210D |
Perception for Grasping and Manipulation 3 |
Regular Session |
|
15:00-15:05, Paper ThCT20.1 | |
Using Fiber Optic Bundles to Miniaturize Vision-Based Tactile Sensors |
|
Di, Julia | Stanford University |
Dugonjic, Zdravko | Technische Universität Dresden |
Fu, Jiaxiang | No Affiliation |
Wu, Tingfan | Meta AI |
M, Romeo | Retired from Meta Technologies |
Sawyer, Kevin | Meta |
Most, Victoria Rose | Facebook |
Kammerer, Gregg | Facebook |
Speidel, Stefanie | National Center for Tumor Diseases |
Fan, Richard | UCLA |
Sonn, Geoffrey | Stanford University |
Cutkosky, Mark | Stanford University |
Lambeta, Mike Maroje | Facebook |
Calandra, Roberto | TU Dresden |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, Fiber Optics, Deep Learning in Robotics and Automation
Abstract: Vision-based tactile sensors have recently become popular due to their combination of low cost, very high spatial resolution, and ease of integration using widely available miniature cameras. The associated field of view and focal length, however, are difficult to package in a human-sized finger. In this paper, we employ optical fiber bundles to achieve a form factor that, at 15 mm diameter, is smaller than an average human fingertip. The electronics and camera are also located remotely, further reducing package size. The sensor achieves a spatial resolution of 0.22 mm and a minimum force resolution of 5 mN for normal and shear contact forces. With these attributes, the DIGIT Pinki sensor is suitable for applications such as robotic and teleoperated digital palpation. We demonstrate its utility for palpation of the prostate gland and show that it can achieve clinically relevant discrimination of prostate stiffness for phantom and ex vivo tissue.
|
|
15:05-15:10, Paper ThCT20.2 | |
Generalizable and Actionable Part Detection and Manipulation with SAM-Rectified Segmentation and Iterative Pose Refinement |
|
Qian, Sucheng | Shanghai Jiao Tong University |
Zhang, Li | University of Science and Technology of China |
Wei, Yanyan | Hefei University of Technology |
Liu, Liu | Hefei University of Technology |
Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Perception for Grasping and Manipulation, RGB-D Perception, Visual Learning
Abstract: The ability to perform cross-category object perception and manipulation is highly desirable in building intelligent robots. One promising approach is to define the concept of Generalizable and Actionable Parts (GAParts), such as buttons and handles, on both seen and unseen object categories. However, the accurate cross-category perception of GAParts is still challenging due to the large inter-category object shape variations. To address this issue, we introduce SAMIR, a novel framework using SAM-rectified segmentation and iterative pose refinement for GAPart detection and manipulation. Firstly, we introduce a Segment Anything (SAM) segmentation prior to rectify the unconfident, fragmented GAPart instance proposals. Secondly, in addition to the zero-shot generalization of the SAM foundation model, we further finetune it with a lightweight adaptor model on our task dataset. Finally, we propose an iterative pose refinement procedure that improves the accuracy of GAPart pose estimation. Our perception experiments on GAPartNet dataset show that SAMIR consistently outperforms the baseline method on instance segmentation and pose estimation tasks. Our manipulation experiments in Sapien simulator illustrate that SAMIR leads to an improved manipulation success rate. We also deploy our method to a real robot for real-world manipulation. Our code and video are available at sites.google.com/view/samir-gapart.
|
|
15:10-15:15, Paper ThCT20.3 | |
Consensus-Driven Uncertainty for Robotic Grasping Based on RGB Perception |
|
Joyce, Eric C. | Stevens Institute of Technology |
Zhao, Qianwen | Stevens Institute of Technology |
Burgdorfer, Nathaniel | Stevens Institute of Technology |
Wang, Long | Stevens Institute of Technology |
Mordohai, Philippos | Stevens Institute of Technology |
Keywords: Perception for Grasping and Manipulation, Computer Vision for Automation
Abstract: Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the downstream task of robotic grasping. We propose a method for training lightweight, deep networks to predict whether a grasp guided by an image-based pose estimate will succeed before that grasp is attempted. We generate training data for our networks via object pose estimation on real images and simulated grasping. We also find that, despite high object variability in grasping trials, networks benefit from training on all objects jointly, suggesting that a diverse variety of objects can nevertheless contribute to the same goal. Data, code, and guides are hosted at: https://github.com/EricCJoyce/Consensus-Driven-Uncertainty/
|
|
15:15-15:20, Paper ThCT20.4 | |
Learning to Hang Crumpled Garments with Confidence-Guided Grasping and Active Perception |
|
Huo, Shengzeng | The Hong Kong Polytechnic University |
Zhang, He | The Tencent Robotics X |
Lee, Hoi-Yin | The Hong Kong Polytechnic University |
Zhou, Peng | Great Bay University |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Perception for Grasping and Manipulation, Bimanual Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Accurately recognizing the structural regions of targeted objects is crucial for successful manipulation. In this study, we concentrate on the task of hanging crumpled garments on a rack, a common scenario in household environments. This context presents two primary challenges: (1) perceiving and grasping the structural regions of garments that exhibit severe deformations and self-occlusions; (2) adjusting the configuration of garments to fit the supporting components of the rack. To address these challenges, we propose a confidence-guided grasping strategy that actively seeks garment collars through handovers between dual robotic arms. In particular, we develop an autonomous data collection procedure in real-world settings to train the collar detection network. The exact grasping pose is determined through depth-aware contour extraction, and its success is evaluated based on a specially designed metric. Furthermore, we formulate the hanging task as one-shot imitation learning with an egocentric view. To precisely align the collar with the supporting item, we propose a two-step hanging strategy that involves coarse approaching followed by fine transformation. We perform comprehensive experiments and show that our framework notably enhances the success rate compared to existing methods.
|
|
15:20-15:25, Paper ThCT20.5 | |
HGDiffuser: Efficient Task-Oriented Grasp Generation Via Human-Guided Grasp Diffusion Models |
|
Huang, Dehao | Southern University of Science and Technology |
Dong, Wenlong | Southern University of Science and Technology |
Tang, Chao | Southern University of Science and Technology |
Zhang, Hong | SUSTech |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Task-oriented grasping (TOG) is essential for robots to perform manipulation tasks, requiring grasps that are both stable and compliant with task-specific constraints. Humans naturally grasp objects in a task-oriented manner to facilitate subsequent manipulation tasks. By leveraging human grasp demonstrations, current methods can generate high-quality robotic parallel-jaw task-oriented grasps for diverse objects and tasks. However, they still encounter challenges in maintaining grasp stability and sampling efficiency. These methods typically rely on a two-stage process: first performing exhaustive task-agnostic grasp sampling in the 6-DoF space, then applying demonstration-induced constraints (e.g., contact regions and wrist orientations) to filter candidates. This leads to inefficiency and potential failure due to the vast sampling space. To address this, we propose the Human-guided Grasp Diffuser (HGDiffuser), a diffusion-based framework that integrates these constraints into a guided sampling process. Through this approach, HGDiffuser directly generates task-compliant 6-DoF grasps in a single stage, eliminating exhaustive task-agnostic sampling. Furthermore, by incorporating Diffusion Transformer (DiT) blocks as the feature backbone, HGDiffuser improves grasp generation quality compared to MLP-based methods. Experimental results demonstrate that our approach significantly improves the efficiency of task-oriented grasp generation, enabling more effective transfer of human grasping strategies to robotic systems. To access the source code and supplementary videos, visit https://sites.google.com/view/hgdiffuser.
|
|
15:25-15:30, Paper ThCT20.6 | |
RoboCAP: Robotic Classification and Precision Pouring of Diverse Liquids and Granular Media with Capacitive Sensing |
|
Hu, Yexin | Carnegie Mellon University |
Gillespie, Alexandra | Colby College |
Padmanabha, Akhil | Carnegie Mellon University |
Puthuveetil, Kavya | Carnegie Mellon University |
Lewis, Wesley | Carnegie Mellon University |
Khokar, Karan | Carnegie Mellon University |
Erickson, Zackory | Carnegie Mellon University |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Grippers and Other End-Effectors
Abstract: Liquids and granular media (e.g., oats, rice, lentils) are pervasive throughout human environments, yet remain challenging for robots to sense and manipulate precisely. In this work, we present a systematic approach to integrating capacitive sensing within robotic end effectors, enabling robust sensing and precise manipulation of liquids and granular media. We introduce the parallel-jaw RoboCAP Gripper with embedded capacitive sensing arrays that enable a robot to directly sense the materials and dynamics of liquids inside diverse containers. Our system achieves 82.8% classification accuracy across 81 container–substance combinations, and enables a robotic manipulator to perform precision pouring with a mean error of 3.2g over 200 trials. Code, designs, and build details are available on the project website.
|
|
15:30-15:35, Paper ThCT20.7 | |
Zero-Shot Temporal Interaction Localization for Egocentric Videos |
|
Zhang, Erhang | SHANDONG University |
Ma, Junyi | Beijing Institute of Technology |
Zheng, Yin-Dong | Shanghai Jiao Tong University |
Zhou, Yixuan | Shanghai Jiao Tong University |
Xu, Fan | Shanghai Jiao Tong University |
Keywords: Perception for Grasping and Manipulation, Human and Humanoid Motion Analysis and Synthesis, Semantic Scene Understanding
Abstract: Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.
|
|
ThCT21 |
101 |
Machine Learning for Robot Control 3 |
Regular Session |
|
15:00-15:05, Paper ThCT21.1 | |
ColaDex: Contact-Guided Optimization and VLM-Assisted Selection for Task-Oriented Dexterous Grasp Generation |
|
Ma, Yiyao | The Chinese University of Hong Kong |
Chen, Kai | The Chinese University of Hong Kong |
Xu, Xuecheng | Zhejiang University |
Zhou, Zhongxiang | Zhejiang University |
Xie, Liang | Zhejiang University |
Xiong, Rong | Zhejiang University |
Dou, Qi | The Chinese University of Hong Kong |
Keywords: AI-Enabled Robotics, Multifingered Hands
Abstract: Task-oriented dexterous grasp generation aims to generate stable and functional grasps that enable a robotic hand to effectively interact with objects to accomplish specific tasks. However, generating high-dimensional hand configurations that seamlessly adapt to diverse task requirements and object geometries remains a significant challenge. In this paper, we propose a novel pipeline called ColaDex to address this challenging problem. The core idea of ColaDex is to leverage vision-language models (VLMs) to select the dexterous grasp from a set of candidates that aligns well with the task description. To this end, we first introduce a contact-guided optimization method to generate a set of high-quality grasp candidates around the object through analytical optimization. Subsequently, to effectively prompt VLMs with the numerous sampled grasp candidates, we propose an object-centric approach that adaptively represents a group of candidates as prototypical contact maps, learned based on the geometric relationships between the grasping hand and object shape. We then feed the task requirement and the generated prototypical contact maps into the VLM, enabling it to reason about grasp-object interactions and assess their alignment with the given task, ultimately selecting the grasp that best aligns with the task requirement. Extensive experiments demonstrate that our prototypical contact map is a more informative prompting mechanism than conventional RGB images, enabling ColaDex to consistently generate high-quality task-oriented grasps and achieve a high success rate across diverse objects and tasks.
|
|
15:05-15:10, Paper ThCT21.2 | |
Simulation-Aided Policy Tuning for Black-Box Robot Learning |
|
He, Shiming | Hangzhou City University |
von Rohr, Alexander | Technical University of Munich |
Baumann, Dominik | Aalto University |
Xiang, Ji | Zhejiang University |
Trimpe, Sebastian | RWTH Aachen University |
Keywords: Learning and Adaptive Systems, Probability and Statistical Methods, Optimization and Optimal Control, Bayesian optimization
Abstract: How can robots learn and adapt to new tasks and situations with little data? Systematic exploration and simulation are crucial tools for efficient robot learning. We present a novel black-box policy search algorithm focused on data-efficient policy improvements. The algorithm learns directly on the robot and treats simulation as an additional information source to speed up the learning process. At the core of the algorithm, a probabilistic model learns the dependence of the policy parameters and the robot learning objective not only by performing experiments on the robot, but also by leveraging data from a simulator. This substantially reduces interaction time with the robot. Using this model, we can guarantee improvements with high probability for each policy update, thereby facilitating fast, goal-oriented learning. We evaluate our algorithm on simulated fine-tuning tasks and demonstrate the data-efficiency of the proposed dual-information source optimization algorithm. In a real robot learning experiment, we show fast and successful task learning on a robot manipulator with the aid of an imperfect simulator.
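As a rough illustration of the dual-information-source idea (a minimal sketch under assumed names such as simulate(), not the authors' algorithm), simulator rollouts can serve as a prior for a Gaussian process that is fitted only to the sim-to-real residual observed on the robot; candidate policy parameters are then ranked by the corrected prediction:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def next_policy_params(theta_real, J_real, simulate, candidates):
        # Residual between real returns and simulated returns at the same parameters.
        residual = J_real - np.array([simulate(t) for t in theta_real])
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5))
        gp.fit(theta_real, residual)
        # Predicted real return = simulated return + learned sim-to-real correction.
        J_pred = np.array([simulate(c) for c in candidates]) + gp.predict(candidates)
        return candidates[int(np.argmax(J_pred))]

Because only the residual is learned from robot data, few real experiments are needed, which is the spirit of the data-efficiency claim above.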
|
|
15:10-15:15, Paper ThCT21.3 | |
Safety-Compliant Navigation: Navigation Point-Guided Planning with Primitive Trajectories |
|
Deng, Zixuan | Southwest Petroleum University |
Xiang, Yanping | University of Electronic Science and Technology of China |
Keywords: Learning from Demonstration, Motion and Path Planning
Abstract: In learning from demonstrations (LfD) for trajectory planning, end-to-end deep learning (DL) methods offer fast inference and adaptability to complex inputs. However, they are prone to cumulative errors due to limited expert time-series data, which poses challenges in safety-critical applications. To address this, we introduce bounded discontinuities in trajectory planning, with the bound adaptively determined via binary search. Two generative networks, trained in opposite directions, produce primitive trajectories. These are connected using the discontinuity-allowed multi-point RRT-connect (DAMP-RRT-connect) algorithm, which expands the trajectory while maintaining discontinuities within the bound. A sequence of navigation points directs the expansion. Experiments on aircraft landing and takeoff tasks at a non-towered airport demonstrate the robustness and efficiency of our approach.
|
|
15:15-15:20, Paper ThCT21.4 | |
Sensorimotor Learning with Stability Guarantees Via Autonomous Neural Dynamic Policies |
|
Totsila, Dionis | Inria Centre at Université De Lorraine |
Chatzilygeroudis, Konstantinos | University of Patras |
Modugno, Valerio | University College London |
Hadjivelichkov, Denis | University College London |
Kanoulas, Dimitrios | University College London |
Keywords: Learning from Demonstration, Sensorimotor Learning, Machine Learning for Robot Control
Abstract: State-of-the-art sensorimotor learning algorithms, either in the context of reinforcement learning or imitation learning, offer policies that can often produce unstable behaviors, damaging the robot and/or the environment. Moreover, it is very difficult to interpret the optimized controller and analyze its behavior and/or performance. Traditional robot learning, on the contrary, relies on dynamical system-based policies that can be analyzed for stability/safety. Such policies, however, are neither flexible nor generic and usually work only with proprioceptive sensor states. In this work, we bridge the gap between generic neural network policies and dynamical system-based policies, and we introduce Autonomous Neural Dynamic Policies (ANDPs) that: (a) are based on autonomous dynamical systems, (b) always produce asymptotically stable behaviors, and (c) are more flexible than traditional stable dynamical system-based policies. ANDPs are fully differentiable, flexible, generic policies that accept any observation as input while ensuring asymptotic stability. Through several experiments, we explore the flexibility and capacity of ANDPs in several imitation learning tasks including experiments with image observations. The results show that ANDPs combine the benefits of both neural network-based and dynamical system-based methods.
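To make the stability idea concrete, the toy construction below (our own illustration, not the ANDP architecture) shows the standard way such guarantees arise: a policy of the form xdot = -L(x)(x - x_goal) is asymptotically stable at x_goal whenever L(x) is positive definite, which can be enforced by letting a network output A(x) and setting L = A A^T + eps*I.

    import numpy as np

    def stable_policy(x, x_goal, A, eps=1e-2):
        L = A @ A.T + eps * np.eye(len(x))   # positive definite by construction
        return -L @ (x - x_goal)             # velocity command driving x toward x_goal

    rng = np.random.default_rng(0)
    x, x_goal, A = rng.normal(size=3), np.zeros(3), rng.normal(size=(3, 3))
    for _ in range(200):                     # simple Euler rollout
        x = x + 0.05 * stable_policy(x, x_goal, A)
    print(np.linalg.norm(x - x_goal))        # shrinks toward 0

In the paper the mapping is learned from demonstrations; here A is a fixed random matrix purely to show the convergence mechanism.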
|
|
15:20-15:25, Paper ThCT21.5 | |
Data-Bootstrapped, Physics-Informed Framework for Object Rearrangement |
|
Wong, Alex | Xi'an Jiaotong University |
Dong, Zhiwei | Huawei Technologies Ltd |
Keywords: Learning from Experience, Perception-Action Coupling
Abstract: Object rearrangement, which involves arranging objects step-by-step to achieve tidy states, is critical in robotic applications. Progress in this area is often constrained by issues such as high-cost data collection and physically infeasible trajectory prediction. To address these challenges, we propose the Data-Bootstrapped, Physics-Informed Rearrangement (DPR) framework, which leverages a transformer for sequential decision making. Specifically, DPR integrates Enhanced Data Generation with a Physics Reward Feedback Transformer. Enhanced Data Generation consists of Random Trajectory Reverse for producing high-quality training data and Bootstrapped Trajectory Synthesis, which leverages the transformer's sequence modeling to diversify training trajectories. To ensure the feasibility of the generated trajectories and to improve the transformer's performance, we incorporate a Physical Reward Feedback mechanism into the transformer. Experiments on ball and room rearrangement tasks show that DPR significantly outperforms existing methods in terms of both efficiency and effectiveness. Code will be released soon.
|
|
15:25-15:30, Paper ThCT21.6 | |
Bi-Phase Episodic Memory-Guided Deep Reinforcement Learning for Robot Skills |
|
Dong, Liu | Dalian University of Technology |
Jiang, Yuhang | Dalian University of Technology |
Du, Yu | Dalian JiaoTong University |
Wang, Zitu | Dalian University of Technology |
Zhao, Kesong | Dalian University of Technology |
Cong, Ming | Dalian University of Technology |
Keywords: Learning from Experience, Reinforcement Learning, Deep Learning in Grasping and Manipulation
Abstract: Deep reinforcement learning (DRL) has significant advantages in robot skill learning; however, it typically fails to make good use of experience. In this paper, a bi-phase episodic memory-guided (BEMG) method is proposed to accelerate the DRL learning process, address the sparse reward problem, and substantially speed up the robot's acquisition of manipulation skills through memory guidance. The method constructs relationships between robot states at different moments under the guidance of episodic memory and automatically generates reward functions through memory backtracking. Within the proposed framework, each iteration is divided into two phases. In the first phase, the state-action associations stored in memory are used to guide the action decisions of DRL, thereby reducing unnecessary exploration. In the second phase, the memory module generates reward functions based on the relationships between different states, rather than relying on manually designed reward functions. Experiments are conducted on robot manipulation skill learning tasks with different DRL algorithms. The results show that the proposed method effectively improves the learning speed of robot manipulation skills and avoids dependence on complex reward functions.
|
|
15:30-15:35, Paper ThCT21.7 | |
Learning Graph Dynamics with Interaction Effects Propagation for Deformable Linear Objects Shape Control (I) |
|
Gu, Feida | Tongji University |
Sang, HongRui | Tongji University |
Zhou, Yanmin | Tongji University |
Ma, Jiajun | Tongji University |
Jiang, Rong | Tongji University |
Wang, Zhipeng | Tongji University |
He, Bin | TongJi University, Shanghai, China |
Keywords: Manipulation Planning, Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Robotic manipulation of deformable linear objects (DLOs) has broad application prospects, e.g., in manufacturing and medical surgery. To achieve such tasks, a critical challenge is the precise control of the DLOs' shapes, which requires an accurate dynamics model for deformation prediction. However, due to the infinite dimensionality of the DLOs and the complexity of their deformation mechanism, dynamics models are hard to derive theoretically. In this paper, we represent the DLO with multiple particles uniformly distributed along it. For learning the dynamics model, we adopt a Graph Neural Network (GNN) to learn local interaction effects between neighboring particles, and use the attention mechanism to aggregate the effects of these interactions for the purpose of effect propagation along the DLO (called GA-Net). For manipulation, Model Predictive Control (MPC) coupled with the learned dynamics model is used to calculate the optimal robot movements, which can also generalize to unseen DLOs. Simulation and real-world experiments demonstrate that GA-Net shows better accuracy than existing methods, and the proposed control framework is effective for different DLOs. Specifically, for model prediction (150 steps), the prediction performance of GA-Net is 14.14% better than the strong baseline (IN-BiLSTM). Videos are available at https://parkergu.github.io/work_dlo/.
|
|
15:35-15:40, Paper ThCT21.8 | |
HeStIa: Asynchronous Embodied Dynamic Locomotion Learning for Walking Robots through Multimodal Large Language Models |
|
Tan, Xiaoyu | National University of Singapore |
Wang, Haoyu | Shanghai University of Engineering Science |
Li, Sijia | Shanghai University of Engineering Science |
Xu, Yinghui | Fudan University |
Qiu, Xihe | Shanghai University of Engineering Science |
Keywords: Machine Learning for Robot Control, Whole-Body Motion Planning and Control, Representation Learning
Abstract: The control of locomotion in walking robots with various architectural designs presents significant challenges. Existing approaches primarily rely on low-level state information and isolated visual features, lacking the high-level semantic understanding that humans use to reason about movement and posture. We propose HeStIa, a novel framework that bridges visual perception, natural language understanding, and robotic control through multimodal learning. By leveraging multimodal large language models (MLLMs), HeStIa establishes a semantic bridge between visual observations and motion control, enabling robots to understand and adjust their locomotion through both visual and linguistic modalities. Our approach extracts spatiotemporal visual features from robot movements and transforms them into a cross-modal embedding space shared with textual descriptions. HeStIa incorporates an innovative vision-language-motion fusion mechanism to provide informed, context-aware feedback during the dynamic learning process. Through an asynchronous design, HeStIa effectively mitigates the inference delays typically associated with MLLMs while maintaining real-time performance in dynamic scenarios. The cross-modal representations learned by HeStIa facilitate more intuitive and efficient locomotion learning by grounding visual observations in natural language descriptions. Our comprehensive evaluation shows substantial improvements in motion naturalness, stability, and adaptability across diverse environmental conditions.
|
|
ThCT22 |
102A |
Collision Avoidance 1 |
Regular Session |
Chair: Liu, Lu | City University of Hong Kong |
|
15:00-15:05, Paper ThCT22.1 | |
Collision-Free Control Barrier Functions for General Ellipsoids Via Separating Hyperplane |
|
Wu, Zeming | Tongji University |
Liu, Lu | City University of Hong Kong |
Keywords: Collision Avoidance, Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning
Abstract: This paper presents a novel collision avoidance method for general ellipsoids based on control barrier functions (CBFs) and separating hyperplanes. First, collision-free conditions for general ellipsoids are analytically derived using the concept of dual cones. These conditions are incorporated into the CBF framework by extending the system dynamics of controlled objects with separating hyperplanes, enabling efficient and reliable collision avoidance. The validity of the proposed collision-free CBFs is rigorously proven, ensuring their effectiveness in enforcing safety constraints. The proposed method requires only single-level optimization, significantly reducing computational time compared to state-of-the-art methods. Numerical simulations and real-world experiments demonstrate the effectiveness and practicality of the proposed algorithm.
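As background on how a CBF constraint turns a nominal command into a safe one, the minimal sketch below handles a single circular obstacle for a single-integrator robot (an illustration only; the paper's contribution is the separating-hyperplane construction for general ellipsoids). With h(x) = ||x - x_obs||^2 - r^2, safety requires dh/dt = 2 (x - x_obs)^T u >= -alpha h(x), and with one affine constraint the resulting QP has a closed-form solution:

    import numpy as np

    def cbf_filter(u_nom, x, x_obs, r, alpha=1.0):
        h = np.dot(x - x_obs, x - x_obs) - r**2
        a = 2.0 * (x - x_obs)              # so that dh/dt = a^T u
        b = -alpha * h
        if a @ u_nom >= b:                 # nominal command is already safe
            return u_nom
        # Closed-form minimizer of ||u - u_nom||^2 subject to a^T u >= b.
        return u_nom + (b - a @ u_nom) / (a @ a) * a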
|
|
15:05-15:10, Paper ThCT22.2 | |
CSC-MPPI: A Novel Constrained MPPI Framework with DBSCAN for Reliable Obstacle Avoidance |
|
Park, Leesai | Kyung Hee University |
Jang, Keunwoo | Korea Institute of Science and Technology |
Kim, Sanghyun | Kyung Hee University |
Keywords: Collision Avoidance, Motion Control, Optimization and Optimal Control
Abstract: This paper proposes Constrained Sampling Cluster Model Predictive Path Integral (CSC-MPPI), a novel constrained formulation of MPPI designed to enhance trajectory optimization while enforcing strict constraints on system states and control inputs. Traditional MPPI, which relies on a probabilistic sampling process, often struggles with constraint satisfaction and generates suboptimal trajectories due to the weighted averaging of sampled trajectories. To address these limitations, the proposed framework integrates a primal-dual gradient-based approach and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to steer sampled input trajectories into feasible regions while mitigating risks associated with weighted averaging. First, to ensure that sampled trajectories remain within the feasible region, the primal-dual gradient method is applied to iteratively shift sampled inputs while enforcing state and control constraints. Then, DBSCAN groups the sampled trajectories, enabling the selection of representative control inputs within each cluster. Finally, among the representative control inputs, the one with the lowest cost is chosen as the optimal action. As a result, CSC-MPPI guarantees constraint satisfaction, improves trajectory selection, and enhances robustness in complex environments. Simulation and real-world experiments demonstrate that CSC-MPPI outperforms traditional MPPI in obstacle avoidance, achieving improved reliability and efficiency. The experimental videos are available at https://cscmppi.github.io
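To show how clustering changes the usual MPPI selection step, the sketch below (simplified and under our own assumptions; it is not the CSC-MPPI algorithm) groups sampled control sequences with DBSCAN and returns the lowest-cost representative of the best cluster instead of the cost-weighted average over all samples:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def select_control(samples, costs, eps=0.3, min_samples=5):
        # samples: (N, H*m) flattened control sequences, costs: (N,)
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(samples)
        best_u, best_cost = None, np.inf
        for k in set(labels):
            if k == -1:                         # DBSCAN noise points
                continue
            idx = np.flatnonzero(labels == k)
            i = idx[np.argmin(costs[idx])]      # cluster representative
            if costs[i] < best_cost:
                best_u, best_cost = samples[i], costs[i]
        return best_u if best_u is not None else samples[np.argmin(costs)]

Averaging across clusters can blend trajectories that pass on opposite sides of an obstacle; selecting within a cluster avoids that failure mode, which is the motivation stated in the abstract.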
|
|
15:10-15:15, Paper ThCT22.3 | |
Mapless Collision-Free Flight Via MPC Using Dual KD-Trees in Cluttered Environments |
|
Zhang, Linzuo | Shanghai Jiao Tong University |
Hu, Yu | Shanghai Jiao Tong University |
Deng, Yang | Shanghai Jiao Tong University |
Yu, Feng | Shanghai Jiao Tong University |
Zou, Danping | Shanghai Jiao Tong University |
Keywords: Collision Avoidance, Vision-Based Navigation, Optimization and Optimal Control
Abstract: Collision-free flight in cluttered environments is a critical capability for autonomous quadrotors. Traditional methods often rely on detailed 3D map construction, trajectory generation, and tracking. However, this cascade pipeline can introduce accumulated errors and computational delays, limiting flight agility and safety. In this paper, we propose a novel method for enabling collision-free flight in cluttered environments without explicitly constructing 3D maps or generating and tracking collision-free trajectories. Instead, we leverage Model Predictive Control (MPC) to directly produce safe actions from sparse waypoints and point clouds from a depth camera. These sparse waypoints are dynamically adjusted online based on nearby obstacles detected from point clouds. To achieve this, we introduce a dual KD-Tree mechanism: the Obstacle KD-Tree quickly identifies the nearest obstacle for avoidance, while the Edge KD-Tree provides a robust initial guess for the MPC solver, preventing it from getting stuck in local minima during obstacle avoidance. We validate our approach through extensive simulations and real-world experiments. The results show that our approach significantly outperforms the mapping-based methods and is also superior to imitation learning-based methods, demonstrating reliable obstacle avoidance at up to 12 m/s in simulations and 6 m/s in real-world tests. Our method provides a simple and robust alternative to existing methods. The code is publicly available at https://github.com/SJTU-ViSYS-team/avoid-mpc
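For reference, the obstacle query at the core of such a pipeline is inexpensive; the snippet below (our own minimal example, not the released code) builds a KD-tree over the latest depth point cloud and returns the nearest obstacle point for each position along the MPC prediction horizon:

    import numpy as np
    from scipy.spatial import cKDTree

    def nearest_obstacles(point_cloud, predicted_positions):
        tree = cKDTree(point_cloud)                   # (N, 3) points from the depth camera
        dists, idx = tree.query(predicted_positions)  # one query per horizon node
        return dists, point_cloud[idx]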
|
|
15:15-15:20, Paper ThCT22.4 | |
MmWave Radar-Based Non-Line-Of-Sight Pedestrian Localization at T-Junctions Utilizing Road Layout Extraction Via Camera |
|
Park, Byeonggyu | Seoul National University |
Kim, Hee-Yeun | Seoul National University |
Choi, Byonghyok | Samsung Electro-Mechanics |
Cho, Hansang | Samsung Electro-Mechanics |
Kim, Byungkwan | Chungnam National University |
Lee, Soomok | Ajou University |
Jeon, Mingu | Seoul National University |
Kim, Seong-Woo | Seoul National University |
Keywords: Collision Avoidance, Intelligent Transportation Systems, Sensor Fusion
Abstract: Pedestrians Localization in Non-Line-of-Sight (NLoS) regions within urban environments poses a significant challenge for autonomous driving systems. While mmWave radar has demonstrated potential for detecting objects in such scenarios, the 2D radar point cloud (PCD) data is susceptible to distortions caused by multipath reflections, making accurate spatial inference difficult. Additionally, although camera images provide high-resolution visual information, they lack depth perception and cannot directly observe objects in NLoS regions. In this paper, we propose a novel framework that interprets radar PCD through road layout inferred from camera for localization of NLoS pedestrians. The proposed method leverages visual information from the camera to interpret 2D radar PCD, enabling spatial scene reconstruction. The effectiveness of the proposed approach is validated through experiments conducted using a radar-camera system mounted on a real vehicle. The localization performance is evaluated using a dataset collected in outdoor NLoS driving environments, demonstrating the practical applicability of the method.
|
|
15:20-15:25, Paper ThCT22.5 | |
Gradient Field-Based Dynamic Window Approach for Collision Avoidance in Complex Environments |
|
Zhang, Ze | Chalmers University of Technology |
Xue, Yifan | University of Pennsylvania |
Figueroa, Nadia | University of Pennsylvania |
Akesson, Knut | Chalmers University of Technology |
Keywords: Collision Avoidance, Motion and Path Planning, Path Planning for Multiple Mobile Robots or Agents
Abstract: For safe and flexible navigation in multi-robot systems, this paper presents an enhanced and predictive sampling-based trajectory planning approach in complex environments, the Gradient Field-based Dynamic Window Approach (GF-DWA). Building upon the dynamic window approach, the proposed method utilizes gradient information of obstacle distances as a new cost term to anticipate potential collisions. This enhancement enables the robot to improve awareness of obstacles, including those with non-convex shapes. The gradient field is derived from the Gaussian process distance field, which generates both the distance field and gradient field by leveraging Gaussian process regression to model the spatial structure of the environment. Through several obstacle avoidance and fleet collision avoidance scenarios, the proposed GF-DWA is shown to outperform other popular trajectory planning and control methods in terms of safety and flexibility, especially in complex environments with non-convex obstacles.
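A schematic of how a gradient term enters a dynamic-window cost is sketched below (notation and weights are our own assumptions; the paper obtains dist_field and grad_field from a Gaussian process distance field). Each candidate velocity pair (v, w) is rolled out with a unicycle model and penalized for low clearance and for heading down the distance gradient, i.e. toward obstacles:

    import numpy as np

    def dwa_cost(v, w, state, goal, dist_field, grad_field,
                 dt=0.1, horizon=10, w_goal=1.0, w_clear=1.0, w_grad=0.5):
        px, py, th = state
        cost = 0.0
        for _ in range(horizon):                                 # unicycle rollout
            px, py, th = px + v*np.cos(th)*dt, py + v*np.sin(th)*dt, th + w*dt
            p = np.array([px, py])
            heading = np.array([np.cos(th), np.sin(th)])
            cost += w_goal * np.linalg.norm(goal - p)            # progress term
            cost += w_clear / (dist_field(p) + 1e-3)             # clearance term
            cost += w_grad * max(0.0, -heading @ grad_field(p))  # gradient term
        return cost

The planner would evaluate this cost over the sampled dynamic window and execute the lowest-cost velocity.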
|
|
15:25-15:30, Paper ThCT22.6 | |
DGVO: A Dynamically Constrained Gradient Velocity Obstacle Approach for Mobile Robots in Dynamic Environments |
|
Xiao, Bowen | ShenZhen University |
Zhang, Bo | Shenzhen University, Shenzhen 518060, China |
Zhang, Danyu | SHENZHEN UNIVERSITY |
Xie, Peiyan | Shenzhen University |
Wang, XinYu | ShenZhen University |
Li, Ruocheng | Beijing Institute of Technology |
Keywords: Collision Avoidance, Kinematics, Nonholonomic Mechanisms and Systems
Abstract: In this paper, we propose a framework based on velocity obstacles to address the dynamic obstacle avoidance problem for constrained mobile robots. The framework establishes a nonlinear mapping from the control domain to the velocity space based on the robot's kinematic model and input constraints. This mapping defines the Velocity Feasible Region (VFR) as the set of reachable velocities at the next time step. Based on the VFR, we propose a gradient field, called the Dynamically Constrained Gradient Velocity Obstacle (DGVO), to represent the feasible motion region for mobile robots. DGVO preserves the original feasible region of the mobile robot. Based on DGVO, we formulate an unconstrained gradient descent optimization problem to compute collision-free velocities in real time. This framework enables real-time online computation of collision-free velocities for any constrained mobile robot, and it exhibits strong robustness to sensor noise. Extensive simulations and real-world experiments have validated the effectiveness of the proposed method. An introduction to the entire work can be found at the following link: https://youtu.be/HrTNTSOhKvE.
|
|
15:30-15:35, Paper ThCT22.7 | |
AVOCADO: Adaptive Optimal Collision Avoidance Driven by Opinion |
|
Martinez-Baselga, Diego | University of Zaragoza |
Sebastián, Eduardo | University of Cambridge |
Montijano, Eduardo | Universidad De Zaragoza |
Riazuelo, Luis | Universidad De Zaragoza |
Sagues, Carlos | Universidad De Zaragoza |
Montano, Luis | Universidad De Zaragoza |
Keywords: Collision Avoidance, Multi-Robot Systems, Motion and Path Planning, Opinion Dynamics
Abstract: We present AVOCADO (AdaptiVe Optimal Collision Avoidance Driven by Opinion), a novel navigation approach to address holonomic robot collision avoidance when the robot does not know how cooperative the other agents in the environment are. AVOCADO departs from a Velocity Obstacle (VO) formulation akin to the Optimal Reciprocal Collision Avoidance method. However, instead of assuming reciprocity, it poses an adaptive control problem to adapt in real time to the cooperation level of other robots and agents. This is achieved through a novel nonlinear opinion dynamics design that relies solely on sensor observations. As a by-product, we leverage tools from the opinion dynamics formulation to naturally avoid the deadlocks in geometrically symmetric scenarios that VO-based planners typically suffer from. Extensive numerical simulations show that AVOCADO surpasses existing motion planners in mixed cooperative/non-cooperative navigation environments in terms of success rate, time to goal and computational time. In addition, we conduct multiple real experiments that verify that AVOCADO is able to avoid collisions in environments crowded with other robots and humans.
|
|
15:35-15:40, Paper ThCT22.8 | |
Flexible Active Safety Motion Control for Robotic Obstacle Avoidance: A CBF-Guided MPC Approach |
|
Liu, Jinhao | China University of Mining and Technology |
Yang, Jun | Loughborough University |
Mao, Jianliang | Shanghai University of Electric Power |
Zhu, Tianqi | Southeast University |
Xie, Qihang | Shanghai University of Electric Power |
Yimeng, Li | Southeast University |
Wang, Xiangyu | Southeast University |
Li, Shihua | Southeast University |
Keywords: Collision Avoidance, Motion and Path Planning, Motion Control
Abstract: A flexible active safety motion (FASM) control approach is proposed for dynamic obstacle avoidance in robot manipulators. The key feature of this method is the use of control barrier functions (CBF) to design flexible CBF-guided safety criteria (CBFSC) with dynamically optimized decay rates, providing both flexibility and active safety in dynamic environments for robot manipulators. First, discrete-time CBFs are utilized to formulate the new flexible CBFSC with dynamic decay rates, which is then integrated into the model predictive control (MPC) framework. The flexible CBFSC serves as safety constraints within the receding-horizon optimization problem. Notably, the decay rates of the CBFSC are incorporated as decision variables, allowing for dynamic adaptability during obstacle avoidance. In addition, a new cost function with an integrated penalty term is designed to dynamically adjust the safety margins of the CBFSC. Finally, experiments in various scenarios using a Universal Robots 5 (UR5) manipulator validate the effectiveness of the proposed approach.
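For context, a discrete-time CBF safety condition of the kind referred to above is commonly written (notation assumed here, not taken from the paper) as h(x_{k+1}) >= (1 - gamma_k) * h(x_k) with 0 < gamma_k <= 1, where h is the barrier function, e.g. a signed distance between the manipulator and the obstacle. Treating gamma_k as a decision variable of the receding-horizon problem, and penalizing it in the cost, is what makes the constraint flexible: the optimizer can slow the permitted decay of h when an obstacle approaches instead of obeying a fixed, conservative rate.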
|
|
ThCT23 |
102B |
Force and Tactile Sensing 6 |
Regular Session |
Co-Chair: Fang, Bin | Beijing University of Posts and Telecommunications / Tsinghua University |
|
15:00-15:05, Paper ThCT23.1 | |
ViaTac: A High-Resolution Piezoresistive Tactile Sensor Array with Conformal Contact Surface for Shape Reconstruction |
|
Du, Yanjun | The Chinese University of Hong Kong |
Lou, Yuancheng | The Chinese University of Hong Kong |
Xu, Dongyan | The Chinese University of Hong Kong |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Recognition
Abstract: Tactile sensing is crucial for robots to achieve human-like manipulation capabilities and safe interaction with the environment. Existing piezoresistive tactile sensors often suffer from limited spatial resolution and poor conformability to contact objects due to their unstretchable contact surfaces. In this paper, we present a low-cost and easily fabricated piezoresistive tactile sensor array that utilizes flexible printed circuit (FPC) vias as electrodes to achieve high spatial resolution (64/cm^2), while incorporating stretchable materials for surface encapsulation to enable conformal contact with objects. Unified material selection for both the sensing and encapsulation layers ensures robust performance. We characterized the sensing performance, mechanical durability, and uniformity of the sensor array and further demonstrated its practical applications in contact shape reconstruction.
|
|
15:05-15:10, Paper ThCT23.2 | |
Three-Axis Flat and Lightweight Force/Torque Sensor for Enhancing Kinesthetic Sensing Capability of Robotic Hand (I) |
|
Park, Sungwoo | Korea University, KIST |
Hwang, Donghyun | Korea Institute of Science and Technology |
Keywords: Force and Tactile Sensing, Multifingered Hands, In-Hand Manipulation
Abstract: When using a robot hand to grip objects, visual information is frequently adopted to determine the external characteristics of objects. However, in reality, purely visual data are insufficient to grasp unspecified objects; somatosensory feedback, which includes tactile and kinesthetic feedback, is required to understand the diverse intrinsic properties of objects. We implement a three-axis force/torque (F/T) sensor to measure the interaction force at the joints of a robotic hand and determine the various intrinsic characteristics that aid in grasping previously unknown objects. We primarily focus on a modular, compact, and lightweight structure to embed the sensors in the robotic hand. A novel radially symmetric diaphragm flexure structure is designed to achieve this objective. Several experiments are conducted to validate the performance of the F/T sensor. The functionality of the kinesthetic feedback is demonstrated using a stiffness and weight measurement experiment. The experimental findings demonstrate that the proposed sensor-embedded robotic hand can measure stiffness and weight with errors of 0.0073 N/mm and 8.72 g, respectively.
|
|
15:10-15:15, Paper ThCT23.3 | |
RoTipBot: Robotic Handling of Thin and Flexible Objects Using Rotatable Tactile Sensors |
|
Jiang, Jiaqi | Beijing Institute of Technology |
Zhang, Xuyang | King's College London |
Fernandes Gomes, Daniel | Kings College London |
Do, Thanh-Toan | Monash University |
Luo, Shan | King's College London |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Dexterous Manipulation, Tactile Robotics
Abstract: This paper introduces RoTipBot, a novel robotic system for handling thin, flexible objects. Different from previous works that are limited to singulating them using suction cups or soft grippers, RoTipBot can count multiple layers and then grasp them simultaneously in a single grasp closure. Specifically, we first develop a vision-based tactile sensor named RoTip that can rotate and sense contact information around its tip. Equipped with two RoTip sensors, RoTipBot rolls and feeds multiple layers of thin, flexible objects into the centre between its fingers, enabling effective grasping. Moreover, we design a tactile-based grasping strategy that uses RoTip’s sensing ability to ensure both fingers maintain secure contact with the object while accurately counting the number of fed objects. Extensive experiments demonstrate the efficacy of the RoTip sensor and the RoTipBot approach. The results show that RoTipBot not only achieves a higher success rate but also grasps and counts multiple layers simultaneously – capabilities not possible with previous methods. Furthermore, RoTipBot operates up to three times faster than state-of-the-art methods. The success of RoTipBot paves the way for future research in object manipulation using mobilised tactile sensors. All the materials used in this paper are available at https://sites.google.com/view/rotipbot.
|
|
15:15-15:20, Paper ThCT23.4 | |
Spherical-Joint Force Measurement Enables Wheel Force Sensing in Vehicles (I) |
|
Shu, Ran | Chongqing University |
Chu, Zhigang | Chongqing University |
Li, Li | Chongqing University |
Shu, Hongyu | Chongqing University |
Keywords: Force and Tactile Sensing, Wheeled Robots, Mechanism Design
Abstract: This article presents a wheel force sensing method based on spherical-joint force measurement. The force measurement uses strain gauges on L-shaped linkages connecting the wheel to the double wishbone suspension system. It does not require wireless power supply or signal transmission because it does not rotate with the wheel. Using the measured joint forces to solve the suspension mechanism configuration, the wheel position and orientation (relative to the vehicle) can be calculated, eliminating the need for position sensors. Through continuous calculation of the position and orientation, the inertia force and moment can be estimated and used in the Newton-Euler formulation for wheel force calculation. Experimental results show a root mean square error of the three-axis wheel forces ≤8.26 N and a mean error of the three-axis wheel forces ≤4.06 N, which validates the feasibility of the proposed method.
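As a schematic of the Newton-Euler step described above (notation assumed, not taken from the article), the tire-road force follows from the wheel's force balance: F_tire = m_w * a_w - sum_i F_joint,i - m_w * g, i.e. what remains after subtracting the measured linkage (spherical-joint) forces and gravity from the wheel's inertia force, with the acceleration a_w obtained by differentiating the wheel position computed from the suspension geometry.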
|
|
15:20-15:25, Paper ThCT23.5 | |
A Compact, Cost-Effective, and Highly Sensitive Optical Blocking Structure (OBS) Tactile Sensor for Enhanced Robotic Grasping (I) |
|
Mo, Liyan | South China University of Thechnology |
Li, Yunquan | South China University of Technology |
Xia, Jiutian | South China University of Technology |
Fu, Shiling | South China University of Technology |
Zhang, Yuan-Fang | South China University of Technology |
Yang, Yang | Nanjing University of Information Science and Technology |
Ren, Tao | Chengdu University of Technology |
Wu, Changchun | The University of Hong Kong |
Chen, Yonghua | The University of Hong Kong |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Dexterous Manipulation
Abstract: Haptic perception is essential for robotic grippers performing delicate handling tasks, yet designing a tactile sensor that is both low-cost and high-performance remains a challenge. This article addresses this issue by developing an optical blocking structure (OBS) sensor for tactile perception in robotic grippers. The OBS sensor is characterized by its ease of fabrication, cost-effectiveness (with a unit cost of 1.45), and high-performance metrics, including a maximum sensitivity of approximately 2.18 V/N, a wide pressure response range (>4 N), high resolution (0.009 N), a fast response time (15 ms), and a minimal dead zone (around 0.026 N). Experimental results demonstrate that a robotic gripper equipped with the OBS sensor can effectively detect object hardness, surface texture, and perform delicate force control tasks. Additionally, the OBS sensor exhibits high scalability. It can be configured into various array forms to enhance the gripper’s sensing capabilities based on its structural characteristics. Validation experiments confirm that robotic grippers integrated with 2 × 2 and 1 × 6 array configurations of the OBS sensor can accurately detect multiple contact positions and forces. Consequently, the OBS sensor offers a straightforward and effective solution for enhancing tactile sensing in robotic grippers.
|
|
15:25-15:30, Paper ThCT23.6 | |
MagicGel: A Novel Visual-Based Tactile Sensor Design with Magnetic Gel |
|
Shan, Jianhua | Anhui University of Technology |
Zhao, Jie | Anhui University of Technology |
Liu, Jiangduo | University of Science and Technology Beijing |
Wang, Xiangbo | College of Quality and Technology Supervising, Hebei University, |
Xia, Ziwei | China University of Geosciences, Haidian District, Beijing, Chin |
Chen, Guangzeng | Harbin Institute of Technology, Shenzhen |
Ren, Zeyu | ByteDance |
Xu, Guangyuan | Beijing University of Posts and Telecommunications |
Fang, Bin | Beijing University of Posts and Telecommunications / Tsinghua Un |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Sensor Fusion
Abstract: Force estimation is the core indicator for evaluating the performance of tactile sensors, and it is also the key technical path to achieving precise force feedback mechanisms. This study proposes a design method for a visual tactile sensor (VBTS) that integrates a magnetic perception mechanism, and develops a new tactile sensor called MagicGel. The sensor uses strong magnetic particles as markers and captures magnetic field changes in real time through Hall sensors. On this basis, MagicGel achieves the coordinated optimization of multimodal perception capabilities: it not only has fast response characteristics, but also can perceive non-contact status information of home electronic products. Specifically, MagicGel simultaneously analyzes the visual characteristics of magnetic particles and the multimodal data of changes in magnetic field intensity, ultimately improving force estimation capabilities.
|
|
15:30-15:35, Paper ThCT23.7 | |
A Novel Fiber Bragg Grating Three-Dimensional Force Sensor for Medical Robotics (I) |
|
Liang, Qiaokang | Hunan University; University of Ontario Institute of Technology |
Ouyang, Songtao | Hunan University |
Long, Jianyong | Hunan University |
Zhou, Li | Hunan University |
Zhang, Dan | The Hong Kong Polytechnic University |
Keywords: Force and Tactile Sensing, Medical Robots and Systems, Cyborgs
Abstract: This article introduces a novel fiber Bragg grating (FBG) three-dimensional force sensor designed for the end-effectors of medical robots. The sensor adopts a specially designed layered elastic structure and achieves miniaturized, structurally self-decoupled, and highly sensitive three-dimensional force measurement through an ingeniously compact spatial design and a rational layout of five FBGs. The constructed theoretical model successfully decouples temperature and force. Simulation experiments determine the sensor's operating frequency range to be 0–403.6 Hz, which is verified through rapid prototyping using 3D printing technology. Static experiments show that the sensor's maximum measurement range and minimum resolution are ±5 N and 4.95 mN, respectively. The maximum sensitivity and minimum inter-axis coupling are determined to be 201.86 pm/N and 0.01% FS (Fy–Fz), respectively. In dynamic experiments, the minimum tracking error is 3.89% FS, and the sensor is successfully adapted to the end-effector of a robotic arm. Notably, the senso
|
|
ThCT24 |
102C |
Compliance and Control |
Regular Session |
|
15:00-15:05, Paper ThCT24.1 | |
Iterative Learning for Gravity Compensation in Impedance Control (I) |
|
Li, Teng | The Hospital for Sick Children |
Zakerimanesh, Amir | University of Alberta |
Ou, Yafei | University of Alberta |
Badre, Armin | University of Alberta |
Tavakoli, Mahdi | University of Alberta |
Keywords: Compliance and Impedance Control, Physical Human-Robot Interaction, Surgical Robotics: Laparoscopy
Abstract: Robot-assisted arthroscopic surgery has been increasingly receiving attention in orthopedic surgery. To build a robot-assisted system, dynamic uncertainties can be a critical issue that could cause robot performance inaccuracy or even system instability if they cannot be appropriately compensated. A disturbance observer is a common tool for disturbance estimation and compensation that treats all uncertainties as disturbances, but this precludes human–robot interaction since the human-applied force would also be regarded as a disturbance by the observer. Iterative learning for gravity compensation can be another promising way to solve this problem when gravity compensation is the main concern. In this article, a gravity iterative learning (Git) scheme in Cartesian space for gravity compensation, integrated with an impedance controller, is presented. A steady-state scaling strategy is then proposed, which relaxes the updating requirements of the learning scheme and extends its validity from set-point regulation to trajectory-tracking scenarios. The derivation and convergence properties of the Git scheme are presented and theoretically analyzed, respectively. A series of simulations and physical experiments are conducted to evaluate the validity of the scaling strategy, the learning accuracy of the Git scheme, and the effectiveness of the learning-based impedance controller. Both simulation and experimental results demonstrate good performance and properties of the Git scheme and the learning-based impedance controller.
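To illustrate the interplay between impedance control and a learned gravity term, the sketch below is a simplified schematic under our own assumptions (it is not the paper's Git scheme): a Cartesian impedance law with an additive estimate g_hat that is refined between repetitions of a set-point task using the steady-state position error, since at rest the stiffness force K e balances the uncompensated gravity wrench.

    import numpy as np

    def impedance_force(x, xd, x_des, K, D, g_hat):
        # Stiffness and damping terms plus the current gravity estimate.
        return K @ (x_des - x) - D @ xd + g_hat

    def update_gravity_estimate(g_hat, K, steady_state_error, beta=0.5):
        # At steady state, K e ~= residual (uncompensated) gravity, so fold a
        # fraction of it into the estimate; 0 < beta < 2 keeps the update stable.
        return g_hat + beta * (K @ steady_state_error)

A learning rule driven by steady-state behavior targets only the repeatable gravity effect, in contrast to a disturbance observer that would also absorb the human-applied force, which is the distinction drawn in the abstract.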
|
|
15:05-15:10, Paper ThCT24.2 | |
Confutation of the “Counterexample to Passivity Preservation for Variable Impedance Control of Compliant Robots” (I) |
|
Spyrakos-Papastavridis, Emmanouil | King's College London |
Dai, Jian | School of Natural and Mathematical Sciences, King's College Lond |
Childs, Peter R. N. | Imperial College London |
Keywords: Compliance and Impedance Control
Abstract: This letter reveals the incorrect argumentation, erroneous mathematical calculations, and misleading observations reported in the article titled “Counterexample to Passivity Preservation for Variable Impedance Control of Compliant Robots” published in volume 28, issue 1 of this journal, in February 2023. All the calculations in the said article are either incorrect or redundant, and this letter apodictically reveals that the peculiar simulation results therein contained were produced from a failure to follow the implementation guidelines delineated in the original publication. This letter further generates a set of examples to confute the so-called “counterexamples,” thereby proving the erroneousness of the article’s argumentation. Furthermore, many of the incorrect claims contained in the said article stand in stark opposition to results and theories reported in the robotics literature over the last three decades. This letter provides point-to-point refutations of the erroneous comments and calculations contained in all five pages of the article titled “Counterexample to Passivity Preservation for Variable Impedance Control of Compliant Robots” (Medrano-Cerda, 2023), thereby invalidating all its arguments and exposing its want of substantiality to the reader.
|
|
15:10-15:15, Paper ThCT24.3 | |
Human-Inspired Robotic Assembly for Multiple Peg-In/out-Hole Tasks in On-Orbit Refueling |
|
Zhang, Rui | Beijing Institute of Control Engineering |
Zhang, Qiang | Beijing Institute of Control Engineering |
Zhou, Xiaodong | Beijing Institute of Control Engineering |
Keywords: Compliant Assembly, Compliance and Impedance Control, Space Robotics and Automation
Abstract: On-orbit refueling technology requires robots with multiple peg-in-hole and peg-out-hole capabilities. Complex contact conditions often lead to jamming issues, presenting significant challenges for automated refueling. In this paper, we propose a human-inspired multiple peg-in/out-hole assembly method. This method integrates a variable admittance force controller based on a non-diagonal stiffness matrix and a strategy for managing multiple peg-in/out-hole operations. By coupling position and attitude stiffness, the robot's adaptability in dynamic assembly environments is significantly enhanced. Additionally, the method enables autonomous posture adjustment based on real-time force sensor data and allows the robot to retry operations in case of jamming, thereby eliminating the need for complex motion trajectory planning. Extensive ground-based refueling experiments validate the effectiveness of our approach and its capability to resist external disturbances.
|
|
15:15-15:20, Paper ThCT24.4 | |
Sensor-Based Adaptive Robust Torque Control for Flexible Joints |
|
Dai, Junjie | Ningbo Institute of Materials Technology and Engineering, CAS |
Chen, Chin-Yin | Ningbo Institute of Material Technology and Engineering, CAS |
Zhong, Ying | Jiangxi University of Science and Technology |
Guo, He | Jiangxi University of Science and Technology |
Yang, Guilin | Ningbo Institute of Material Technology and Engineering, Chines |
Zhang, Chi | Ningbo Institute of Material Technology and Engineering, CAS |
Keywords: Compliant Joints and Mechanisms, Force Control, Sensor-based Control
Abstract: Achieving a fast transient response and high steady-state tracking accuracy of joint torque for flexible joint systems with dynamic constraints, various parameter uncertainties, and uncertain nonlinearities is always challenging. To this end, a torque control scheme based on multi-torque sensors is proposed. By analyzing and transforming the joint dynamics, a lumped disturbance signal, including the reducer drive’s torque loss, is constructed, which can be measured by the torque sensors. Then, a nonlinear disturbance observer (NDOB) and an adaptive robust torque control (ARTC) law are introduced to synthesize the control scheme, where the NDOB estimates the motor’s friction, and ARTC realizes stable tracking of the joint torque. The performance of the proposed controller is theoretically ensured by Lyapunov analysis. Finally, a series of hardware experiments are carried out. The results suggest the effectiveness and superior performance of the proposed method for joint torque control by comparing it with other traditional methods.
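As a toy illustration of the disturbance-observer idea mentioned above (not the paper's NDOB/ARTC design), the sketch below estimates a lumped disturbance acting on a single motor from measured velocity and commanded torque. The plant model J·ω̇ = τ + d, the observer gain, and all numerical values are assumptions.

```python
# Toy first-order disturbance observer for a 1-DOF motor.
import numpy as np

J, L, dt = 0.01, 50.0, 0.001      # inertia [kg m^2], observer gain, sample time [s]
d_true = -0.2                     # unknown lumped friction/loss torque [Nm]
omega, z = 0.0, 0.0               # plant velocity and observer internal state

for k in range(3000):
    tau = 0.5 * np.sin(2 * np.pi * k * dt)    # excitation torque command
    omega += (tau + d_true) / J * dt          # plant: J * omega_dot = tau + d
    d_hat = z + L * J * omega                 # disturbance estimate
    z += -L * (tau + d_hat) * dt              # observer state update
print(f"estimated disturbance: {d_hat:.3f} Nm (true {d_true})")
```

The estimation error obeys roughly ė = -L·e, so the observer converges to the constant disturbance with a time constant of about 1/L seconds.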
|
|
15:20-15:25, Paper ThCT24.5 | |
A Driving–Clamping Integrated Inchworm Linear Piezoelectric Actuator with Miniaturization and High Thrust Density (I) |
|
Guan, Jinghan | Harbin Institute of Technology |
Zhang, Shijing | Harbin Institute of Technology |
Deng, Jie | Harbin Institute of Technology |
Junkao, Liu | Harbin Institute of Technology |
Liu, Yingxiang | Harbin Institute of Technology |
Keywords: Compliant Joints and Mechanisms, Mechanism Design, Automation at Micro-Nano Scales
Abstract: This article presents an inchworm linear piezoelectric actuator (ILPA) to address the prevalent issues of bulk and complexity in existing ILPAs. It has the characteristics of miniaturization with a size of 44×13×24 mm^3 and a weight of 36.0 g. It integrates multiple functions within a compact form factor. The mechanical structure is simplified to a stator and an output shaft. The integrally machined stator is used as a guide unit without additional guideways and components. A key feature of the ILPA is its integrated driving mechanism that combines driving and clamping actions. This mechanism is designed as a bridge-type displacement amplifier, where the main displacement serves the driving function, while the parasitic displacement provides the clamping action. Experimental results show that the ILPA achieves a maximum speed of 1691.8 μm/s under the condition of square wave excitation with a frequency of 417 Hz and an input voltage of 120 V. Furthermore, it delivers a thrust of 1.5 N and a self-locking force of 2.1 N. Its force-to-volume ratio with a thrust density of 1.09×10^-4 N/mm^3 stands out. It has been successfully employed in several practical applications, including detecting chips and microscopic observation tasks.
|
|
15:25-15:30, Paper ThCT24.6 | |
FLEXIV: Adaptive Locomotion Via Morphological Changes in a Flexible Track Vehicle |
|
Kim, Sareum | EPFL |
Filimonov, Daniil | Nazarbayev University |
Hughes, Josie | EPFL |
Keywords: Compliant Joints and Mechanisms, Robust/Adaptive Control, Soft Robot Applications
Abstract: Flexible and adaptive tracked robots show the capacity to navigate in unstructured terrain, with a passive adaptation of the track or a morphology change for terrain adaptation. Extending the concept to one in which the morphology can be actively changed has the potential to increase the capabilities of a locomoting robot while reducing the need for complex controllers and onboard navigation systems. This article introduces FLEXIV, a 232-g untethered robotic vehicle with flexible, magnet-equipped tracks that achieves adaptive locomotion across diverse geometries of ferrous terrains with appropriate transitioning of its track shape from circular to oblong. By modulating the configuration of a track loop, the robot can adjust its driving capability, such as traction and steering ability, to optimize its behavior to the terrain. This deformable robot is combined with an autonomous controller that leverages only robot posture information through inertial measurement units (IMUs) for terrain estimation to autonomously adapt the robot’s configuration to the environment. This enables the robot to autonomously navigate complex terrains, including diverse slopes and steps, and offers recovery actions for extreme falls.
|
|
15:30-15:35, Paper ThCT24.7 | |
Variable Stiffness Actuation Via 3D-Printed Nonlinear Torsional Springs |
|
Höppner, Hannes | Berliner Hochschule Für Technik, BHT |
Kirner, Annika | TU Wien |
Goettlich, Joshua Jonah | TU Wien |
Jakob, Linnéa | Berliner Hochschule Für Technik |
Dietrich, Alexander | German Aerospace Center (DLR) |
Ott, Christian | TU Wien |
Keywords: Compliant Joints and Mechanisms, Mechanism Design, Additive Manufacturing
Abstract: Variable Stiffness Actuators (VSAs) are promising for advanced robotic systems, offering advantages such as improved energy efficiency, impact safety, adaptability in stiffness, mechanical robustness, and dynamic versatility. However, traditional designs often rely on complex mechanical assemblies to achieve nonlinear torque--deflection characteristics, increasing system intricacy and introducing potential points of failure. This paper presents the design, implementation, and validation of a novel VSA that drastically simplifies mechanical complexity by utilizing 3D-printed progressive nonlinear torsional springs (3DNS). By directly 3D-printing torsional springs, we enable precise control over nonlinear behavior through strategic variation of spring geometry. Empirical testing and finite element simulations demonstrate that our springs exhibit low hysteresis, low variance across samples, and a strong correlation between simulated and measured behavior. Integrating these springs into a VSA with two brushless DC motors demonstrates the feasibility of achieving high-performance VSAs with low damping, minimal hysteresis, and stiffness that aligns well with modeled predictions. Our findings suggest that this approach offers a cost-effective and accessible solution for the development of high-performance VSAs.
|
|
15:35-15:40, Paper ThCT24.8 | |
A Compact Variable Stiffness Actuator for Agile Legged Locomotion (I) |
|
Yu, Lei | Xi'an Jiaotong Liverpool University |
Zhao, Haizhou | New York University |
Qin, Siying | Xi an Jiaotong Liverpool University |
Jin, Gumin | Shanghai Jiao Tong University |
Chen, Yuqing | Xi'an Jiaotong-Liverpool University |
Keywords: Compliant Joints and Mechanisms, Legged Robots
Abstract: Legged robots with variable stiffness actuators (VSAs) can achieve energy-efficient and versatile locomotion. However, equipping legged robots with VSAs in real-world applications is usually restricted by (i) redundant mechanical structure design, (ii) limited stiffness variation range and speed, (iii) high energy consumption in stiffness modulation, and (iv) the lack of an online stiffness control structure with high performance. In this paper, we present a novel Variable-Length Leaf-Spring Actuator (VLLSA) designed for legged robots that aims to address the aforementioned limitations. The design is based on a leaf-spring mechanism, and we improve the structural design to make the proposed VSA (i) compact and lightweight in mechanical structure, (ii) precise in theoretical modeling, and (iii) capable of modulating stiffness with wide range, fast speed, low energy consumption and high control performance. Hardware experiments including in-place and forward hopping validate the advantages of the proposed VLLSA.
|
|
ThCT25 |
103A |
Soft Robot Applications |
Regular Session |
Chair: Sun, Jiefeng | Arizona State University |
Co-Chair: Stroppa, Fabio | Kadir Has University |
|
15:00-15:05, Paper ThCT25.1 | |
A Hybrid Variable-Stiffness Soft Back Support Device |
|
Khatavkar, Rohan Vijay | Arizona State University |
Nguyen, The Bach | Arizona State University |
Chen, Yuanhao | Arizona State University |
Lee, Hyunglae | Arizona State University |
Sun, Jiefeng | Arizona State University |
Keywords: Soft Robot Applications, Prosthetics and Exoskeletons, Wearable Robotics
Abstract: Back support devices (BSDs) have the potential to mitigate overexertion in industrial tasks and also to provide assistance to people with weak back muscle strength in daily activity. While state-of-the-art active BSDs can offer a high assistive force, they are bulky and heavy, making them uncomfortable for daily use. On the contrary, passive BSDs are compact but need manual adjustment to be versatile. This work presents a hybrid soft BSD that can provide task-oriented assistance by tuning the stiffness (0.58 N/mm, 0.92 N/mm, and 1.7 N/mm) and slack length (0 mm to 67 mm) in a compact design. The tunable stiffness allows for selecting a task-specific force profile, and the slack tuning will ensure that the device enables unhindered movement when assistance is not required. Compared with rigid devices, the device's compliance can potentially increase human comfort. We propose an analytical model that facilitates device design and estimates the device performance. Furthermore, the device's tuning capabilities are evaluated in human squatting and stooping experiments, showing that the desired force profile is correctly applied.
|
|
15:05-15:10, Paper ThCT25.2 | |
A Motion Planner for Growing Reconfigurable Inflated Beam Manipulators in Static Environments |
|
Altagiuri, Rawad Elmahdi Hasan | Marmara University |
Zaghloul, Omar Hisham Abdelhakam | Kadir Has University |
Do, Brian | Oregon State University |
Stroppa, Fabio | Kadir Has University |
Keywords: Soft Robot Applications, Constrained Motion Planning, Collision Avoidance
Abstract: Soft growing robots have the potential to be useful for complex manipulation tasks and navigation for inspection or search and rescue. They are designed with plant-like properties, allowing them to evert and steer multiple links and explore cluttered environments. However, this variety of operations results in multiple paths, which is one of the biggest challenges faced by classic pathfinders. In this letter, we propose a motion planner based on A∗ search specifically designed for soft growing manipulators operating on predetermined static tasks. Furthermore, we implemented a stochastic data structure to reduce the algorithm’s complexity as it explores alternative paths. This allows the planner to retrieve optimal solutions over different tasks. We ran demonstrations on a set of three tasks, observing that this stochastic process does not compromise path optimality.
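The planner above is built on A* search. As a point of reference, the sketch below is a plain grid A* in Python; it omits the paper's stochastic data structure and the eversion/steering constraints of soft growing manipulators, and the grid, start, and goal are illustrative.

```python
# Plain grid A* (Manhattan heuristic), a simplified stand-in for the planner.
import heapq, itertools

def astar(grid, start, goal):
    """grid: list of strings, '#' = obstacle; start/goal: (row, col)."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    tie = itertools.count()                      # tiebreaker for heap ordering
    open_heap = [(h(start), next(tie), 0, start, None)]
    closed, g_cost = {}, {start: 0}
    while open_heap:
        _, _, g, node, parent = heapq.heappop(open_heap)
        if node in closed:
            continue
        closed[node] = parent
        if node == goal:                         # reconstruct path from parents
            path = [node]
            while closed[path[-1]] is not None:
                path.append(closed[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] != '#':
                ng = g + 1
                if ng < g_cost.get(nxt, float('inf')):
                    g_cost[nxt] = ng
                    heapq.heappush(open_heap, (ng + h(nxt), next(tie), ng, nxt, node))
    return None

grid = ["....#....",
        "....#....",
        ".........",
        "....#...."]
print(astar(grid, (0, 0), (3, 8)))
```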
|
|
15:10-15:15, Paper ThCT25.3 | |
Autologous Variable Stiffness Soft Finger Based on Cross-Layer Jamming for Multimode Grasping |
|
Huang, Jie | South China University of Technology |
Gai, Ling-Jie | Huazhong University of Science and Technology |
Shen, Lingrui | South China University of Technology |
Cai, Qingqian | South China University of Technology |
Li, Yunquan | South China University of Technology |
Li, Yingtian | Shenzhen Institutes of Advanced Technology, Chinese Academy of S |
Gao, Xuemin | Guangzhou University |
Zong, Xiaofeng | China University of Geosciences |
Keywords: Soft Robot Applications, Soft Sensors and Actuators, Grippers and Other End-Effectors
Abstract: Layer jamming variable stiffness technologies have been widely explored to enhance the load-bearing capacity of soft robotic fingers. However, these technologies typically require layer-jamming units to be attached to soft actuators as additional components, complicating structural design and limiting flexible multiconfiguration actuation. To address these problems, we propose an autologous variable stiffness soft finger (AVSSF) that integrates the cross-layer jamming joints (CLJJs) as both the variable stiffness unit and the finger itself, actuated by tendons. This design ensures a simple and compact structure while offering a wide range of stiffness adjustment. Additionally, through the coordinated actuation of tendons and the controlled jamming and unjamming of the CLJJ, the AVSSF achieves multiple configurations, providing flexibility in operation. Experimental results demonstrate that the stiffness of the AVSSF increased by a factor of 79.5 after modulation. We develop a two-finger gripper based on this design, capable of executing four distinct grasping modes: power grasp, adaptive pinch, expansion grasp, and hook. This gripper demonstrates adaptability to various objects and environments, highlighting its potential for diverse applications. In summary, this work offers insights into the advancement of autologous variable stiffness technologies and the development of multimodal grasping capabilities.
|
|
15:15-15:20, Paper ThCT25.4 | |
CPG-Based Manipulation with Multi-Module Origami Robot Surface |
|
Jiang, Yuhao | Ecole Polytechnique Federale De Lausanne |
El Asmar, Serge | École Polytechnique Fédérale De Lausanne |
Wang, Ziqiao | EPFL |
Demirtas, Serhat | EPFL |
Paik, Jamie | Ecole Polytechnique Federale De Lausanne |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Multi-Robot Systems
Abstract: Robotic manipulators often face challenges in handling objects of different sizes and materials, limiting their effectiveness in practical applications. This issue is particularly pronounced when manipulating meter-scale objects or those with varying stiffness, as traditional gripping techniques and strategies frequently prove inadequate. In this letter, we introduce a novel surface-based multi-module robotic manipulation framework that utilizes a Central Pattern Generator (CPG)-based motion generator, combined with a simulation-based optimization method to determine the optimal manipulation parameters for a multi-module origami robotic surface (Ori-Pixel). This approach allows for the manipulation of objects ranging from centimeters to meters in size, with varying stiffness and shape. The optimized CPG parameters are tested through both dynamic simulations and a series of prototype experiments involving a wide range of objects differing in size, weight, shape, and material, demonstrating robust manipulation capabilities.
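To make the CPG idea concrete, the sketch below runs a small ring of phase oscillators with prescribed phase offsets, the kind of pattern generator that could drive a set of actuated modules. The oscillator model, coupling gains, frequencies, and offsets are illustrative assumptions, not the Ori-Pixel parameters reported in the paper.

```python
# Toy CPG: a ring of coupled phase oscillators converging to fixed phase offsets.
import numpy as np

n, dt = 4, 0.01                                  # four modules, 10 ms step
omega = 2 * np.pi * 1.0                          # intrinsic frequency (1 Hz)
k_c = 4.0                                        # coupling gain
phi_des = np.array([0.0, np.pi/2, np.pi, 3*np.pi/2])   # desired phase offsets
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, n)             # random initial phases

traj = []
for _ in range(1000):
    for i in range(n):
        j = (i + 1) % n                          # couple each module to its ring neighbour
        theta[i] += (omega + k_c * np.sin(theta[j] - theta[i]
                     - (phi_des[j] - phi_des[i]))) * dt
    traj.append(np.sin(theta).copy())            # per-module actuation command in [-1, 1]
traj = np.array(traj)                            # shape (1000, 4): phase-locked rhythms
print(traj[-1])
```

In such a framework, the offsets, amplitudes, and frequency would be the parameters handed to a simulation-based optimizer.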
|
|
15:20-15:25, Paper ThCT25.5 | |
Funabot-Sleeve: A Wearable Device Employing McKibben Artificial Muscles for Haptic Sensation in the Forearm |
|
Peng, Yanhong | Nagoya University |
Sakai, Yusuke | Nagoya University |
Funabora, Yuki | Nagoya University |
Yokoe, Kenta | Nagoya University |
Aoyama, Tadayoshi | Nagoya University |
Doki, Shinji | Nagoya University |
Keywords: Soft Robot Applications, Haptics and Haptic Interfaces, Wearable Robotics
Abstract: Haptic feedback systems play a critical role in enriching the user experience in human-robot interaction. However, existing devices designed for evoking haptic sensations often face limitations owing to their low degree of freedom of deformation. In this study, we introduce the Funabot-Sleeve, a haptic device based on McKibben artificial muscles, and investigate its potential to evoke a range of haptic sensations using both steady-state and transient air pressure patterns. Our investigation examines the influence of these patterns on evoking distinct haptic sensations and identifies four specific sensations that can be evoked: Embraced, Pinched, a combination of Embraced and Pressed, and Twisted sensations. Across all participants, the evoked sensations showed positive correlations, with most correlations exceeding a value of 0.4, indicating a high degree of agreement in the sensations felt by the subjects. Our research lays the groundwork for the design of fabric actuators, capable of replicating specific stimuli and skin surface effects, thereby enabling a more sophisticated and personalized haptic feedback experience.
|
|
15:25-15:30, Paper ThCT25.6 | |
Hybrid Tendon-Actuated and Soft Magnetic Robotic Platform for Pancreatic Applications |
|
Calmé, Benjamin | STORM Lab, School of Electronic and Electrical Engineering, Univ |
Metcalf, Adam | University of Leeds |
Brockdorff, Michael | University of Leeds |
Jang, Haneul | Ewha Womans University |
Choi, Yoonsue | Ewha Womans University |
Lloyd, Peter Robert | University of Leeds |
Ryu, Seok Chang | Ewha Womans University |
Valdastri, Pietro | University of Leeds |
Keywords: Soft Robot Applications, Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems
Abstract: Magnetic Soft Continuum Robots (MSCR) are used in a wide variety of surgical interventions, including neurological, pancreatic, and cardiovascular procedures. To function effectively, these MSCRs require complex programmable magnetisation. However, they often suffer from limited manoeuvrability and imprecise positioning of the devices that carry them. Tendon-Driven Continuum Robots (TDCR) have the potential to address these issues. These navigation systems not only enable higher accuracy and precision but also offer the potential for remote control, thereby reducing clinicians' exposure to ionising radiation. Currently, MSCRs are deployed from manual flexible endoscopes without motion compensation, leading to uncertainty and trial-and-error insertion. In this study, the deployment of high aspect ratio MSCRs (60 mm long by 1.3 mm diameter) from a tendon-driven robot (25 cm long with a 2.8 mm diameter) is performed. By precisely positioning the deployment point, this paper evaluates the benefits of different magnetisation profiles. The comparison is carried out for a specific clinical scenario, assessing procedure time, the distance between the external permanent magnet (used for steering) and the MSCR, and the interaction force with the tissue. The clinical relevance is demonstrated through pancreatic and bile duct cannulation in a silicon phantom.
|
|
15:30-15:35, Paper ThCT25.7 | |
Plant Mobile Robot Using Mimosa Pudica |
|
Sato, Misao | The University of Electro-Communications |
Murakami, Kazuya | The University of Electro-Communications |
Ishizaka, Tomoko | The University of Electro-Communications |
Sato, Asako | RIKEN |
Tanaka, Yo | RIKEN |
Shintake, Jun | The University of Electro-Communications |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Soft Sensors and Actuators
Abstract: Plants respond physically to external stimuli (such as light and electricity), and these stimuli-responsive physical behaviors facilitate plants being used as actuators for robotic systems. However, achieving robot mobility through plants remains challenging. Moreover, there is a lack of quantitative knowledge of the actuation characteristics of plants. In this study, to achieve the mobility of robots by plants, we employed Mimosa pudica as the target plant and investigated its actuation characteristics. Specifically, we focused on a specific part of the plant called the pinnule, which is a small leaf that exhibits the motions of closing and opening. The experimental results revealed that the average closing and opening times over the tested voltage were 4.5 s and 798 s, respectively. The measured force of the pinnule was up to 0.19 mN, which corresponded to a power density of 0.16 × 10^-3 W/kg. We then designed and fabricated a mobile robot that could locomote on the water surface by exploiting the rowing movement of the pinnules. The experimental results indicated that the robot (mass 0.5 g) was able to move on the water surface in response to the voltage input, exhibiting speeds of up to 3.3 × 10^2 μm/s and a thrust force of up to 52.3 μN. These values were in good agreement with the model predictions. The results of this study further promote the integration of plants into robots, advancing the development of sustainable, environmentally friendly robotics.
|
|
ThCT26 |
103B |
Distributed Robot Systems |
Regular Session |
Chair: Nikolakopoulos, George | Luleå University of Technology |
|
15:00-15:05, Paper ThCT26.1 | |
Decentralized Cooperative Localization: A Communication-Efficient Dual-Fusion Consistent Approach |
|
Hao, Ning | Harbin Institute of Technology |
He, Fenghua | Harbin Institute of Technology |
Hou, Yi | Harbin Institute of Technology |
Wan peng, Song | China Construction Third Engineering Bureau Group Co., Ltd |
Dong, Xu | China Construction Third Engineering Bureau Group Co., Ltd |
Yao, Yu | Harbin Institute of Technology |
Keywords: Distributed Robot Systems, Multi-Robot Systems, Localization
Abstract: Decentralized cooperative localization poses significant challenges in managing inter-robot correlations, especially in environments with limited communication capacity and unreliable network connectivity. In this paper, we propose a communication-efficient decentralized consistent cooperative localization approach with almost minimal requirements for storage, communication, and network connectivity. A dual-fusion framework is presented that integrates heterogeneous and homogeneous fusion. In this framework, each robot only tracks its own local state and exchanges local estimates with its neighboring robots that possess relative measurements. In the heterogeneous fusion stage, we present an MAP-based decentralized fusion approach to fuse prior estimates of multiple heterogeneous states received from neighboring observed robots and nonlinear measurements in the presence of unknown cross-correlations. In the homogeneous fusion stage, the estimates from neighboring observing robots are further fused based on the CI technique, fully exploiting all available information and thus yielding better estimation results. The proposed algorithm is proved to be consistent. Extensive Monte Carlo simulations and real-world experiments demonstrate that our approach outperforms state-of-the-art methods.
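The homogeneous fusion stage relies on covariance intersection (CI), which fuses estimates whose cross-correlation is unknown. The sketch below implements plain two-estimate CI with a coarse grid search over the weight; the paper's full dual-fusion pipeline (MAP-based heterogeneous stage, communication handling) is not reproduced, and the example numbers are arbitrary.

```python
# Covariance intersection of two estimates with unknown cross-correlation.
import numpy as np

def covariance_intersection(x1, P1, x2, P2):
    best = None
    for w in np.linspace(0.01, 0.99, 99):             # coarse search over the CI weight
        P_inv = w * np.linalg.inv(P1) + (1 - w) * np.linalg.inv(P2)
        P = np.linalg.inv(P_inv)
        if best is None or np.trace(P) < np.trace(best[1]):
            x = P @ (w * np.linalg.inv(P1) @ x1 + (1 - w) * np.linalg.inv(P2) @ x2)
            best = (x, P)
    return best

x1, P1 = np.array([1.0, 2.0]), np.diag([0.5, 1.0])
x2, P2 = np.array([1.2, 1.8]), np.diag([1.0, 0.3])
x_f, P_f = covariance_intersection(x1, P1, x2, P2)
print(x_f, np.diag(P_f))
```

CI guarantees a consistent (non-overconfident) fused covariance regardless of the true correlation, which is why it suits decentralized settings where correlations cannot be tracked.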
|
|
15:05-15:10, Paper ThCT26.2 | |
Robust Robotic Assembly of Reusable, Rectangular Blocks |
|
Huang, Zhongming | Cornell University |
Yao, Hongyu | Cornell University |
Peng, Haocheng | Cornell University |
Lin, Shih-ming | Cornell University |
Petersen, Kirstin Hagelskjaer | Cornell University |
Napp, Nils | Cornell University |
Keywords: Distributed Robot Systems, Multi-Robot Systems
Abstract: This paper investigates the importance and design implications for use of rectangular blocks in collective robotic construction systems with distributed control. Specifically, we introduce an automated solver for optimizing the overlaps in user-specified structures; a new robot design capable of manipulating, fastening, and climbing over blocks the width of the robot; detailed analysis of robot primitives and demonstration of rectilinear, curved, cantilever, and corbeled arch structures; and results from a physics simulator showing how overlaps improve structural integrity when the depositions are noisy. This work represents an important step towards efficient and versatile large-scale robotic construction.
|
|
15:10-15:15, Paper ThCT26.3 | |
Find Everything: A General Vision Language Model Approach to Multi-Object Search |
|
Choi, Daniel | University of Toronto |
Fung, Angus | University of Toronto |
Wang, Haitong | University of Toronto |
Tan, Aaron Hao | University of Toronto |
Keywords: Domestic Robotics, Service Robotics, AI-Enabled Robotics
Abstract: Efficient navigation and search in unknown environments for multiple objects is a fundamental challenge in robotics, particularly in applications such as warehouse management, domestic assistance, and search-and-rescue. The Multi-Object Search (MOS) problem involves navigating to a sequence of locations to maximize the likelihood of finding target objects while minimizing travel costs. In this paper, we introduce a novel approach to the MOS problem, called Finder, which leverages vision language models (VLMs) to locate multiple objects across diverse environments. Specifically, our approach introduces multi-channel score maps to track and reason about multiple objects simultaneously during navigation, along with a score map technique that combines scene-level and object-level semantic correlations. We validate our approach through extensive experiments in both simulated and real-world environments. The results demonstrate that Finder outperforms existing multi-object search methods using deep reinforcement learning and VLMs. Additional ablation and scalability studies highlight the importance of our design choices and show the system’s robustness with an increasing number of target objects. Website: https://find-all-my-things.github.io/
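A multi-channel score map can be pictured as one 2-D grid per target object, combined from a scene-level prior and an object-level semantic score. The sketch below uses synthetic arrays and a simple weighted combination, so the arrays, weights, and waypoint rule are assumptions; in Finder these scores come from a VLM.

```python
# Illustrative multi-channel score maps and next-waypoint selection.
import numpy as np

H, W, n_obj = 20, 20, 3
rng = np.random.default_rng(0)
scene_prior = rng.random((H, W))                 # e.g. room-type prior per cell
object_sim = rng.random((n_obj, H, W))           # e.g. object-scene semantic similarity
score_maps = 0.5 * scene_prior[None] + 0.5 * object_sim   # shape (n_obj, H, W)

visited = np.zeros((H, W), dtype=bool)

def next_waypoint(score_maps, visited):
    combined = score_maps.max(axis=0)            # most promising cell over all targets
    combined[visited] = -np.inf                  # never revisit explored cells
    return np.unravel_index(np.argmax(combined), combined.shape)

print(next_waypoint(score_maps, visited))
```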
|
|
15:15-15:20, Paper ThCT26.4 | |
RDMM: Enhancing Household Robotics with On-Device Contextual Memory and Decision Making |
|
Nasrat, Shady | Pusan National University |
Jo, Minseong | Pusan National University |
Lee, Seonil | Pusan National University |
Kim, Myungsu | Pusan National University |
Lee, Jiho | Pusan National University |
Jang, Yeoncheol | Pusan National University |
Yi, Seung-Joon | Pusan National University |
Keywords: Domestic Robotics, AI-Based Methods, Task Planning
Abstract: Large language models (LLMs) represent a significant advancement in integrating physical robots with AI-driven systems. In this research, we present a framework that leverages Robotics Decision-Making Models (RDMM) for decision-making in domain-specific contexts, enhancing robotic autonomy. This framework incorporates agent-specific knowledge representation, allowing robots to recall and utilize their capabilities and past experiences for improved decision-making. Unlike other approaches, our method prioritizes real-time, on-device solutions, successfully operating on hardware with as little as 8GB of memory. The framework integrates visual perception models, providing robots with a better understanding of their environment. Additionally, real-time speech recognition capabilities are included, improving the human-robot interaction experience. Experimental results show that the RDMM framework achieves planning accuracy of 93%. Furthermore, we introduce a novel dataset consisting of 27k planning instances and 1.3k annotated text-image samples, specifically curated from real-world robotic tasks in competition scenarios. The framework, benchmarks, datasets, and models developed in this work are publicly available on our project website at https://github.com/shadynasrat/RDMM.
|
|
15:20-15:25, Paper ThCT26.5 | |
LLM-Informed Iterative Planning for Object Search and Relocation in Indoor Environments |
|
Blounas, Taxiarchis-Foivos | Luleå University of Technology |
Saradagi, Akshit | Luleå University of Technology, Luleå, Sweden |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Domestic Robotics, AI-Enabled Robotics, Task Planning
Abstract: The process of object search and relocation in an indoor environment, while intuitive for humans, remains a complex challenge for robots. Enabling robots to perform this task autonomously could have a substantial impact on automation in both domestic and industrial settings. In this article, assuming a familiar environment, a set of target objects with their desired locations, and a robot with limited carrying capacity, we propose a novel methodology for object search and relocation. Given the human-like intuition exhibited by modern large language models (LLMs), they can be leveraged to guide object localization based on environmental context. Our approach integrates LLM-based prediction with graph-based path planning to create a human-like iterative search and relocation framework. The framework consists of an LLM predictor that suggests likely object locations (along with a likelihood score) and an adaptive path planner that dynamically updates the robot’s future path as new information becomes available during the search process. Prior relevant literature that employs LLM inference in indoor environments primarily focuses on assigning new or misplaced objects to appropriate locations. The aspect of enabling a search for a set of missing objects and planning their relocation to desired locations sets this article apart from prior literature. We compare our method to a patrol-based baseline with respect to the distance traversed by the robot in completing the search and relocation mission. In a medium-sized indoor environment, we demonstrate that it outperforms the baseline by 31.2% on average.
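The iterative "predict, search, relocate" loop can be sketched as below. The LLM predictor is mocked by a fixed likelihood table and the planner by visiting rooms in decreasing likelihood; all room names, scores, and object locations are illustrative assumptions, not the paper's prompts, graph planner, or maps.

```python
# Minimal mock of LLM-informed iterative object search and relocation.
likelihoods = {                                  # mocked LLM output: P(object | room)
    "mug":  {"kitchen": 0.7, "office": 0.2, "bedroom": 0.1},
    "keys": {"office": 0.5, "bedroom": 0.3, "kitchen": 0.2},
}
actual_location = {"mug": "office", "keys": "bedroom"}   # hidden ground truth
desired_location = {"mug": "kitchen", "keys": "office"}  # relocation targets

for obj, target in desired_location.items():
    # visit rooms in decreasing predicted likelihood until the object is found
    for room, score in sorted(likelihoods[obj].items(), key=lambda kv: -kv[1]):
        print(f"searching {room} for {obj} (likelihood {score:.1f})")
        if actual_location[obj] == room:
            print(f"  found {obj}; relocating it to {target}")
            break
```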
|
|
15:25-15:30, Paper ThCT26.6 | |
Edge Accelerated Robot Navigation with Collaborative Motion Planning (I) |
|
Li, Guoliang | University of Macau |
Han, Ruihua | University of Hong Kong |
Wang, Shuai | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Gao, Fei | Zhejiang University |
Eldar, Yonina | Weizmann Institute of Science |
Xu, Chengzhong | University of Macau |
Keywords: Distributed Robot Systems, Motion and Path Planning
Abstract: Low-cost distributed robots suffer from limited onboard computing power, resulting in excessive computation time when navigating in cluttered environments. This paper presents Edge Accelerated Robot Navigation (EARN), which achieves real-time collision avoidance by adopting collaborative motion planning (CMP). As such, each robot can dynamically switch between a conservative motion planner executed locally to guarantee safety (e.g., path-following) and an aggressive motion planner executed non-locally to guarantee efficiency (e.g., overtaking). In contrast to existing motion planning approaches that ignore the interdependency between low-level motion planning and high-level resource allocation, EARN adopts model predictive switching (MPS) that maximizes the expected switching gain with respect to robot states and actions under computation and communication resource constraints. The MPS problem is solved by a tightly-coupled decision making and motion planning framework based on bilevel mixed-integer nonlinear programming and penalty dual decomposition. We validate the performance of EARN in indoor simulation, outdoor simulation, and real-world environments. Experiments show that EARN achieves significantly smaller navigation time and higher success rates than state-of-the-art navigation approaches.
|
|
15:30-15:35, Paper ThCT26.7 | |
ACoL: From Abstractions to Grounded Languages for Robust Coordination of Task Planning Robots |
|
Zhang, Yu (Tony) | Arizona State University |
Keywords: Distributed Robot Systems, Multi-Robot Systems, Task Planning
Abstract: In this paper, we consider a first step to bridge a gap in coordinating task planning robots. Specifically, we study the automatic construction of languages that are maximally flexible while being sufficiently explicative for coordination. To this end, we view language as a machinery for specifying temporal-state constraints of plans. Such a view enables us to reverse-engineer a language from the ground up by mapping these composable constraints to words. Our language expresses a plan for any given task as a "plan sketch" to convey just-enough details while maximizing the flexibility to realize it, leading to robust coordination with optimality guarantees among other benefits. We formulate and analyze the problem, provide approximate solutions, and validate our approach under various scenarios to shed light on its applications.
|
|
15:35-15:40, Paper ThCT26.8 | |
Cooperative Moving Target Fencing Control for Two-Layer UAVs with Relative Measurements (I) |
|
Zhao, Shulong | National University of Defense Technology |
Zheng, Jiayi | National University of Defense Technology |
Liu, Kun | National University of Defense Technology |
Liu, Jun | National University of Defense Technology |
Wang, Xiangke | National University of Defense Technology |
Keywords: Distributed Robot Systems, Cooperating Robots, Multi-Robot Systems
Abstract: This paper investigates a two-layer distributed control protocol for multiple unmanned aerial vehicles (UAVs) to fence a moving target cooperatively, using relative measurements. The multi-UAVs are divided into two layers: one is equipped with target detection sensors that can acquire relative bearing information; the other is equipped with relative distance sensors that can only acquire relative positions of neighbors. First, a bearing-based controller is presented for the leader layer to eliminate the singularity problem of observability of a moving target. Then, a distance-based controller is developed to fence the moving target within a convex hull at every moment for the follower layer. The relative spacing within the formation does not need to be given in advance, and the rotation and scaling of the formation can be quickly adjusted according to the design parameters. Third, we prove sufficient conditions for the design of the formation configuration, together with the selection of the fencing scale. The fencing protocol can be directly extended to 3-D space, and the form of the controller for 3-D scenarios is given as well. Finally, two numerical simulations and a hardware-in-loop (HIL) experiment verify the effectiveness of the proposed protocol.
|
|
ThCT27 |
103C |
Factory Automation and Failure Detection |
Regular Session |
|
15:00-15:05, Paper ThCT27.1 | |
Industry 6.0: New Generation of Industry Driven by Generative AI and Swarm of Heterogeneous Robots |
|
Lykov, Artem | Skolkovo Institute of Science and Technology |
Altamirano Cabrera, Miguel | Skolkovo Institute of Science and Technology (Skoltech), Moscow, |
Konenkov, Mikhail | Skolkovo Institute of Science and Technology |
Serpiva, Valerii | Skolkovo Institute of Science and Technology |
Gbagbe, Koffivi Fidele | Skolkovo Institute of Science and Technology |
Alabbas, Ali | Atlantic Technological University |
Fedoseev, Aleksey | Skolkovo Institute of Science and Technology |
Moreno, Luis | Skoltech |
Khan, Muhammad Haris | Intelligent Space Robotics Laboratory, Skoltech |
Guo, Ziang | Skolkovo Institute of Science and Technology |
Tsetserukou, Dzmitry | Skolkovo Institute of Science and Technology |
Keywords: Cognitive Control Architectures, Factory Automation, Cooperating Robots
Abstract: This paper presents the concept of Industry 6.0, which introduces the world's first fully automated production system that autonomously handles the entire product design and manufacturing process based on user-provided natural language descriptions. By leveraging generative AI, the system automates critical aspects of production, including product blueprint design, component manufacturing, logistics, and assembly. A heterogeneous swarm of robots, each equipped with individual AI through integration with Large Language Models (LLMs), orchestrates the production process. The robotic system includes manipulator arms, delivery drones, and 3D printers capable of generating assembly blueprints. The system was evaluated using commercial and open source LLMs, functioning through APIs and local deployment. A user study demonstrated that the system reduces the average production time to 119.10 minutes, significantly outperforming a team of expert human developers, who averaged 528.64 minutes (an improvement factor of 4.4). Furthermore, in the product blueprinting stage, the system surpassed human CAD operators by an unprecedented factor of 47, completing the task in 0.5 minutes compared to 23.5 minutes. This breakthrough represents a major leap towards fully autonomous manufacturing.
|
|
15:05-15:10, Paper ThCT27.2 | |
SLU-DQN: A Model for Anticipatory Steam Detection for Steamer-Filling in Baijiu Intelligent Distillation Systems |
|
Jia, Yu | Fudan University |
Ren, Jiankun | Fudan University |
Liang, Hanwen | Fudan University |
Wang, Chen | Luzhou Laojiao CO., LTD |
Qi, Lizhe | Fudan University |
Sun, Yunquan | Fudan University |
Keywords: Factory Automation, Industrial Robots, Computer Vision for Automation
Abstract: The true implementation of the Anticipatory Steam Detection for Steamer-Filling (ASDSF) process in baijiu intelligent distillation systems, which involves predicting and precisely spreading distillers' grains before steam emerges, remains a critical unresolved challenge. In this study, we introduce the SLU model, which utilizes SwinLSTM as the core feature extraction module and adopts a U-shaped structure. This model achieves spatiotemporal feature extraction and dynamic change prediction. It is further enhanced by integrating a U-Net module for multi-scale feature fusion and optimized through a Deep Q-Network (DQN)-based decision-making process. The SLU-DQN model, specifically designed for anticipatory material spreading planning in the baijiu Steamer-Filling (SF) distillation system, predicts future steam emission areas. Finally, both quantitative and qualitative experimental results demonstrate the excellent performance of the SLU-DQN model in solving the ASDSF problem. The model achieved 91.1% reward accuracy, an F1-Score of 91% for material spreading point prediction, an MSE of 19.02, and an SSIM of 95.8%. These results not only highlight the model's superior accuracy in predicting future steam emission areas but also provide a significant technical breakthrough for intelligent baijiu distillation systems, filling a crucial gap in the field.
|
|
15:10-15:15, Paper ThCT27.3 | |
Dynamic Network Topology Analysis, Design, and Evaluation for Multi-Robot Vehicle Transfer in High-Density Storage Yards |
|
Zhang, Lin | Beijing Institute of Technology |
Cai, Qiyu | Beijing Institute of Technology |
Bao, Runjiao | Beijing Institute of Technology |
Niu, Tianwei | Beijing Institute of Technology |
Xu, Yongkang | Beijing Institute of Technology |
Si, Jinge | Beijing Institute of Technology |
Wang, Shoukun | Beijing Institute of Technology |
Wang, Junzheng | Beijing Institute of Technology |
|
|
15:15-15:20, Paper ThCT27.4 | |
Joint Optimization of Multi-Agent Task Allocation and Path Planning for Continuous Pickup and Delivery Tasks |
|
Fan, Hongkai | Hunan University |
Ouyang, Bo | Hunan University |
Xie, Qinjing | Hunan University |
Wang, Yaonan | Hunan University |
Yan, Zhi | Hunan University |
He, Jiawen | Hunan University |
Tan, Qin | China Mobile Group Hunan Company Limited |
Keywords: Factory Automation, Path Planning for Multiple Mobile Robots or Agents, Planning, Scheduling and Coordination
Abstract: The multi-agent pickup and delivery problem is central to coordinating multiple agents in real-world applications such as warehouse automation, urban logistics, and robotic delivery networks, where efficient task assignment and pathfinding are vital for maximizing production efficiency. However, existing approaches often struggle to seamlessly integrate task allocation with path planning while also failing to address the demands of continuous pickup and delivery tasks, resulting in suboptimal performance and limited scalability in dynamic environments. To address these problems, we first introduce a novel task allocation approach, which constructs a cost matrix to satisfy pickup and delivery timing constraints for tasks and employs a Mixed-Integer Linear Programming (MILP) model to compute a task assignment matrix queue. Next, the CBS-TAPF framework is proposed, which constructs search forests for tasks and paths to address the joint optimization of task allocation and path planning. This framework is further extended to Continuous Multi-Agent Pickup and Delivery (CMAPD) tasks by dynamically updating the task allocation matrix queue, enhancing robustness and adaptability for real-world, sustained scenarios. Finally, through simulation and real-world experiments, we validated the effectiveness of the proposed methods. The experimental results demonstrate its superiority across diverse environments, ensuring robust performance in various operational scenarios.
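The task-allocation step starts from a cost matrix built from pickup and delivery travel costs. As a simplified stand-in for the paper's MILP with timing constraints, the sketch below builds such a matrix from Euclidean distances and solves a one-shot assignment with the Hungarian algorithm; positions and costs are illustrative.

```python
# Simplified pickup-and-delivery task assignment from a cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

agents = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]])
pickups = np.array([[1.0, 4.0], [6.0, 6.0], [8.0, 2.0]])
deliveries = np.array([[3.0, 9.0], [0.0, 7.0], [9.0, 9.0]])

# cost[a, t] = agent->pickup distance + pickup->delivery distance for task t
cost = (np.linalg.norm(agents[:, None] - pickups[None], axis=2)
        + np.linalg.norm(pickups - deliveries, axis=1)[None])
rows, cols = linear_sum_assignment(cost)
for a, t in zip(rows, cols):
    print(f"agent {a} -> task {t}, cost {cost[a, t]:.2f}")
```

In a continuous setting, this assignment would be recomputed as new tasks arrive, which is the role of the dynamically updated task allocation matrix queue described above.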
|
|
15:20-15:25, Paper ThCT27.5 | |
Multimodal Anomaly Detection with a Mixture-Of-Experts |
|
Willibald, Christoph | German Aerospace Center (DLR) |
Sliwowski, Daniel | TU Wien |
Lee, Dongheui | Technische Universität Wien (TU Wien) |
Keywords: Failure Detection and Recovery, Learning from Demonstration, Sensor Fusion
Abstract: With a growing number of robots being deployed across diverse applications, robust multimodal anomaly detection becomes increasingly important. In robotic manipulation, failures typically arise from (1) robot-driven anomalies due to an insufficient task model or hardware limitations, and (2) environment-driven anomalies caused by dynamic environmental changes or external interferences. Conventional anomaly detection methods focus either on the first by low-level statistical modeling of proprioceptive signals or the second by deep learning-based visual environment observation, each with different computational and data requirements. To effectively capture anomalies from both sources, we propose a mixture-of-experts framework that integrates the complementary detection mechanisms with a visual-language model for environment monitoring and a Gaussian-mixture regression-based detector for tracking deviations in interaction forces and robot motions. We introduce a confidence-based fusion mechanism that dynamically selects the most reliable detector for each situation. We evaluate our approach on both household and industrial tasks using two robotic systems, demonstrating a 60% reduction in detection delay while improving frame-wise anomaly detection performance compared to individual detectors.
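The confidence-based fusion can be reduced to a per-frame selection between two experts. The sketch below is only an analogy: the scores are random numbers rather than outputs of a visual-language model or a Gaussian-mixture regression detector, and the confidence definition and threshold are assumptions.

```python
# Toy confidence-based fusion of two anomaly detectors.
import numpy as np

def fuse(score_env, conf_env, score_robot, conf_robot, threshold=0.5):
    """Each score in [0, 1] is an anomaly probability; conf in [0, 1]."""
    score = score_env if conf_env >= conf_robot else score_robot   # trust the more confident expert
    return score > threshold

rng = np.random.default_rng(0)
for t in range(5):
    s_env, c_env = rng.random(), rng.random()
    s_rob, c_rob = rng.random(), rng.random()
    print(t, fuse(s_env, c_env, s_rob, c_rob))
```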
|
|
15:25-15:30, Paper ThCT27.6 | |
Failure Forecasting Boosts Robustness of Sim2Real Rhythmic Insertion Policies |
|
Liu, Yuhan | Rutgers University |
Zhang, Xinyu | Rutgers University |
Chang, Haonan | Rutgers University |
Boularias, Abdeslam | Rutgers University |
Keywords: Failure Detection and Recovery, Assembly
Abstract: This paper addresses the challenges of Rhythmic Insertion Tasks (RIT), where a robot must repeatedly perform high-precision insertions, such as screwing a nut into a bolt with a wrench. The inherent difficulty of RIT lies in achieving millimeter-level accuracy and maintaining consistent performance over multiple repetitions, particularly when factors like nut rotation and friction introduce additional complexity. We propose a sim-to-real framework that integrates a reinforcement learning-based insertion policy with a failure forecasting module. By representing the wrench’s pose in the nut’s coordinate frame rather than the robot’s frame, our approach significantly enhances sim-to-real transferability. The insertion policy, trained in simulation, leverages real-time 6D pose tracking to execute precise alignment, insertion, and rotation maneuvers. Simultaneously, a neural network predicts potential execution failures, triggering a simple recovery mechanism that lifts the wrench and retries the insertion. Extensive experiments in both simulated and real-world environments demonstrate that our method not only achieves a high one-time success rate but also robustly maintains performance over long-horizon repetitive tasks.
|
|
15:30-15:35, Paper ThCT27.7 | |
CCDP: Composition of Conditional Diffusion Policies with Guided Sampling |
|
Razmjoo, Amirreza | Idiap Research Institute |
Calinon, Sylvain | Idiap Research Institute |
Gienger, Michael | Honda Research Institute Europe |
Zhang, Fan | Honda Research Institute EU |
Keywords: Failure Detection and Recovery, Learning from Demonstration, Motion and Path Planning
Abstract: Imitation learning offers a promising approach in robotics by enabling systems to learn directly from data without requiring explicit models, simulations, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling step until a successful action is obtained can be inefficient. In this work, we propose an enhanced sampling strategy that refines the sampling distribution to avoid previously unsuccessful actions. We demonstrate that by solely utilizing data from successful executions, our method can infer recovery actions without the need for additional simulation or exploratory behavior. Furthermore, we leverage the concept of diffusion model decomposition to break down the primary problem—which may require long-horizon history to manage failures—into multiple smaller, more manageable sub-problems in learning, data collection, and inference, thereby enabling the system to adapt to variable failure counts. Our approach yields a low-level controller that dynamically adjusts its sampling space to improve efficiency when prior samples fall short. We validate our method across several tasks, including door opening with unknown directions, object manipulation, and button-searching scenarios, demonstrating that our approach outperforms traditional baselines.
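The "avoid previously unsuccessful actions" idea can be caricatured as resampling candidate actions and rejecting those close to past failures. The real CCDP composes conditional diffusion policies; in this sketch the learned distribution is a plain Gaussian and the rejection radius is an arbitrary assumption.

```python
# Toy refined sampling: down-weight candidates near previously failed actions.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = np.array([0.0, 0.0]), 0.3            # stand-in for the learned action distribution
failed = [np.array([0.1, -0.05])]                # actions that failed on earlier attempts

def sample_action(n=256, radius=0.15):
    cand = rng.normal(mu, sigma, size=(n, 2))
    ok = np.ones(n, dtype=bool)
    for f in failed:
        ok &= np.linalg.norm(cand - f, axis=1) > radius   # reject near known failures
    cand = cand[ok]
    return cand[0] if len(cand) else rng.normal(mu, sigma, size=2)

print(sample_action())
```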
|
|
15:35-15:40, Paper ThCT27.8 | |
On the Calibration, Fault Detection and Recovery of a Force Sensing Device |
|
Zhang, Yifang | Istituto Italiano Di Tecnologia |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Keywords: Failure Detection and Recovery, Force and Tactile Sensing
Abstract: Ground reaction force information, which includes the location of the center of pressure (COP) and the vertical ground reaction force (vGRF), has various applications, such as the gait assessment of patients post-injury or the control of robotic prostheses and exoskeleton devices. At the beginning of this work, we introduce a newly developed force-sensing device for measuring the COP and vGRF. Then, a model-free calibration method is proposed, leveraging Gaussian process regression (GPR) to extract COP and vGRF from raw sensor data. This approach yields remarkably low normalized root mean squared errors (NRMSEs) of 0.029 and 0.020 for COP in the mediolateral and anteroposterior directions, respectively, and 0.024 for vGRF. However, in general, learning-based calibration methods are sensitive to abnormal readings from sensing elements. To improve the robustness of the measurement, a GPR-based fault detection network is outlined for evaluating the sensing state under faults in individual sensing elements of the force-sensing device. Moreover, a GPR-based recovery method is proposed to restore the sensing device's function under fault conditions. In validation experiments, the effect of the scale factor of the threshold in the fault detection network is experimentally analyzed. The fault detection network achieves over a 90% success rate with an average detection delay of less than 5 seconds when the scale factor is between 1.68 and 1.90. The engagement of GPR-based recovery models under fault conditions demonstrates a substantial enhancement in COP (up to 85.0% improvement) and vGRF (up to 84.8% improvement) estimation accuracy.
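A minimal version of GPR-based calibration maps raw sensing-element readings to a force target with a standard Gaussian process regressor. The sketch below uses synthetic data, a single output (vGRF only), and an off-the-shelf scikit-learn kernel, all of which are assumptions rather than the paper's device, kernel, or multi-output setup.

```python
# Minimal GPR calibration sketch: raw readings -> vGRF on synthetic data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))                         # raw readings of 4 sensing elements
y = 50 * X[:, 0] + 30 * X[:, 1] ** 2 + rng.normal(0, 0.5, 200)   # synthetic vGRF [N]

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-2),
                               normalize_y=True).fit(X[:150], y[:150])
pred, std = gpr.predict(X[150:], return_std=True)
rmse = np.sqrt(np.mean((pred - y[150:]) ** 2))
print(f"test RMSE: {rmse:.2f} N")
```

The predictive standard deviation returned alongside each estimate is also what makes GPR a natural basis for the fault detection thresholding described in the abstract.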
|
|
ThCT28 |
104 |
Rehabilitation Robotics 3 |
Regular Session |
Co-Chair: Li, Xiang | Tsinghua University |
|
15:00-15:05, Paper ThCT28.1 | |
Evaluating Computational Approaches to Metabolic Cost Estimation in Gait Assistance with a Passive Exosuit |
|
Firouzi, Vahid | Technical University of Darmstadt |
von Stryk, Oskar | Technische Universität Darmstadt |
Seyfarth, Andre | TU Darmstadt |
Song, Seungmoon | Northeastern |
Sharbafi, Maziar | Technische Universität Darmstadt |
Keywords: Modeling and Simulating Humans, Prosthetics and Exoskeletons, Wearable Robotics
Abstract: Lower limb exoskeletons and exosuits have shown promise in augmenting human physical capabilities, with applications ranging from rehabilitation to performance enhancement. Accurate evaluation of their impact on metabolic energy expenditure is crucial for optimizing design and control strategies. While experimental measurement of metabolic cost via indirect calorimetry provides direct assessment, it is often impractical outside laboratory settings. Computational models offer an alternative, but their effectiveness in predicting metabolic cost changes induced by assistive devices remains underexplored. This study investigates the impact of incorporating different levels of complexity and sensory information, as well as various metabolic cost models, on estimating muscle metabolic cost during walking with a passive biarticular thigh exosuit. We compare three modeling approaches: joint-space dynamics, musculoskeletal simulation with effort minimization, and EMG-informed musculoskeletal simulation, each employing several metabolic models. Results show that EMG-informed musculoskeletal simulation, particularly using the Uchida (2016) metabolic model, provides the highest accuracy in predicting metabolic cost changes. Musculoskeletal simulation with effort minimization also shows promise, offering a viable alternative without the need for EMG data. These findings highlight the potential of computational models in evaluating and optimizing assistive devices.
|
|
15:05-15:10, Paper ThCT28.2 | |
STG-Avatar: Animatable Human Avatars Via Spacetime Gaussian |
|
Jiang, Guangan | Dalian University of Technology |
Zhang, Tianzi | Fudan University |
Li, Dong | University of Macau |
Zhao, Zhenjun | The Chinese University of Hong Kong |
Li, Haoang | Hong Kong University of Science and Technology (Guangzhou) |
Li, Mingrui | Dalian University of Technology |
Wang, Hongyu | Dalian University of Technology |
Keywords: Modeling and Simulating Humans, Human and Humanoid Motion Analysis and Synthesis, Simulation and Animation
Abstract: Realistic animatable human avatars from monocular videos are crucial for advancing human-robot interaction and enhancing immersive virtual experiences. While recent research on 3DGS-based human avatars has made progress, it still struggles with accurately representing detailed features of non-rigid objects (e.g., clothing deformations) and dynamic regions (e.g., rapidly moving limbs). To address these challenges, we present STG-Avatar, a 3DGS-based framework for high-fidelity animatable human avatar reconstruction. Specifically, our framework introduces a rigid-nonrigid coupled deformation framework that synergistically integrates Spacetime Gaussians (STG) with linear blend skinning (LBS). In this hybrid design, LBS enables real-time skeletal control by driving global pose transformations, while STG complements it through spacetime-adaptive optimization of 3D Gaussians. Furthermore, we employ optical flow to identify high-dynamic regions and guide the adaptive densification of 3D Gaussians in these regions. Experimental results demonstrate that our method consistently outperforms state-of-the-art baselines in both reconstruction quality and operational efficiency, achieving superior quantitative metrics while retaining real-time rendering capabilities.
|
|
15:10-15:15, Paper ThCT28.3 | |
Would You Let a Humanoid Play Storytelling with Your Child? a Usability Study on LLM-Powered Narrative Human-Robot Interaction |
|
Lombardi, Maria | Italian Institute of Technology |
Calabrese, Carmela | Italian Institute of Technology |
Ghiglino, Davide | Istituto Italiano Di Tecnologia |
Foglino, Caterina | Istituto Italiano Di Tecnologia |
De Tommaso, Davide | Istituto Italiano Di Tecnologia |
Da Lisca, Giulia | Istituto Italiano Di Tecnologia |
Natale, Lorenzo | Istituto Italiano Di Tecnologia |
Wykowska, Agnieszka | Istituto Italiano Di Tecnologia |
Keywords: Multi-Modal Perception for HRI, Rehabilitation Robotics, Acceptability and Trust
Abstract: A key challenge in human-robot interaction research lies in developing robotic systems that can effectively perceive and interpret social cues, facilitating natural and adaptive interactions. In this work, we present a novel framework for enhancing the attention of the iCub humanoid robot by integrating advanced perceptual abilities to recognise social cues, understand surroundings through generative models, such as ChatGPT, and respond with contextually appropriate social behaviour. Specifically, we propose an interaction task implementing a narrative protocol (storytelling task) in which the human and the robot create a short imaginary story together, exchanging in turn cubes with creative images placed on them. To validate the protocol and the framework, experiments were performed to quantify the degree of usability and the quality of experience perceived by participants interacting with the system. Such a system can be beneficial in promoting effective human–robot collaborations, especially in assistance, education and rehabilitation scenarios where the social awareness and the robot responsiveness play a pivotal role.
|
|
15:15-15:20, Paper ThCT28.4 | |
UltraDP: Generalizable Carotid Ultrasound Scanning with Force-Aware Diffusion Policy |
|
Chen, Ruoqu | Tsinghua University |
Yan, Xiangjie | Tsinghua University |
Lv, Kangchen | Tsinghua University |
Huang, Gao | Tsinghua University |
Li, Zheng | The Chinese University of Hong Kong |
Li, Xiang | Tsinghua University |
Keywords: Medical Robots and Systems, Physical Human-Robot Interaction, Learning from Experience
Abstract: Ultrasound scanning is a critical imaging technique for real-time, non-invasive diagnostics. However, variations in patient anatomy and complex human-in-the-loop interactions pose significant challenges for autonomous robotic scanning. Existing ultrasound scanning robots are commonly limited by relatively low generalization and inefficient data utilization. To overcome these limitations, we present UltraDP, a Diffusion-Policy-based method that receives multi-sensory inputs (ultrasound images, wrist camera images, contact wrench, and probe pose) and generates actions suited to the multi-modal action distributions encountered in autonomous ultrasound scanning of the carotid artery. We propose a specialized guidance module to enable the policy to output actions that center the artery in ultrasound images. To ensure stable contact and safe interaction between the robot and the human subject, a hybrid force-impedance controller is utilized to drive the robot to track such trajectories. Also, we have built a large-scale training dataset for carotid scanning comprising 210 scans with 460k sample pairs from 21 volunteers of both genders. By exploiting our guidance module and the diffusion policy's strong generalization ability, UltraDP achieves a 95% success rate in transverse scanning on previously unseen subjects, demonstrating its effectiveness.
|
|
15:20-15:25, Paper ThCT28.5 | |
Physical Human-Robot Collaboration-Assisted Acetabular Preparation for Total Hip Replacement Surgery |
|
Wang, Ziqi | University of Technology Sydney |
Li, Tiancheng | University of Technology Sydney |
Carmichael, Marc | Centre for Autonomous Systems |
Huang, Shoudong | University of Technology, Sydney |
Keywords: Medical Robots and Systems, Physically Assistive Devices, Human-Robot Collaboration
Abstract: When performing total hip replacement (THR) surgery, high-quality preparation of the acetabulum is critical as it contributes to the patient's recovery speed and the consistency of bone ingrowth. Conventionally, surgeons prepare the acetabulum manually by reaming it with a handheld electric drill and a reamer. This not only increases the surgeon's workload but, more importantly, makes it difficult to control the reaming depth and direction accurately. Utilizing an admittance-controlled (AC) collaborative robot (cobot) to enable physical human-robot collaboration (pHRC) offers a promising solution. For primitive AC, a compromise must be made between compliance and task accuracy. In this paper, we present a novel variable admittance control (VAC) design that considers the reactive force of bone while ensuring system passivity and stability during pHRC-assisted acetabular preparation. Qualitative results show that VAC was preferred by users over the conventional manual reaming method. Compared to other pHRC controls, quantitative results on users' energy consumption, reaming error and smoothness showed the proposed VAC achieved a balance between physical workload and acetabular quality. Compared to manual reaming, VAC reduced the reaming error by 67.47% and improved the final acetabulum's surface smoothness by 18.30%.
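For readers unfamiliar with admittance control, the following minimal sketch shows the generic idea of increasing the damping with the sensed reaction force, the kind of compliance/accuracy trade-off the paper's VAC addresses; the specific law and parameters are illustrative assumptions, not the paper's controller.
```python
# Illustrative variable-admittance step: M*dv/dt + D(f_bone)*v = f_ext,
# with damping that grows with the sensed bone reaction force.
import numpy as np

def admittance_step(v, f_human, f_bone, dt=0.001, m=2.0, d0=10.0, k_d=4.0):
    """v: commanded feed velocity (m/s); f_human, f_bone: forces (N)."""
    d = d0 + k_d * abs(f_bone)           # variable damping (assumed form)
    a = (f_human - f_bone - d * v) / m
    return v + a * dt

v = 0.0
for _ in range(1000):                     # 1 s of simulated reaming
    v = admittance_step(v, f_human=15.0, f_bone=5.0)
print(f"steady-state feed velocity ~ {v:.3f} m/s")
```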
|
|
15:25-15:30, Paper ThCT28.6 | |
High-Accuracy Early Recognition of Upper-Limb Motions for Exoskeleton-Assisted Mirror Rehabilitation |
|
Wang, Honggang | Harbin Institute of Technology, Harbin 150001, China |
Yao, Yufeng | Harbin Institute of Technology |
Lei, Huashuo | Harbin Institute of Technology, Weihai |
Shi, Yuxiao | Harbin Institute of Technology, Weihai |
Pei, Shuo | Harbin Institute of Technology |
Keywords: Modeling and Simulating Humans, Recognition, Rehabilitation Robotics
Abstract: Upper-limb early motion recognition (EMR) significantly enhances human-computer interaction and skill transfer in exoskeleton-assisted mirror rehabilitation. However, achieving early and accurate recognition of upper-limb motion remains a challenge, limiting the transfer of natural movements from the healthy to the affected side. To address these challenges and limitations, this study introduces a novel high-accuracy upper-limb EMR method and implements it within an exoskeleton-assisted mirror rehabilitation system. Specifically, a new architecture is designed to model and parameterise upper-limb motion, transforming it from a three-dimensional Cartesian coordinate system into a four-dimensional parametric space. The parameterised results are then evaluated with 11 algorithms on public datasets, including the P-BTBS dataset and Arm-CODA. Experimental results show that the proposed method achieves over 99% recognition accuracy for both full and the first 30% of upper-limb motion sequences while saving at least 80% of recognition time. Comparative analysis identifies RF, XGBoost, KNN, and deep learning as the most promising algorithms, with bidirectional encoder representations from transformers (BERT) pioneering advancements in upper-limb motion recognition. These findings indicate that the proposed architecture enables high-accuracy upper-limb EMR at the earliest possible stage (30%), offering a new paradigm for human-computer interaction, personalised medicine, and mirror motion rehabilitation.
|
|
15:30-15:35, Paper ThCT28.7 | |
S2-RTPIC: A State-Switching Remote Therapist Patient Interaction Control for Telerehabilitation (I) |
|
Yang, Ziyi | Jilin University |
Keywords: Rehabilitation Robotics, Telerobotics and Teleoperation, Physical Human-Robot Interaction
Abstract: The telerehabilitation robotic system has been envisioned as an alternative to conventional hospital-centered therapy because it offers convenient training and equal access to medical resources for patients in different areas. However, due to internet communication latency, realizing safe, stable, and biomechanics-perceptible remote therapist–patient interaction (RTPI) remains a significant challenge for the therapist-in-the-loop telerehabilitation (TILT) system. To address this issue, a novel position-position/stiffness (P-PK) telerehabilitation architecture was proposed in this article, which exchanges the position information of the therapist and patients and feeds back the reference stiffness of the patient's affected limb to the therapist's side. Furthermore, a novel state-switching RTPI control (S2-RTPIC) scheme is first presented for this P-PK architecture to induce the active participation of patients during online TILT training through variable-stiffness voluntary control, while allowing their biomechanical states to be synchronously perceived by the therapists over distance for teleassessment. The stability and transparency criteria of the S2-RTPIC scheme under asymmetric time delay conditions were comprehensively analyzed and theoretically proved. Experimental results showed that the proposed S2-RTPIC scheme can provide safe RTPI training with effective biomechanical perception and participation-inducing training assistance to facilitate teleassessment and telerehabilitation.
|
|
15:35-15:40, Paper ThCT28.8 | |
Towards Data-Driven Adaptive Exoskeleton Assistance for Post-Stroke Gait |
|
Weigend, Fabian Clemens | Harvard University |
Choe, Dabin | Harvard University / Wyss Institute |
Canete, Santiago | Harvard University |
Walsh, Conor James | Harvard University |
Keywords: Rehabilitation Robotics, Wearable Robotics, Physically Assistive Devices
Abstract: Recent work has shown that exoskeletons controlled through data-driven methods can dynamically adapt assistance to various tasks for healthy young adults. However, applying these methods to populations with neuromotor gait deficits, such as post-stroke hemiparesis, is challenging. This is due not only to high population heterogeneity and gait variability but also to a lack of post-stroke gait datasets to train accurate models. Despite these challenges, data-driven methods offer a promising avenue for control, potentially allowing exoskeletons to function safely and effectively in unstructured community settings. This work presents a first step towards enabling adaptive plantarflexion and dorsiflexion assistance from data-driven torque estimation during post-stroke walking. We trained a multi-task Temporal Convolutional Network (TCN) using data collected from four post-stroke participants walking on a treadmill (R² of 0.74 ± 0.13). The model uses data from three inertial measurement units (IMUs) and was pretrained on healthy walking data from 6 participants. We implemented a wearable prototype for our ankle torque estimation approach for exoskeleton control and demonstrated the viability of real-time sensing, estimation, and actuation with one post-stroke participant.
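A minimal sketch of a multi-task causal TCN mapping IMU windows to plantarflexion and dorsiflexion torque estimates is shown below; the channel count (three IMUs assumed to give 18 channels), layer sizes, and two-head split are assumptions for illustration rather than the paper's architecture.
```python
# Minimal multi-task causal TCN sketch (PyTorch): IMU window -> two torque traces.
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    def __init__(self, c_in, c_out, k=3, dilation=1):
        super().__init__()
        self.pad = (k - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, k, dilation=dilation)
        self.act = nn.ReLU()
    def forward(self, x):
        x = nn.functional.pad(x, (self.pad, 0))    # left-pad => causal convolution
        return self.act(self.conv(x))

class TorqueTCN(nn.Module):
    def __init__(self, n_imu_ch=18, hidden=32):
        super().__init__()
        self.backbone = nn.Sequential(
            CausalBlock(n_imu_ch, hidden, dilation=1),
            CausalBlock(hidden, hidden, dilation=2),
            CausalBlock(hidden, hidden, dilation=4),
        )
        self.plantar_head = nn.Conv1d(hidden, 1, 1)    # task 1: plantarflexion torque
        self.dorsi_head = nn.Conv1d(hidden, 1, 1)      # task 2: dorsiflexion torque
    def forward(self, x):                              # x: (batch, channels, time)
        h = self.backbone(x)
        return self.plantar_head(h), self.dorsi_head(h)

model = TorqueTCN()
tau_pf, tau_df = model(torch.randn(1, 18, 200))        # 200-sample IMU window
print(tau_pf.shape, tau_df.shape)
```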
|
|
ThCT29 |
105 |
Whole-Body Motion Planning and Control |
Regular Session |
|
15:00-15:05, Paper ThCT29.1 | |
Real-Time Whole-Body Motion Planning Based on Optimized NMPC in Static and Dynamic Environments for Mobile Manipulator |
|
Wu, Wei | Dalian University of Technology |
Zhou, Ximeng | Dalian University of Technology |
Yan, Fei | Dalian University of Technology |
Zhang, Shouxing | Haikou University of Economics |
Zhuang, Yan | Dalian University of Technology |
Xin, Guiyang | Dalian University of Technology |
Keywords: Whole-Body Motion Planning and Control, Mobile Manipulation, Motion and Path Planning
Abstract: Recently, the research on mobile manipulators has attracted increasing attention. Ensuring that mobile manipulators can meet obstacle avoidance constraints and efficiently accomplish assigned tasks in dynamic environments remains a significant challenge. To address this issue, this paper proposes an integrated framework for environment perception, real-time planning, and control optimization. Firstly, we develop a fusion map that combines a Euclidean signed distance field (ESDF) with clustered point clouds occupying cubes, enabling robots to perceive more precise environmental information in complex and changing conditions. Secondly, we introduce a novel rapid generation strategy for 6-DOF guide point sequences, which directs the mobile manipulator to follow the most efficient path to the target location while making real-time adjustments to avoid dynamic obstacles. Additionally, utilizing optimized nonlinear model predictive control (NMPC), we design a whole-body motion controller for the mobile manipulator to prevent the system from becoming trapped in local optima, thereby allowing the manipulator to promptly adjust its state to track the guide points in complex indoor environments. Finally, the proposed algorithm was implemented on a mobile manipulator with an Ackermann base and tested through both simulations and real-world experiments.
|
|
15:05-15:10, Paper ThCT29.2 | |
Vehicle Drifting Planning and Control Framework for Flexible U-Turns in Space-Limited Environments |
|
Yang, Shuaicong | Beijing Institute of Technology |
He, Wei | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Zhang, Ting | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: Whole-Body Motion Planning and Control, Body Balancing
Abstract: Space-limited U-shaped bends are safety-critical scenarios that demand high vehicle maneuverability. However, due to the non-holonomic nature of the vehicle, it is difficult to perform flexible U-turns without intricate adjustments, which is detrimental to the efficient execution of tasks. To address these issues, this work incorporates the drifting maneuver of the vehicle and proposes a planning and control framework for time- and space-efficient passing through constrained U-shaped bends. First, a dual-track, 3-DoF vehicle model is developed, incorporating load-transfer effects and nonlinear tire forces to enhance trajectory precision. Based on this model, a nonlinear optimization-based planner generates time-optimal, space-efficient, and drift-compatible trajectories while ensuring dynamic feasibility. Finally, a multi-layer controller is designed for precise trajectory tracking, integrating a trajectory error feedback compensator, a dynamic state feedforward-feedback regulator, and a model inversion-based actuator controller. Simulation experiments in CarSim validate the proposed framework, demonstrating significant improvements in spatial efficiency and completion time. The results highlight its effectiveness in enhancing autonomous vehicle maneuverability for high-performance applications in constrained environments.
|
|
15:10-15:15, Paper ThCT29.3 | |
Whole-Body Stabilization of Wheeled Bipedal Robots Via Decoupled Control of Wheels and Legs |
|
Jeon, Jechan | Korea Institute of Science and Technology, Korea University |
An, Jaewoo | Korea Institute of Science and Technology (KIST), Korea University |
Cha, Youngsu | Korea University |
Oh, Yonghwan | Korea Institute of Science & Technology (KIST) |
Keywords: Whole-Body Motion Planning and Control, Legged Robots, Wheeled Robots
Abstract: Wheeled-legged robots offer significant mobility advantages, yet their control is complicated by the coupled dynamics of the wheel and leg systems. To address this challenge, we propose a whole-body control framework built upon a decoupled architecture. In this structure, a two-wheeled inverted pendulum (TWIP) template exclusively manages wheel motion, freeing the whole-body controller to focus solely on the leg dynamics. To validate the generality of our approach, we conducted extensive simulations across various robot configurations, including both closed-loop and open-loop leg structures. The results demonstrate the robot's ability to maintain stability across several challenging scenarios: a high-speed (5 m/s) slalom on flat ground, a low-speed (0.5 m/s) slalom on terrain with 10 cm height variations, and immediate stabilization after a 2 m free-fall. These findings highlight the potential of decoupled control as a promising direction for developing more agile and resilient robotic systems.
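To illustrate the TWIP-template idea of handling wheel balancing separately from the legs, the sketch below applies LQR to a generic linearized wheeled inverted pendulum; the model, parameter values, and controller are illustrative assumptions, not the paper's formulation.
```python
# LQR balancing for a linearized two-wheeled inverted pendulum template.
# State x = [p, p_dot, theta, theta_dot]; input u = longitudinal wheel force.
import numpy as np
from scipy.linalg import solve_continuous_are

m_b, m_w, l, g = 10.0, 2.0, 0.4, 9.81          # body mass, wheel mass, CoM height (assumed)
A = np.array([[0, 1, 0, 0],
              [0, 0, -m_b * g / m_w, 0],
              [0, 0, 0, 1],
              [0, 0, (m_b + m_w) * g / (m_w * l), 0]])
B = np.array([[0], [1 / m_w], [0], [-1 / (m_w * l)]])
Q = np.diag([10.0, 1.0, 100.0, 1.0])
R = np.array([[0.1]])

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)                 # u = -K x
x = np.array([0.1, 0.0, 0.05, 0.0])             # small offset and lean
print("wheel command:", float(-K @ x))
```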
|
|
15:15-15:20, Paper ThCT29.4 | |
Planning and Control for Active Morphing Tensegrity Aerial Vehicles in Confined Spaces |
|
Hao, Siyuan | Beijing Institute of Technology |
Tao, Zichen | Beijing Institute of Technology |
Gui, Yun | Beijing Institute of Technology |
Liu, Songyuan | Beijing Institute of Technology |
Shi, Jiaxu | Beijing Institute of Technology |
Cao, Xu | Beijing Institute of Technology |
Yang, Qingkai | Beijing Institute of Technology |
Keywords: Whole-Body Motion Planning and Control, Modeling, Control, and Learning for Soft Robots
Abstract: Morphing quadrotors are capable of adapting to constrained environments through geometric reconfiguration. However, existing systems are limited by mechanical complexity and rigid links, which affect both safety and performance in such environments. In this paper, we propose a strut-actuated tensegrity aerial vehicle that integrates shape adaptation with collision resilience. By incorporating deformable struts and a cable network, our vehicle enables real-time morphological adjustments during flight while maintaining stability. We present a hierarchical planning framework that ensures the entire vehicle remains confined within an icosahedral space, thereby guaranteeing full-body safety. An on-manifold Model Predictive Controller (MPC) is employed to track these optimized trajectories and compensate for inertia shifts during shape deformation. Simulation results validate the effectiveness of the proposed framework, demonstrating its capability to navigate in restricted scenarios.
|
|
15:20-15:25, Paper ThCT29.5 | |
Experimental Comparison of Whole-Body Control Formulations for Humanoid Robots in Task Acceleration and Task Force Spaces |
|
Sovukluk, Sait | TU Wien |
Zambella, Grazia | TU Wien |
Egle, Tobias | TU Wien |
Ott, Christian | TU Wien |
Keywords: Whole-Body Motion Planning and Control, Humanoid and Bipedal Locomotion, Humanoid Robot Systems
Abstract: This paper studies the experimental comparison of two different whole-body control formulations for humanoid robots: inverse dynamics whole-body control (ID-WBC) and passivity-based whole-body control (PB-WBC). The two controllers fundamentally differ from each other as the former is formulated in task acceleration space and the latter in task force space with passivity considerations. Even though both control methods predict stability under ideal conditions in closed-loop dynamics, their robustness against joint friction, sensor noise, unmodeled external disturbances, and non-perfect contact conditions is not evident. Therefore, we analyze and experimentally compare the two controllers on a humanoid robot platform through swing foot position and orientation control, squatting with and without unmodeled additional weights, and jumping. We also relate the observed performance and characteristic differences with the controller formulations and highlight each controller's advantages and disadvantages.
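The task-acceleration flavour (ID-WBC) can be pictured as a small least-squares problem for joint accelerations followed by inverse dynamics, as sketched below; real whole-body controllers add contact, torque, and friction-cone constraints, and all matrices here are random stand-ins rather than the paper's formulation.
```python
# Unconstrained task-acceleration QP sketch: find qdd realizing a desired
# task acceleration, then map to torques via inverse dynamics.
import numpy as np

rng = np.random.default_rng(1)
n, m = 12, 6                                  # joints, task dimension
J = rng.standard_normal((m, n))               # task Jacobian (stand-in)
Jdot_qdot = rng.standard_normal(m)            # J_dot * q_dot term (stand-in)
xdd_des = rng.standard_normal(m)              # desired task acceleration
M = np.eye(n) * 2.0                           # joint-space inertia (stand-in)
h = rng.standard_normal(n)                    # Coriolis + gravity (stand-in)

# min ||J qdd + Jdot_qdot - xdd_des||^2 + eps ||qdd||^2
eps = 1e-6
A = np.vstack([J, np.sqrt(eps) * np.eye(n)])
b = np.concatenate([xdd_des - Jdot_qdot, np.zeros(n)])
qdd = np.linalg.lstsq(A, b, rcond=None)[0]
tau = M @ qdd + h                             # inverse dynamics
print("torque norm:", np.linalg.norm(tau))
```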
|
|
15:25-15:30, Paper ThCT29.6 | |
Context-Aware Behavior Learning with Heuristic Motion Memory for Underwater Manipulation |
|
Buchholz, Markus | Heriot-Watt University |
Carlucho, Ignacio | Heriot-Watt University |
Grimaldi, Michele | University of Girona |
Koskinopoulou, Maria | Heriot-Watt University |
Petillot, Yvan R. | Heriot-Watt University |
Keywords: Whole-Body Motion Planning and Control, Marine Robotics, Motion and Path Planning
Abstract: Autonomous motion planning is critical for efficient and safe underwater manipulation in dynamic marine environments. Current motion planning methods often fail to effectively utilize prior motion experiences and adapt to real-time uncertainties inherent in underwater settings. In this paper, we introduce an Adaptive Heuristic Motion Planner framework that integrates a Heuristic Motion Space (HMS) with Bayesian Networks to enhance motion planning for autonomous underwater manipulation. Our approach employs the Probabilistic Roadmap (PRM) algorithm within HMS to optimize paths by minimizing a composite cost function that accounts for distance, uncertainty, energy consumption, and execution time. By leveraging HMS, our framework significantly reduces the search space, thereby boosting computational performance and enabling real-time planning capabilities. Bayesian Networks are utilized to dynamically update uncertainty estimates based on real-time sensor data and environmental conditions, thereby refining the joint probability of path success. Through extensive simulations and real-world test scenarios, we showcase the advantages of our method in terms of enhanced performance and robustness. This probabilistic approach significantly advances the capability of autonomous underwater robots, ensuring optimized motion planning in the face of dynamic marine challenges.
|
|
15:30-15:35, Paper ThCT29.7 | |
A Hierarchical MPC for End-Effector Tracking Control of Legged Mobile Manipulators (I) |
|
Wang, Dongqi | Zhejiang University |
Yu, Jiyu | Zhejiang University |
Wu, Shuangpeng | Zhejiang University |
Li, Zhang | China Nuclear Power Operation Technology Corporation, Ltd |
Li, Chao | Hangzhou Deeprobotics Co.Ltd |
Xiong, Rong | Zhejiang University |
Qu, Shaoxing | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Whole-Body Motion Planning and Control, Motion Control
Abstract: This article presents a hierarchical model predictive control (MPC) framework for the end-effector tracking control problem of a legged mobile manipulator. In the high-level part, a kinematic MPC over a long-time horizon computes both base and joint trajectories, then a quadratic program (QP) based optimization solves ground-reaction-forces (GRFs) satisfying the robot's centroidal dynamics and the friction cone constraints. In the low-level part, a kinodynamic MPC over a short-time horizon tracks command end-effector trajectories and outputs from the high-level part while satisfying all nonlinear dynamics constraints. Due to the complexity of MPC formulations and high real-time requirements, traditional MPC for legged mobile manipulators can only generate short-time horizon solutions for tracking tasks over longer time horizons, which may lead to the optimization falling into bad local minima. In our method, the long-term trajectories from the high-level part can guide the optimization of the short-term kinodynamic MPC to generate a better solution. We validate the effectiveness of our method through several simulation and hardware experiments. In comparison to traditional MPC, the proposed method improves the trajectory tracking accuracy of the robot’s end-effector while reducing the violations of the system's physical limit constraints and environment-collision avoidance constraints.
|
|
15:35-15:40, Paper ThCT29.8 | |
Approximate Convex Decomposition-Based Whole-Body Trajectory Optimization for Robots in Dense Environments |
|
Gong, Linao | Harbin Institute of Technology |
He, Fenghua | Harbin Institute of Technology |
Hao, Ning | Harbin Institute of Technology |
Keywords: Whole-Body Motion Planning and Control, Collision Avoidance, Motion and Path Planning
Abstract: Whole-body planning is critical for enabling robots to navigate effectively in complex and dense environments. Traditional obstacle-based planning methods often restrict the representation of both robots and obstacles to simple convex polyhedra. This limitation makes it difficult to construct compact convex polyhedral envelopes around the intricate geometries of real-world obstacles found in such environments. In this paper, we propose an approximate convex decomposition (ACD) based method to generate convex polyhedral maps that effectively represent the non-convex shapes of robots as assemblies of multiple convex objects. Furthermore, we propose a differentiable convex polyhedron collision evaluation method to facilitate collision detection. Extensive experiments demonstrate that our method not only enhances the accuracy of collision detection in cluttered environments but also expands the potential applications of robotics in complex scenarios.
|
|
ThCT30 |
106 |
Telerobotics and Navigation |
Regular Session |
|
15:00-15:05, Paper ThCT30.1 | |
OpenObject-NAV: Open-Vocabulary Object-Oriented Navigation Based on Dynamic Carrier-Relationship Scene Graph |
|
Tang, Yujie | Beijing Institute of Technology |
Wang, Meiling | Beijing Institute of Technology |
Deng, Yinan | Beijing Institute of Technology |
Zheng, Zibo | University of Nottingham of Ningbo China |
Zhong, Jiagui | Beijing Institute of Technology |
Liu, Tiancheng | Beijing Institute of Technology |
Zhao, ChenJie | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Vision-Based Navigation, Mapping
Abstract: In everyday life, frequently used objects like cups often have unfixed positions and multiple instances within the same category, and their carriers frequently change as well. As a result, it becomes challenging for a robot to efficiently navigate to a specific instance. To tackle this challenge, the robot must continuously capture scene changes and update its plans. However, current object navigation approaches primarily focus on the semantic level and lack the ability to dynamically update the scene representation. To address these limitations, this paper captures the relationships between frequently used objects and their static carriers. Specifically, it constructs an open-vocabulary Carrier-Relationship Scene Graph (CRSG) and updates the carrying status during robot navigation to reflect the dynamic changes of the scene. Based on the CRSG, we further propose an instance navigation strategy that models the navigation process as a Markov Decision Process. At each step, decisions are informed by a Large Language Model's commonsense knowledge and visual-language feature similarity. We designed a series of long-sequence navigation tasks for frequently used everyday items in the Habitat simulator. The results demonstrate that by updating the CRSG, the robot can efficiently navigate to moved targets. Additionally, we deployed our algorithm on a real robot and validated its practical effectiveness. The project page can be found here: https://OpenObject-Nav.github.io.
|
|
15:05-15:10, Paper ThCT30.2 | |
A Simple Algebraic Solution for Estimating the Pose of a Camera from Planar Point Features |
|
Bouazza, Tarek | Laboratoire I3S UCA-CNRS |
Hamel, Tarek | UNSA-CNRS |
Samson, Claude | INRIA |
Keywords: Vision-Based Navigation, Localization, Visual Tracking
Abstract: This paper presents a simple algebraic method to estimate the pose of a camera relative to a planar target from n ≥ 4 reference points with known coordinates in the target frame and their corresponding bearing measurements in the camera frame. The proposed approach follows a hierarchical structure; first, the unit vector normal to the target plane is determined, followed by the camera's position vector, its distance to the target plane, and finally, the full orientation. To improve the method's robustness to measurement noise, an averaging methodology is introduced to refine the estimation of the target's normal direction. The accuracy and robustness of the approach are validated through extensive experiments.
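Not the authors' hierarchical algebraic method, but a standard planar-PnP baseline it can be compared against is sketched below using OpenCV's IPPE solver on n ≥ 4 coplanar points; the intrinsics and point coordinates are made up for illustration.
```python
# Planar pose-from-points baseline with OpenCV's IPPE solver (requires opencv-python).
import cv2
import numpy as np

K = np.array([[600.0, 0, 320.0], [0, 600.0, 240.0], [0, 0, 1]])   # assumed intrinsics
obj = np.array([[-0.1, -0.1, 0], [0.1, -0.1, 0],
                [0.1, 0.1, 0], [-0.1, 0.1, 0]], dtype=np.float64)  # square target on z = 0
img = np.array([[300, 220], [340, 222],
                [338, 262], [298, 260]], dtype=np.float64)         # measured pixels (toy)

ok, rvec, tvec = cv2.solvePnP(obj, img, K, None, flags=cv2.SOLVEPNP_IPPE)
R, _ = cv2.Rodrigues(rvec)
print("target plane normal in camera frame:", R[:, 2])            # third column = target z-axis
print("camera-to-target translation:", tvec.ravel())
```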
|
|
15:10-15:15, Paper ThCT30.3 | |
Real-Time Photorealistic Mapping for Situational Awareness in Robot Teleoperation |
|
Page, Ian | UGA, Gipsa-Lab, Inria, Framatome |
Susbielle, Pierre | Gipsa-Lab CNRS |
Aycard, Olivier | Grenoble Institute of Technology |
Wieber, Pierre-Brice | INRIA |
Keywords: Telerobotics and Teleoperation, SLAM, RGB-D Perception
Abstract: Achieving efficient remote teleoperation is particularly challenging in unknown environments, as the teleoperator must rapidly build an understanding of the site's layout. Online 3D mapping is a proven strategy to tackle this challenge, as it enables the teleoperator to progressively explore the site from multiple perspectives. However, traditional online map-based teleoperation systems struggle to generate visually accurate 3D maps in real-time due to the high computational cost involved, leading to poor teleoperation performance. In this work, we propose a solution to improve teleoperation efficiency in unknown environments. Our approach introduces a novel, modular and efficient GPU-based integration of recent advances in Gaussian splatting SLAM with existing online map-based teleoperation systems. We compare the proposed solution against state-of-the-art teleoperation systems and validate its performance through real-world experiments using an aerial vehicle. The results show significant improvements in decision-making speed and more accurate interaction with the environment, leading to greater teleoperation efficiency. In doing so, our system enhances remote teleoperation by seamlessly integrating photorealistic map generation with real-time performance, enabling effective teleoperation in unfamiliar environments.
|
|
15:15-15:20, Paper ThCT30.4 | |
Correspondence-Free Pose Estimation with Patterns: A Unified Approach for Multi-Dimensional Vision |
|
Quan, Quan | Beihang University |
Dai, Dun | Beihang University |
Keywords: Vision-Based Navigation, Localization, SLAM
Abstract: 6D pose estimation is a central problem in robot vision. Compared with pose estimation based on point correspondences or its robust versions, correspondence-free methods are often more flexible. However, existing correspondence-free methods often rely on feature representation alignment or end-to-end regression. For such a purpose, a new correspondence-free pose estimation method and its practical algorithms are proposed, whose key idea is the elimination of unknowns through a process of addition, separating pose estimation from correspondence. By taking the considered point sets as patterns, feature functions used to describe these patterns are introduced to establish a sufficient number of equations for optimization. The proposed method is applicable to nonlinear transformations such as perspective projection and covers various pose estimation problems, from 3D-to-3D points and 3D-to-2D points to 2D-to-2D points. Experimental results on both simulated and actual data are presented to demonstrate the effectiveness of the proposed method.
|
|
15:20-15:25, Paper ThCT30.5 | |
Assessing Trust and Cognitive Load in Teleoperated Robotic Systems across Different Information Conditions |
|
García Cárdenas, Juan José | ENSTA - Institute Polytechinique De Paris |
Tapus, Adriana | ENSTA Paris, Institut Polytechnique De Paris |
Keywords: Telerobotics and Teleoperation
Abstract: This paper investigates how different levels of information impact user trust and cognitive load in teleoperated robotic systems. Participants performed tasks under three conditions: (C1) minimal information after initial visual feedback, (C2) verbal guidance through a graphical interface, and (C3) a combination of visual and verbal guidance via a graphical user interface (GUI) that shows the direction in which the user should move the robot to complete the task. Measurements included physiological responses such as galvanic skin response (GSR), eye blink rate, and facial temperature, along with task performance. The findings revealed that increased cognitive load reduced user trust and performance. When only minimal information was provided, participants experienced the highest cognitive load and lowest trust levels. Verbal guidance significantly reduced cognitive load and increased trust, whereas the combination of visual and verbal guidance caused cognitive overload, counteracting the expected increase in trust. This study underscores the importance of balancing information quantity and quality to enhance user experience and the efficiency of teleoperated robotic systems.
|
|
15:25-15:30, Paper ThCT30.6 | |
Repetitive Motion Control for Redundant Manipulator under False Data Injection Attacks |
|
Zhao, Yanqiong | Jinan University |
Zhang, Yinyan | Jinan University |
Keywords: Telerobotics and Teleoperation
Abstract: Repetitive motion control (RMC) for redundant manipulators has been extensively studied from the kinematic perspective, whereas security concerns under malicious adversaries have received limited attention. In network-controlled manipulators, when control commands sent from the control center to the remote manipulator are subject to false data injection attacks (FDIAs), serious incidents and potential harm to individuals can occur. This paper proposes a novel resilient controller such that the manipulator can successfully complete motion tracking tasks and address the non-repetitive motion problem, even in the presence of FDIAs. The problem is first reformulated as a convex optimization problem with an unknown parameter related to FDIAs, where the RMC criterion serves as the objective function and physical limitations are incorporated as inequality constraints. A recurrent neural network (RNN) is then introduced to solve the problem, improving computational efficiency. Additionally, a detection mechanism is integrated to estimate the unknown attack parameter, allowing the RNN to find the optimal control command. Simulations and experiments are conducted on an RM65-B manipulator to validate the efficacy of the proposed method, and comparisons with existing approaches highlight its superior performance.
|
|
15:30-15:35, Paper ThCT30.7 | |
CA^{2}Point: Learning Keypoint Detection and Description with Context Aggregation and Cross Augmentation |
|
Meng, Xuebin | Chinese Academy of Sciences |
Li, Wei | Institute of Computing Technology, Chinese Academy of Sciences |
Hu, Yu | Institute of Computing Technology Chinese Academy of Sciences |
Han, Yinhe | Institute of Computing Technology, Chinese Academy of Sciences |
Keywords: Visual Learning, Localization, SLAM
Abstract: Keypoint detection and description are fundamental tasks for a variety of computer vision applications. Due to the limited receptive field of convolutional neural networks, most existing methods based on deep learning mainly focus on local features instead of taking into account the global context of the entire image. The purpose of this work is to enhance the detection and description of keypoints by leveraging global information obtained from a Transformer, and to boost the consistency between keypoints and descriptors through their interaction. Specifically, the above two improvements are respectively implemented through the Local & Global Context Aggregation (LGCA) Module and the Point & Descriptor Cross Augmentation (PDCA) Module proposed in this article. The LGCA module, which can model long-range context, is inserted into a Feature Pyramid Network (FPN) to extract features that contain diverse scales and different receptive fields. Moreover, the PDCA module enhances descriptors using the geometric information of the detected keypoints, while enhancing the keypoint detection process using the position coordinates of correctly matched descriptors. Finally, we design a lightweight model to improve running efficiency. Extensive experiments on various tasks demonstrate that our method achieves a substantial performance improvement over current feature extraction methods. Code is available at: https://github.com/meng152634/CA2Point.
|
|
15:35-15:40, Paper ThCT30.8 | |
Adaptive Anomaly Recovery for Telemanipulation: A Diffusion Model Approach to Vision-Based Tracking |
|
Wang, Haoyang | Oklahoma State University |
Guo, Haoran | Oklahoma State University |
Li, Zhengxiong | University of Colorado Denver |
Tao, Lingfeng | Kennesaw State University |
Keywords: Telerobotics and Teleoperation, Visual Tracking, Dexterous Manipulation
Abstract: Dexterous telemanipulation critically relies on the continuous and stable tracking of the human operator's commands to ensure robust operation. Vision-based tracking methods are widely used but have low stability due to anomalies such as occlusions, inadequate lighting, and loss of sight. Traditional filtering, regression, and interpolation methods are commonly used to compensate for explicit information such as angles and positions. These approaches are restricted to low-dimensional data and often result in information loss compared to the original high-dimensional image and video data. Recent advances in diffusion-based approaches, which can operate on high-dimensional data, have achieved remarkable success in video reconstruction and generation. However, these methods have not been fully explored in continuous control tasks in robotics. This work introduces the Diffusion-Enhanced Telemanipulation (DET) framework, which incorporates the Frame-Difference Detection (FDD) technique to identify and segment anomalies in video streams. These anomalous clips are replaced after reconstruction using diffusion models, ensuring robust telemanipulation performance under challenging visual conditions. We validated this approach in various anomaly scenarios and compared it with baseline methods. Experiments show that DET achieves an average RMSE reduction of 17.2% compared to the cubic spline and 51.1% compared to FFT-based interpolation for different occlusion durations.
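A minimal sketch of a frame-difference style detector that flags anomalous video segments (e.g., sudden occlusions) is given below; the statistic, threshold, and toy data are assumptions, not the paper's FDD settings.
```python
# Frame-difference anomaly detector sketch: flag frames whose frame-to-frame
# change deviates strongly from the typical motion level.
import numpy as np

def detect_anomalous_frames(frames, z_thresh=3.0):
    """frames: (T, H, W) grayscale array; returns a boolean anomaly mask of length T."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    z = (diffs - diffs.mean()) / (diffs.std() + 1e-8)
    mask = np.zeros(len(frames), dtype=bool)
    mask[1:] = np.abs(z) > z_thresh       # large deviation => candidate anomaly
    return mask

video = np.random.randint(0, 255, (100, 64, 64), dtype=np.uint8)
video[40:45] = 0                          # simulate a brief occlusion
print(np.where(detect_anomalous_frames(video))[0])
```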
|
|
ThDT1 |
401 |
Gesture, Posture and Facial Expressions 2 |
Regular Session |
|
16:40-16:45, Paper ThDT1.1 | |
MaskSem: Semantic-Guided Masking for Learning 3D Hybrid High-Order Motion Representation |
|
Wei, Wei | Beijing University of Posts and Telecommunication |
Zhang, Shaojie | Beijing University of Posts and Telecommunications |
Dang, Yonghao | Beijing University of Posts and Telecommunications |
Jianqin, Yin | School of Artificial Intelligence, Beijing University of Posts and Telecommunications |
Keywords: Gesture, Posture and Facial Expressions, Deep Learning for Visual Perception, Representation Learning
Abstract: Human action recognition is a crucial task for intelligent robotics, particularly within the context of human-robot collaboration research. In self-supervised skeleton-based action recognition, the mask-based reconstruction paradigm learns the spatial structure and motion patterns of the skeleton by masking joints and reconstructing the target from unlabeled data. However, existing methods focus on a limited set of joints and low-order motion patterns, limiting the model's ability to understand complex motion patterns. To address this issue, we introduce MaskSem, a novel semantic-guided masking method for learning 3D hybrid high-order motion representations. This novel framework leverages Grad-CAM based on relative motion to guide the masking of joints, which represent the most semantically rich temporal regions. The semantic-guided masking process can encourage the model to explore more discriminative features. Furthermore, we propose using hybrid high-order motion as the reconstruction target, enabling the model to learn multi-order motion patterns. Specifically, low-order motion velocity and high-order motion acceleration are used together as the reconstruction target. This approach offers a more comprehensive description of the dynamic motion process, enhancing the model's understanding of motion patterns. Experiments on the NTU60, NTU120, and PKU-MMD datasets show that MaskSem, combined with a vanilla transformer, improves skeleton-based action recognition, making it more suitable for real-time robotic applications. The source code of our MaskSem is available at https://github.com/JayEason66/MaskSem.
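The hybrid reconstruction target can be pictured as first- and second-order finite differences of the joint-position sequence, as in the short sketch below; the array layout is an assumption for illustration.
```python
# Hybrid motion targets: velocity (low-order) and acceleration (high-order)
# computed from a (T frames, J joints, 3D) skeleton sequence.
import numpy as np

def hybrid_motion_targets(joints):
    """joints: (T, J, 3) -> (velocity, acceleration), each (T-2, J, 3)."""
    vel = np.diff(joints, n=1, axis=0)        # low-order motion
    acc = np.diff(joints, n=2, axis=0)        # high-order motion
    return vel[1:], acc                        # align lengths

seq = np.random.randn(64, 25, 3)               # e.g. an NTU-style 25-joint skeleton clip
vel, acc = hybrid_motion_targets(seq)
print(vel.shape, acc.shape)                    # (62, 25, 3) (62, 25, 3)
```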
|
|
16:45-16:50, Paper ThDT1.2 | |
One-Shot Gesture Recognition for Underwater Diver-To-Robot Communication |
|
Joshi, Rishikesh | University of Minnesota |
Sattar, Junaed | University of Minnesota |
Keywords: Gesture, Posture and Facial Expressions, Marine Robotics, Human-Robot Collaboration
Abstract: Reliable human-robot communication is essential for effective underwater human-robot interaction (U-HRI), yet traditional methods such as acoustic signaling and predefined gesture-based models suffer from limitations in adaptability and robustness. In this work, we propose One-Shot Gesture Recognition (OSG), a novel method that enables real-time, pose-based, temporal gesture recognition underwater from a single demonstration, eliminating the need for extensive dataset collection or model retraining. OSG leverages shape-based classification techniques, including Hu moments, Zernike moments, and Fourier descriptors, to robustly recognize gestures in visually-challenging underwater environments. Our system achieves high accuracy on real-world underwater video data and operates efficiently on embedded hardware commonly found on autonomous underwater robots (AUVs), demonstrating its feasibility for deployment on-board robots. Compared to deep learning approaches, OSG is lightweight, computationally efficient, and highly adaptable, making it ideal for diver-to-robot communication. We evaluate OSG’s performance on an augmented gesture dataset and real-world underwater video data, comparing its accuracy against deep learning methods. Our results show OSG’s potential to enhance U-HRI by enabling the immediate deployment of user-defined gestures without the constraints of predefined gesture languages.
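One-shot shape matching with Hu moments, one of the descriptor families OSG relies on, can be sketched as storing a single demonstration's descriptor and classifying new silhouettes by nearest distance; the log-scaling, distance metric, and toy silhouettes below are assumptions, not the paper's pipeline.
```python
# One-shot silhouette classification with Hu moments (requires opencv-python).
import cv2
import numpy as np

def hu_descriptor(binary_mask):
    hu = cv2.HuMoments(cv2.moments(binary_mask.astype(np.uint8))).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)   # log-scale for numerical stability

def classify(mask, templates):
    """templates: {gesture_name: descriptor from a single demonstration}."""
    d = hu_descriptor(mask)
    return min(templates, key=lambda k: np.linalg.norm(d - templates[k]))

circle = np.zeros((64, 64), np.uint8); cv2.circle(circle, (32, 32), 20, 1, -1)
square = np.zeros((64, 64), np.uint8); cv2.rectangle(square, (12, 12), (52, 52), 1, -1)
templates = {"stop": hu_descriptor(circle), "follow": hu_descriptor(square)}
print(classify(circle, templates))                        # -> "stop"
```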
|
|
16:50-16:55, Paper ThDT1.3 | |
Towards Open-World Human Action Segmentation Using Graph Convolutional Networks |
|
Xing, Hao | Technical University of Munich (TUM) |
Boey, Kai Zhe | Technical University of Munich (TUM) |
Cheng, Gordon | Technical University of Munich |
Keywords: Gesture, Posture and Facial Expressions, Intention Recognition, Deep Learning Methods
Abstract: Current methods for human-object interaction segmentation excel in closed-world settings but struggle to generalize to open-world scenarios where novel actions emerge. Since collecting exhaustive training data for all possible dynamic human activities is impractical, a model capable of detecting and segmenting novel, out-of-distribution (OOD) actions without manual annotation is needed. To address this, we formally define the open-world action segmentation problem and propose a novel framework featuring three key components: 1) an Enhanced Pyramid Graph Convolutional Network with a new decoder for robust spatiotemporal upsampling, 2) hybrid-based training synthesizing OOD data to eliminate reliance on manual labels, and 3) a temporal clustering loss that groups in-distribution actions while distancing OOD samples. We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and Two Hands and Object datasets. Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performances (AUROC), respectively. Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.
|
|
16:55-17:00, Paper ThDT1.4 | |
ExFace: Expressive Facial Control for Humanoid Robots with Diffusion Transformers and Bootstrap Training |
|
Zhang, Dong | ShanghaiTech University |
Peng, Jingwei | ShanghaiTech University |
Jiao, Yuyang | Shanghaitech University |
Gu, Jiayuan | ShanghaiTech University |
Yu, Jingyi | ShanghaiTech University |
Chen, Jiahao | ShanghaiTech University |
Keywords: Gesture, Posture and Facial Expressions, Deep Learning Methods, Data Sets for Robot Learning
Abstract: This paper presents a novel Expressive Facial Control (ExFace) method based on Diffusion Transformers, which achieves precise mapping from human facial blendshapes to bionic robot motor control. By incorporating an innovative model bootstrap training strategy, our approach not only generates high-quality facial expressions but also significantly improves accuracy and smoothness. Experimental results demonstrate that the proposed method outperforms previous methods in terms of accuracy, frames per second (FPS), and response time. Furthermore, we develop the ExFace dataset driven by human facial data. ExFace shows excellent real-time performance and natural expression rendering in applications such as robot performances and human-robot interactions, offering a new solution for bionic robot interaction.
|
|
17:00-17:05, Paper ThDT1.5 | |
GPT-Driven Gestures: Leveraging Large Language Models to Generate Expressive Robot Motion for Enhanced Human-Robot Interaction |
|
Roy, Liam | Monash University |
Croft, Elizabeth | University of Victoria |
Ramirez-Serrano, Alejandro | 4Front Robotics Ltd |
Kulic, Dana | Monash University |
Keywords: Gesture, Posture and Facial Expressions, Human-Robot Collaboration, Natural Machine Motion
Abstract: Expressive robot motion is a form of nonverbal communication that enables robots to convey their internal states, fostering effective human-robot interaction. A key step in designing expressive robot motions is developing a mapping from the desired states the robot will express to the robot's hardware and available degrees of freedom (design space). This paper introduces a novel framework to autonomously generate this mapping by leveraging a large language model (LLM) to select motion parameters and their values for target robot states. We evaluate expressive robot body language displayed on a Unitree Go1 quadruped as generated by a Generative Pre-trained Transformer (GPT) provided with a set of adjustable motion parameters. Through a two-part study (N = 120), we compared LLM-generated expressive motions with both randomly selected and human-selected expressions. Our results show that participants viewing LLM-generated expressions achieve significantly higher state classification accuracy than random baselines and perform comparably with human-generated expressions. Additionally, in our post-hoc analysis we find that the Earth Mover's Distance provides a useful metric for identifying similar expressions in the design space that lead to classification confusion.
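The post-hoc similarity analysis can be illustrated with the Earth Mover's (Wasserstein) distance between motion-parameter vectors, as below; the parameter names and values are toy assumptions, not the study's data.
```python
# Comparing expression parameter vectors with the Earth Mover's distance.
import numpy as np
from scipy.stats import wasserstein_distance

happy = np.array([0.8, 0.6, 0.9, 0.2])      # e.g. speed, bounce, head height, tilt (assumed)
excited = np.array([0.9, 0.7, 0.95, 0.25])
tired = np.array([0.2, 0.1, 0.3, 0.8])

print("happy vs excited:", wasserstein_distance(happy, excited))   # small => likely confusable
print("happy vs tired:  ", wasserstein_distance(happy, tired))     # larger => more distinct
```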
|
|
17:05-17:10, Paper ThDT1.6 | |
Subject-Embedded Vision Transformer with Transfer Learning for Cross-Subject Dynamic Hand Gesture Recognition Using HD-sEMG |
|
Feng, Jirou | Korea Advanced Institute of Science and Technology |
Bao, Xingce | École Polytechnique Fédérale De Lausanne |
Choi, Junhwan | Korea Advanced Institute of Science and Technology, (KAIST) |
Kyeong, Seulki | Sejong University |
Kim, Jung | KAIST |
Keywords: Gesture, Posture and Facial Expressions, Intention Recognition, Neurorobotics
Abstract: Hand gesture recognition (HGR) is crucial in developing advanced prosthetics, neurorobotics, and human-robot interaction (HRI). Surface electromyography (sEMG) and high-density sEMG (HD-sEMG) have gained attention for their ability to capture the muscle activity underlying hand gestures. Although many models achieve high performance within the same subjects, generalizing across different subjects remains a significant challenge, limiting the practical application of these systems in real-world settings. Furthermore, most conventional approaches primarily focus on the steady phase of gestures, which slows down real-time prediction. To address these issues, we propose a cross-subject dynamic hand gesture recognition (DHGR) framework based on the Vision Transformer (ViT) architecture, referred to as ViT-DHGR. Our model focuses explicitly on the signal transient phase before gesture stabilization to reduce gesture prediction latency and counteract system control delays. By incorporating subject embeddings and transfer learning strategies, the proposed ViT-DHGR framework for 34 dynamic hand gestures achieved an accuracy of 76.44% for 10 subjects using only 1 repetition of gesture data, which improved to 85.03% with 2 repetitions. In addition, our proposed framework achieves over 16% higher average accuracy across test subjects using 1 repetition of data compared to training subject-specific models from scratch. This work demonstrates the potential of HD-sEMG for capturing dynamic hand gestures and highlights the benefits of cross-user knowledge transfer in reducing data requirements and enhancing practicality for robotic applications.
|
|
ThDT2 |
402 |
Intelligent and Flexible Manufacturing |
Regular Session |
|
16:40-16:45, Paper ThDT2.1 | |
Multimodal Autonomous Robotic Long-Horizon Task Planning Via Embodied Language Model and Behavior Trees |
|
Chen, Hongpeng | Hong Kong Polytechnic University |
Liu, Shimin | The Hong Kong Polytechnic University |
Li, Zhiyuan | Hong Kong Polytechnic University |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Zheng, Pai | The Hong Kong Polytechnic University |
Keywords: Intelligent and Flexible Manufacturing, Task Planning, Human-Centered Robotics
Abstract: Enabling robotic systems to perform long-horizon manipulation planning in real-world environments based on multimodal embodied perception and comprehension remains a longstanding challenge. Recent advancements in large language models (LLMs) have spurred the development of LLM-based planners; however, these approaches often rely on human-provided textual representations or extensive prompt engineering, lacking the ability to quantitatively interpret the environment. To overcome these limitations, we propose a novel framework that leverages LLMs and vision-language models (VLMs) to perform abstract reasoning and extract task-relevant representations from the environment using grounding mechanisms. To further enhance robotic capabilities, we introduce a systematic approach to constructing robotic skill libraries, enabling efficient generation of feasible and optimal actions. Unlike prior work, our LLM-based task planner reformulates user instructions into Planning Domain Description Language (PDDL) problems and employs Behavior Trees to represent the hierarchical structure of tasks, offering interpretable and modular task execution. Extensive evaluations on diverse real-world long-horizon manipulation tasks demonstrate the effectiveness of the proposed method, achieving an average success rate exceeding 80%. Furthermore, the framework functions as a high-level planner, empowering robots with substantial autonomy in unstructured environments by leveraging multimodal sensor inputs.
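A behavior tree with a sequence node and leaf skills can be sketched in a few lines, illustrating the interpretable, modular task structure the planner targets; the node names and skill functions below are hypothetical, not the paper's skill library.
```python
# Minimal behavior-tree sketch: a Sequence node ticks child Actions in order.
from dataclasses import dataclass, field
from typing import Callable, List

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

@dataclass
class Action:
    name: str
    skill: Callable[[], bool]          # wraps a robot skill returning success/failure
    def tick(self):
        return SUCCESS if self.skill() else FAILURE

@dataclass
class Sequence:
    children: List = field(default_factory=list)
    def tick(self):
        for child in self.children:    # fail fast if any child fails
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

tree = Sequence([
    Action("locate_cup", lambda: True),
    Action("pick_cup", lambda: True),
    Action("place_on_tray", lambda: True),
])
print(tree.tick())                     # SUCCESS if every skill reports success
```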
|
|
16:45-16:50, Paper ThDT2.2 | |
Automatic MILP Model Construction for Multi-Robot Task Allocation and Scheduling Based on Large Language Models |
|
Peng, Mingming | Huazhong University of Science and Technology |
Chen, Zhendong | Huazhong University of Science and Technology |
Yang, Jie | Huazhong University of Science and Technology |
Huang, Jin | Huazhong University of Science and Technology |
Shi, Zhengqi | Huazhong University of Science and Technology |
Liu, Qihao | Huazhong University of Science and Technology |
Li, Xinyu | Huazhong University of Science and Technology |
Gao, Liang | Huazhong Univ. of Sci. & Tech |
Keywords: Intelligent and Flexible Manufacturing, AI-Based Methods, Multi-Robot Systems
Abstract: With the accelerated development of Industry 4.0, intelligent manufacturing systems increasingly require efficient task allocation and scheduling in multi-robot systems. However, existing methods rely on domain expertise and face challenges in adapting to dynamic production constraints. Additionally, enterprises have high privacy requirements for production scheduling data, which prevents the use of cloud-based large language models (LLMs) for solution development. To address these challenges, there is an urgent need for an automated modeling solution that meets data privacy requirements. This study proposes a knowledge-augmented mixed integer linear programming (MILP) automated formulation framework, integrating local LLMs with domain-specific knowledge bases to generate executable code from natural language descriptions automatically. The framework employs a knowledge-guided DeepSeek-R1-Distill-Qwen-32B model to extract complex spatiotemporal constraints (82% average accuracy) and leverages a supervised fine-tuned Qwen2.5-Coder-7B-Instruct model for efficient MILP code generation (90% average accuracy). Experimental results demonstrate that the framework successfully achieves automatic modeling in the aircraft skin manufacturing case while ensuring data privacy and computational efficiency. This research provides a low-barrier and highly reliable technical path for modeling in complex industrial scenarios.
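A toy example of the kind of MILP such a framework would emit for task allocation is sketched below with PuLP; the robots, tasks, costs, and capacity constraint are illustrative assumptions, not outputs of the paper's models.
```python
# Toy task-allocation MILP: assign 3 tasks to 2 robots minimizing total cost
# (requires the PuLP package, which bundles the CBC solver).
import pulp

robots, tasks = ["r1", "r2"], ["t1", "t2", "t3"]
cost = {("r1", "t1"): 4, ("r1", "t2"): 6, ("r1", "t3"): 5,
        ("r2", "t1"): 5, ("r2", "t2"): 3, ("r2", "t3"): 7}

prob = pulp.LpProblem("task_allocation", pulp.LpMinimize)
x = pulp.LpVariable.dicts("assign", (robots, tasks), cat="Binary")
prob += pulp.lpSum(cost[r, t] * x[r][t] for r in robots for t in tasks)
for t in tasks:                                    # each task done exactly once
    prob += pulp.lpSum(x[r][t] for r in robots) == 1
for r in robots:                                   # simple per-robot capacity limit
    prob += pulp.lpSum(x[r][t] for t in tasks) <= 2

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([(r, t) for r in robots for t in tasks if x[r][t].value() == 1])
```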
|
|
16:50-16:55, Paper ThDT2.3 | |
Augmenting Robotic Disassembly Skill: Combining Compliance Control Strategy with Reinforcement Learning for Twist-Pulling Disassembly |
|
Zang, Yue | School of Engineering, University of Birmingham |
Xu, Xiazhen | University of Birmingham |
Zhang, Yongquan | School of Mechanical and Electronic Engineering, Wuhan Universit |
Hajiyavand, Amir M | University of Birmingham |
Ye, Jiaqi | University of Birmingham |
Wang, Yongjing | University of Birmingham |
Keywords: Intelligent and Flexible Manufacturing, Industrial Robots, Sustainable Production and Service Automation
Abstract: Efficient robotic disassembly of end-of-life products is often impeded by inherent uncertainties in product condition and unknown internal structures. Conventional disassembly methods face challenges when adaptive exploration is required, particularly in cap-shaft disassembly, where connection mechanisms are frequently concealed. This paper proposes a novel robotic twist-and-pull disassembly strategy that integrates compliance control with reinforcement learning (RL). By enabling the robot to adapt to unknown connection geometries and systematic misalignments, the approach enhances the robot's disassembly capabilities and reduces dependence on precisely pre-programmed trajectories. Experimental results confirm that the proposed strategy substantially improves robotic disassembly performance, improves the RL training success rate, and demonstrates strong domain transferability, supporting its application across varied disassembly contexts.
|
|
16:55-17:00, Paper ThDT2.4 | |
Multi-Robot Assembly of Deformable Linear Objects Using Multi-Modal Perception |
|
Chen, Kejia | Technical University of Munich |
Dettmering, Celina | Technical University of Munich (TUM) |
Pachler, Florian | Technical University of Munich |
Liu, Zhuo | Technical University of Munich |
Zhang, Yue | Technical University of Munich |
Cheng, Tailai | Technische Universität München |
Dirr, Jonas | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Daub, Rüdiger | Technical University of Munich (TUM), Fraunhofer IGCV |
Keywords: Intelligent and Flexible Manufacturing, Assembly, Perception for Grasping and Manipulation
Abstract: Industrial assembly of deformable linear objects (DLOs) such as cables offers great potential for many industries. However, DLOs pose several challenges for robot-based automation due to the inherent complexity of deformation and, consequentially, the difficulties in anticipating the behavior of DLOs in dynamic situations. Although existing studies have addressed isolated subproblems like shape tracking, grasping, and shape control, there has been limited exploration of integrated workflows that combine these individual processes. To address this gap, we propose an object-centric perception and planning framework to achieve a comprehensive DLO assembly process throughout the industrial value chain. The framework utilizes visual and tactile information to track the DLO's shape as well as contact state across different stages, which facilitates effective planning of robot actions. Our approach encompasses robot-based bin picking of a target DLO from cluttered environments, followed by a coordinated handover to two additional robots that mount the DLO onto designated fixtures. Real-world experiments employing a setup with multiple robots demonstrate the effectiveness of the approach and its relevance to industrial scenarios.
|
|
17:00-17:05, Paper ThDT2.5 | |
Error Sensitivity Flexibility Compensation of Joints for Improving the Positioning Accuracy of Industrial Robots (I) |
|
Li, Yingjie | Kunming University of Science and Technology |
Gao, Guanbin | Kunming University of Science and Technology |
Keywords: Industrial Robots, Calibration and Identification
Abstract: Flexibility models based on the virtual joint approach (VJA) are essential for error compensation to improve the positioning accuracy of industrial robots across a range of payloads. However, current flexibility models are either not accurate enough, because they consider too few deformation directions, or incorporate too many factors, leading to difficulties in practical applications. This paper proposes a flexibility model based on error sensitivity analysis to improve the positioning accuracy and stability of industrial robots. First, the effects of the joint's flexible deformations in six directions on the positioning error are analyzed by introducing Sobol's method. The analysis indicates that rotational deformations around the X, Y, and Z-axes cause the majority of positioning errors, while only a tiny minority originates from translational deformations along the X, Y, and Z-axes. Then, a mapping equation between the flexible deformations around the X, Y, and Z-axes and the positioning error is derived based on this observation. Finally, a flexibility model for N-degree-of-freedom (DoF) industrial robots is established and an identification method is presented for the flexibility coefficients. Verification experiments are performed on a 6-DoF robot, and an application example is provided for error compensation in robotic assembly tasks. The experimental results show that the proposed model has higher accuracy and stability but lower calculation cost than conventional models. Moreover, after compensation, the pose error is reduced to 0.1 mm and 0.03°, meeting the assembly requirements of the application example.
|
|
17:05-17:10, Paper ThDT2.6 | |
3D-UnOutDet: A Fast and Efficient Unsupervised Snow Removal Algorithm for 3D LiDAR Point Clouds |
|
Raisuddin, Abu Mohammed | Halmstad University |
Gouigah, Idriss | Halmstad University |
Aksoy, Eren Erdal | Halmstad University |
Keywords: Intelligent and Flexible Manufacturing, Deep Learning for Visual Perception, Deep Learning Methods
Abstract: In this work, we propose a novel, fast, and memory-efficient unsupervised statistical method, combined with an unsupervised deep learning (DL) model, for de-snowing 3D LiDAR point clouds in a fully unsupervised fashion. The results obtained on the real-scanned Winter Adverse Driving dataSet (WADS) show that our DL model achieves a 6.3% improvement in mIoU over the current state-of-the-art unsupervised DL methods and performs comparable to supervised counterparts, substantially narrowing the performance gap between supervised and unsupervised approaches. In addition to that, our model also outperforms its closest competitor by 12.8% mIoU when tested on our Canadian Adverse Driving Conditions (CADC) dataset annotations. Additionally, our de-snowing algorithm enhances downstream semantic segmentation and object detection tasks without even requiring any modifications to the base segmentation and detection models. The source code, trained models, and the online supplementary information are available at the following URL: https://sporsho.github.io/3DUnOutDet
|
|
ThDT3 |
403 |
Autonomous Agents 2 |
Regular Session |
|
16:40-16:45, Paper ThDT3.1 | |
SILM: A Subjective Intent Based Low-Latency Framework for Multiple Traffic Participants Joint Trajectory Prediction |
|
Qu, Weiming | Peking University |
Wang, Jia | Peking University |
Du, Jiawei | Peking University |
Zhu, Yuanhao | China Automotive Innovation Corporation |
Yu, Jianfeng | China Automotive Innovation Corporation |
Xia, Rui | China Automotive Innovation Corporation |
Cao, Song | China Automotive Innovation Corporation |
Wu, Xihong | Peking University |
Luo, Dingsheng | Peking University |
Keywords: Autonomous Agents, Agent-Based Systems, Semantic Scene Understanding
Abstract: Trajectory prediction is a fundamental technology for advanced autonomous driving systems and represents one of the most challenging problems in the field of cognitive intelligence. Accurately predicting the future trajectories of each traffic participant is a prerequisite for building high-safety and high-reliability decision-making, planning, and control capabilities in autonomous driving. However, existing methods often focus solely on the motion of other traffic participants without considering the underlying intent behind that motion, which increases the uncertainty in trajectory prediction. Autonomous vehicles operate in real-time environments, meaning that trajectory prediction algorithms must be able to process data and generate predictions in real time. While many existing methods achieve high accuracy, they often struggle to effectively handle heterogeneous traffic scenarios. In this paper, we propose a Subjective Intent-based Low-latency framework for Multiple traffic participants joint trajectory prediction. Our method explicitly incorporates the subjective intent of traffic participants based on their key points, and predicts the future trajectories jointly without a map, which ensures promising performance while significantly reducing prediction latency. Additionally, we introduce a novel dataset designed specifically for trajectory prediction. The related code and dataset will be available soon.
|
|
16:45-16:50, Paper ThDT3.2 | |
DPGP: A Hybrid 2D-3D Dual Path Potential Ghost Probe Zone Prediction Framework for Safe Autonomous Driving |
|
Qu, Weiming | Peking University
Du, Jiawei | Peking University |
Yuan, Shenghai | Nanyang Technological University |
Wang, Jia | Peking University |
Sun, Yang | Southeast University |
Liu, Shengyi | Southeast University |
Zhu, Yuanhao | China Automotive Innovation Corporation |
Rao, Jiayi | China Sanjiang Space Group Co., Ltd |
Yu, Jianfeng | China Automotive Innovation Corporation |
Cao, Song | China Automotive Innovation Corporation |
Xia, Rui | China Automotive Innovation Corporation |
Tang, Xiaoyu | Southeast University |
Wu, Xihong | Peking University |
Luo, Dingsheng | Peking University |
Keywords: Autonomous Agents, Agent-Based Systems, Semantic Scene Understanding
Abstract: Modern robotics often needs to coexist with humans in dense urban environments. One critical challenge is the ghost probe problem, where pedestrians or other objects unexpectedly rush into existing vehicle traffic paths. This issue is inevitable, affecting not only autonomous vehicles but also human drivers. Existing works have proposed vehicle-to-everything (V2X) strategies and non-line-of-sight (NLOS) imaging devices for ghost probe zone detection. However, most of these methods either require high computational power or rely on specialized hardware, making them impractical in many real-world scenarios. Furthermore, many existing approaches do not explicitly address this problem. To tackle this challenge, we propose DPGP, a hybrid 2D-3D fusion framework for ghost probe zone prediction using only a single monocular camera throughout both the training and inference stages. With unsupervised depth prediction, we observe that ghost probe zones are often associated with regions of depth discontinuity. However, different depth representations provide varying levels of robustness. To leverage this, we design a pipeline that fuses multiple feature embeddings to improve the prediction of potential ghost probe zones. To validate our method, we created a new dataset with 12K images annotated with ghost probe zones. This dataset was carefully constructed from multiple sources and cross-checked for accuracy. Experimental results show that our framework achieves superior performance while maintaining cost-effectiveness. To the best of our knowledge, this is the first work to extend ghost probe zone prediction beyond just vehicles, addressing scenarios where various non-vehicle objects can contribute to this problem. We will open-source both the source code and the dataset to benefit the community.
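As a hedged illustration of the depth-discontinuity cue mentioned in the abstract (not the DPGP pipeline itself), the sketch below flags regions of abrupt change in a predicted monocular depth map; the relative gradient threshold and dilation radius are assumed values.

```python
import numpy as np
from scipy import ndimage

def depth_discontinuity_mask(depth, rel_threshold=0.15, dilate_px=5):
    """Flag pixels where predicted depth changes abruptly.

    depth: (H, W) array of predicted metric depth.
    rel_threshold: relative depth jump (fraction of local depth) treated as a
        discontinuity -- an illustrative value, not taken from the paper.
    dilate_px: grow the mask so the flagged zone covers the occluding edge.
    """
    gy, gx = np.gradient(depth)
    grad_mag = np.hypot(gx, gy)
    # Normalize by local depth so distant surfaces are not over-flagged.
    mask = grad_mag > rel_threshold * np.maximum(depth, 1e-3)
    return ndimage.binary_dilation(mask, iterations=dilate_px)

# Example: a synthetic depth map with a near wall occluding a corridor.
depth = np.full((120, 160), 20.0)
depth[:, :60] = 3.0                      # near wall on the left half
zones = depth_discontinuity_mask(depth)
print("flagged pixels:", int(zones.sum()))
```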
|
|
16:50-16:55, Paper ThDT3.3 | |
On-Board Vision-Language Models (VLMs) for Personalized Motion Control of Autonomous Vehicles |
|
Cui, Can | Purdue |
Yang, Zichong | Purdue University |
Zhou, Yupeng | Purdue University West Lafayette |
Peng, Juntong | Purdue University |
Park, Sung-Yeon | Purdue University |
Zhang, Cong | Purdue University |
Ma, Yunsheng | Purdue University |
Cao, Xu | New York University |
Ye, Wenqian | University of Virginia |
Feng, Yiheng | Purdue University
Panchal, Jitesh | Purdue University
Li, Lingxi | Purdue University |
Chen, Yaobin | Purdue University
Wang, Ziran | Purdue University |
Keywords: Autonomous Agents, Computer Vision for Transportation, Human-Centered Automation
Abstract: Personalized driving refers to an autonomous vehicle's ability to adapt its driving behavior or control strategies to match individual users' preferences and driving styles while maintaining safety and comfort standards. However, existing works either fail to capture every individual's preference precisely or become computationally inefficient as the user base expands. Vision-Language Models (VLMs) offer promising solutions on this front through their natural language understanding and scene reasoning capabilities. In this work, we propose a lightweight yet effective on-board VLM framework that provides low-latency personalized driving performance while maintaining strong reasoning capabilities. Our solution incorporates a Retrieval-Augmented Generation (RAG)-based memory module that enables continuous learning of individual driving preferences through human feedback. Through comprehensive real-world vehicle experiments, our system has demonstrated the ability to provide safe, comfortable, and personalized driving experiences across various scenarios and significantly reduce takeover rates by up to 76.9%. To the best of our knowledge, this work represents the first personalized VLM motion control system in real-world autonomous vehicles. The demo video can be watched at https://tinyurl.com/4xsnz79n.
|
|
16:55-17:00, Paper ThDT3.4 | |
EfficientEQA: An Efficient Approach to Open-Vocabulary Embodied Question Answering |
|
Cheng, Kai | Purdue University |
Li, Zhengyuan | Purdue University |
Sun, Xingpeng | Purdue University |
Min, Byung-Cheol | Purdue University |
Bedi, Amrit Singh | University of Maryland, College Park |
Bera, Aniket | Purdue University |
Keywords: Autonomous Agents, Vision-Based Navigation
Abstract: Embodied Question Answering (EQA) is an essential yet challenging task for robot assistants. Large vision-language models (VLMs) have shown promise for EQA, but existing approaches either treat it as static video question answering without active exploration or restrict answers to a closed set of choices. These limitations hinder real-world applicability, where a robot must explore efficiently and provide accurate answers in open-vocabulary settings. To overcome these challenges, we introduce EfficientEQA, a novel framework that couples efficient exploration with free-form answer generation. EfficientEQA features three key innovations: (1) Semantic-Value-Weighted Frontier Exploration (SFE) with Verbalized Confidence (VC) from a black-box VLM to prioritize semantically important areas to explore, enabling the agent to gather relevant information faster; (2) a BLIP relevancy-based mechanism to stop adaptively by flagging highly relevant observations as outliers to indicate whether the agent has collected enough information; and (3) a Retrieval-Augmented Generation (RAG) method for the VLM to answer accurately based on pertinent images from the agent’s observation history without relying on predefined choices. Our experimental results show that EfficientEQA achieves over 15% higher answer accuracy and requires over 20% fewer exploration steps than state-of-the-art methods. Our code is available at: https://github.com/chengkaiAcademyCity/EfficientEQA
|
|
17:00-17:05, Paper ThDT3.5 | |
ManeuverGPT Agentic Control for Safe Autonomous Stunt Maneuvers |
|
Azdam, Shawn | NYU ASAS Lab, Azdam AI |
Doma, Pranav | New York University |
Arab, Aliasghar | NYU |
Keywords: Autonomous Agents, AI-Based Methods, Collision Avoidance
Abstract: The next generation of active safety features in autonomous vehicles should be capable of safely executing evasive hazard-avoidance maneuvers to achieve rapid motion at the limits of vehicle handling. This paper presents a novel framework, ManeuverGPT, for generating and executing highly dynamic stunt maneuvers in autonomous vehicles using large language model (LLM)-based agents as controllers. We target aggressive maneuvers, such as J-turns, within the CARLA simulation environment and demonstrate an iterative, prompt-based approach to refine vehicle control parameters, starting tabula rasa without retraining model weights. We propose an agentic architecture composed of three specialized agents: (1) Query Enricher Agent for contextualizing user commands, (2) Driver Agent for generating maneuver parameters, and (3) Parameter Validator Agent that enforces physics-based and safety constraints. Experimental results demonstrate successful J-turn execution across multiple vehicle models through textual prompts that adapt to differing vehicle dynamics. We evaluate performance via established success criteria and discuss limitations regarding numeric precision and scenario complexity. Our findings underscore the potential of LLM-driven control for high-agility maneuvers, while highlighting the importance of hybrid approaches that combine language-based reasoning with algorithmic validation. We provide an open-source implementation at https://github.com/SHi-ON/ManeuverGPT to foster further research within the broader community.
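The three-agent prompt chain described in the abstract can be illustrated with the hedged sketch below; call_llm is a hypothetical stand-in for any chat-completion API, and the parameter names and safety bounds are illustrative assumptions rather than the authors' implementation.

```python
import json

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call.

    Returns a canned reply here so the sketch runs end-to-end; in practice
    both prompts would be forwarded to an LLM service.
    """
    return '{"throttle": 1.0, "steer": -0.9, "handbrake_time": 1.2}'

def query_enricher(command: str) -> str:
    # Agent 1: contextualize the raw user command (vehicle model, road state, ...).
    return call_llm("Add explicit vehicle and scene context to the command.", command)

def driver_agent(enriched_command: str) -> dict:
    # Agent 2: propose maneuver control parameters, returned as JSON.
    reply = call_llm("Reply with throttle, steer, handbrake_time as JSON.",
                     enriched_command)
    return json.loads(reply)

def parameter_validator(params: dict) -> dict:
    # Agent 3: enforce simple physics/safety bounds before execution
    # (the bounds below are illustrative, not the authors' constraints).
    params["throttle"] = min(max(params["throttle"], 0.0), 1.0)
    params["steer"] = min(max(params["steer"], -1.0), 1.0)
    params["handbrake_time"] = min(max(params["handbrake_time"], 0.0), 2.0)
    return params

if __name__ == "__main__":
    command = "Perform a J-turn to reverse direction quickly."
    safe_params = parameter_validator(driver_agent(query_enricher(command)))
    print(safe_params)   # these parameters would then be applied to the CARLA vehicle
```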
|
|
17:05-17:10, Paper ThDT3.6 | |
DriveGPT4: Interpretable End-To-End Autonomous Driving Via Large Language Model |
|
Xu, Zhenhua | The Hong Kong University of Science and Technology |
Zhang, Yujia | Zhejiang University |
Xie, Enze | The University of Hong Kong |
Zhao, Zhen | Shanghai AI Lab |
Guo, Yong | Max Planck Institute for Informatics |
Wong, Kwan-Yee Kenneth | The University of Hong Kong |
Li, Zhenguo | Huawei Noah's Ark Lab |
Zhao, Hengshuang | The University of Hong Kong |
Keywords: Autonomous Agents, Intelligent Transportation Systems, Visual Learning
Abstract: Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning with non-textual data, including images and videos. This study seeks to extend the application of MLLMs to the realm of autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on LLMs. Capable of processing multi-frame video inputs and textual queries, DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users. Furthermore, DriveGPT4 predicts low-level vehicle control signals in an end-to-end fashion. These advanced capabilities are achieved through the utilization of a bespoke visual instruction tuning dataset, specifically tailored for autonomous driving applications, in conjunction with a mix-finetuning training strategy. DriveGPT4 represents the pioneering effort to leverage LLMs for the development of an interpretable end-to-end autonomous driving solution. Evaluations conducted on the BDD-X dataset showcase the superior qualitative and quantitative performance of DriveGPT4. Additionally, the fine-tuning of domain-specific data enables DriveGPT4 to yield close or even improved results in terms of autonomous driving grounding when contrasted with GPT4-V. The webpage of this paper is available at https://tonyxuqaq.github.io/projects/DriveGPT4/.
|
|
ThDT4 |
404 |
AI-Enabled Robotics 4 |
Regular Session |
|
16:40-16:45, Paper ThDT4.1 | |
BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving |
|
Winter, Katharina | Munich University of Applied Sciences |
Azer, Mark | Hochschule München |
Flohr, Fabian | Munich University of Applied Sciences |
Keywords: AI-Based Methods, Autonomous Agents, Task and Motion Planning
Abstract: Autonomous driving has the potential to set the stage for more efficient future mobility, requiring the research domain to establish trust through safe, reliable and transparent driving. Large Language Models (LLMs) possess reasoning capabilities and natural language understanding, presenting the potential to serve as generalized decision-makers for ego-motion planning that can interact with humans and navigate environments designed for human drivers. While this research avenue is promising, current autonomous driving approaches are challenged by combining 3D spatial grounding and the reasoning and language capabilities of LLMs. We introduce BEVDriver, an LLM-based model for end-to-end closed-loop driving in CARLA that utilizes latent BEV features as perception input. BEVDriver includes a BEV Encoder to efficiently process multi-view images and 3D LiDAR point clouds. Within a common latent space, the BEV features are propagated through a Q-Former to align with natural language instructions and passed to the LLM that predicts and plans precise future trajectories while considering navigation instructions and critical scenarios. On the LangAuto benchmark, our model reaches up to 18.9% higher performance on the Driving Score compared to SoTA methods.
|
|
16:45-16:50, Paper ThDT4.2 | |
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotic Manipulation |
|
Singh, Harsh | Mohamed Bin Zayed University of Artificial Intelligence |
Das, Rocktim Jyoti | MBZUAI, Abu Dhabi |
Han, Mingfei | Mohamed Bin Zayed University of Artificial Intelligence |
Nakov, Preslav | Mohamed Bin Zayed University of Artificial Intelligence |
Laptev, Ivan | INRIA |
Keywords: AI-Based Methods, Manipulation Planning, Agent-Based Systems
Abstract: Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotic manipulation and navigation. While recent work in robotics deploys LLMs for high-level and low-level planning, existing methods often face challenges with failure recovery and suffer from hallucinations in long-horizon tasks. To address these limitations, we propose a novel multi-agent LLM framework, Multi-Agent Large Language Model for Manipulation (MALMM). Notably, MALMM distributes planning across three specialized LLM agents, namely a high-level planning agent, a low-level control agent, and a supervisor agent. Moreover, by incorporating environment observations after each step, our framework effectively handles intermediate failures and enables adaptive re-planning. Unlike existing methods, MALMM does not rely on pre-trained skill policies or in-context learning examples and generalizes to unseen tasks. In our experiments, MALMM demonstrates excellent performance in solving previously unseen long-horizon manipulation tasks, and outperforms existing zero-shot LLM-based methods in RLBench by a large margin. Experiments with the Franka robot arm further validate our approach in real-world settings.
|
|
16:50-16:55, Paper ThDT4.3 | |
Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing |
|
Zhu, Jun | Tsinghua University |
Du, Zihao | Tsinghua University |
Xu, Haotian | Tsinghua University |
Lan, Fengbo | Tsinghua University |
Zheng, Zilong | BIGAI |
Ma, Bo | Tsinghua University |
Wang, Shengjie | Tsinghua University |
Zhang, Tao | Tsinghua University |
Keywords: AI-Enabled Robotics, Vision-Based Navigation, Domestic Robotics
Abstract: Task-aware navigation continues to be a challenging area of research, especially in scenarios involving open vocabulary. Previous studies primarily focus on finding suitable locations for task completion, often overlooking the importance of the robot's pose. However, the robot's orientation is crucial for successfully completing tasks because of how objects are arranged (e.g., to open a refrigerator door). Humans intuitively navigate to objects with the right orientation using semantics and common sense. For instance, when opening a refrigerator, we naturally stand in front of it rather than to the side. Recent advances suggest that Vision-Language Models (VLMs) can provide robots with similar common sense. Therefore, we develop a VLM-driven method called Navigation-to-Gaze (Navi2Gaze) for efficient navigation and object gazing based on task descriptions. This method uses the VLM to score and select the best pose from numerous candidates automatically. In evaluations on multiple photorealistic simulation benchmarks, Navi2Gaze significantly outperforms existing approaches by precisely determining the optimal orientation relative to target objects, resulting in a 68.8% reduction in Distance to Goal (DTG). Real-world video demonstrations can be found on the supplementary website.
|
|
16:55-17:00, Paper ThDT4.4 | |
Intelligent LiDAR Navigation: Leveraging External Information and Semantic Maps with LLM As Copilot |
|
Xie, Fujing | Shanghaitech University |
Zhang, Jiajie | Shanghaitech University |
Schwertfeger, Sören | ShanghaiTech University |
Keywords: AI-Enabled Robotics, Motion and Path Planning, Reactive and Sensor-Based Planning
Abstract: Traditional robot navigation systems primarily utilize occupancy grid maps and laser-based sensing technologies, as demonstrated by the popular move_base package in ROS. Unlike robots, humans navigate not only through spatial awareness and physical distances but also by integrating external information, such as elevator maintenance updates from public notification boards and experiential knowledge, like the need for special access through certain doors. With the development of Large Language Models (LLMs), which possess text understanding and intelligence close to human performance, there is now an opportunity to infuse robot navigation systems with a level of understanding akin to human cognition. In this study, we propose using osmAG (Area Graph in OpenStreetMap textual format), an innovative semantic topometric hierarchical map representation, to bridge the gap between the capabilities of ROS move_base and the contextual understanding offered by LLMs. Our methodology employs LLMs as an actual copilot in robot navigation, enabling the integration of a broader range of informational inputs, while maintaining the robustness of traditional robotic navigation systems. Our code, demo, map, and experiment results can be accessed at https://github.com/xiexiexiaoxiexie/Intelligent-LiDAR-Navigation-LLM-as-Copilot.
|
|
17:00-17:05, Paper ThDT4.5 | |
Multi-UAV Formation Control with Static and Dynamic Obstacle Avoidance Via Reinforcement Learning |
|
Xie, Yuqing | Tsinghua University |
Yu, Chao | Tsinghua University |
Zang, Hongzhi | Tsinghua University |
Gao, Feng | Tsinghua University |
Tang, Wenhao | Tsinghua University |
Huang, Jingyi | University of Oxford |
Chen, Jiayu | Tsinghua University |
Xu, Botian | Tsinghua University |
Wu, Yi | Tsinghua University |
Wang, Yu | Tsinghua University |
Keywords: AI-Enabled Robotics, Reinforcement Learning, Aerial Systems: Perception and Autonomy
Abstract: This paper tackles the challenging task of maintaining formation among multiple unmanned aerial vehicles (UAVs) while avoiding both static and dynamic obstacles during directed flight. The complexity of the task arises from its multi-objective nature, the large exploration space, and the sim-to-real gap. To address these challenges, we propose a two-stage reinforcement learning (RL) pipeline. In the first stage, we randomly search for a reward function that balances key objectives: directed flight, obstacle avoidance, formation maintenance, and zero-shot policy deployment. The second stage applies this reward function to more complex scenarios and utilizes curriculum learning to accelerate policy training. Additionally, we incorporate an attention-based observation encoder to improve formation maintenance and adaptability to varying obstacle densities. Experimental results in both simulation and real-world environments demonstrate that our method outperforms both planning-based and RL-based baselines in terms of collision-free rates and formation maintenance across static, dynamic, and mixed obstacle scenarios. Ablation studies further confirm the effectiveness of our curriculum learning strategy and attention-based encoder. Animated demonstrations are available at https://sites.google.com/view/uav-formation-with-avoidance/.
|
|
17:05-17:10, Paper ThDT4.6 | |
Robust Deep Reinforcement Learning in Robotics Via Adaptive Gradient-Masked Adversarial Attacks |
|
Zhang, Zongyuan | The University of Hong Kong |
Duan, Tianyang | The University of Hong Kong |
Lin, Zheng | Fudan University |
Huang, Dong | The University of Hong Kong |
Fang, Zihan | Fudan University |
Sun, Zekai | The University of Hong Kong |
Xiong, Ling | Xihua University |
Liang, Hongbin | Southwest Jiaotong University |
Cui, Heming | The University of Hong Kong |
Cui, Yong | Tsinghua University |
Gao, Yue | Fudan University |
Keywords: AI-Enabled Robotics, Agent-Based Systems, Autonomous Agents
Abstract: Deep reinforcement learning (DRL) has emerged as a promising approach for robotic control, but its real-world deployment remains challenging due to its vulnerability to environmental perturbations. Existing white-box adversarial attack methods, adapted from supervised learning, fail to effectively target DRL agents as they overlook temporal dynamics and indiscriminately perturb all state dimensions, limiting their impact on long-term rewards. To address these challenges, we propose the Adaptive Gradient-Masked Reinforcement (AGMR) Attack, a white-box attack method that combines DRL with a gradient-based soft masking mechanism to dynamically identify critical state dimensions and optimize adversarial policies. AGMR selectively allocates perturbations to the most impactful state features and incorporates a dynamic adjustment mechanism to balance exploration and exploitation during training. Extensive experiments demonstrate that AGMR outperforms state-of-the-art adversarial attack methods in degrading the performance of the victim agent and enhances the victim agent's robustness through adversarial defense mechanisms.
|
|
ThDT5 |
407 |
Tendon/Wire Mechanism |
Regular Session |
|
16:40-16:45, Paper ThDT5.1 | |
Design Optimization of Three-Dimensional Wire Arrangement Considering Wire Crossings for Tendon-Driven Robots |
|
Kawaharazuka, Kento | The University of Tokyo |
Inoue, Shintaro | The University of Tokyo |
Sahara, Yuta | The University of Tokyo |
Yoneda, Keita | The University of Tokyo |
Suzuki, Temma | The University of Tokyo |
Okada, Kei | The University of Tokyo |
Keywords: Tendon/Wire Mechanism, Methods and Tools for Robot System Design, Mechanism Design
Abstract: Tendon-driven mechanisms are useful from the perspectives of variable stiffness, redundant actuation, and lightweight design, and they are widely used, particularly in hands, wrists, and waists of robots. The design of these wire arrangements has traditionally been done empirically, but it becomes extremely challenging when dealing with complex structures. Various studies have attempted to optimize wire arrangement, but many of them have oversimplified the problem by imposing conditions such as restricting movements to a 2D plane, keeping the moment arm constant, or neglecting wire crossings. Therefore, this study proposes a three-dimensional wire arrangement optimization that takes wire crossings into account. We explore wire arrangements through a multi-objective black-box optimization method that ensures wires do not cross while providing sufficient joint torque along a defined target trajectory. For a 3D link structure, we optimize the wire arrangement under various conditions, demonstrate its effectiveness, and discuss the obtained design solutions.
|
|
16:45-16:50, Paper ThDT5.2 | |
Vibration-Assisted Hysteresis Mitigation for Achieving High Compensation Efficiency |
|
Park, Myeongbo | DGIST |
An, Chunggil | DGIST |
Park, Junhyun | DGIST |
Kang, Jonghyun | DGIST |
Hwang, Minho | Daegu Gyeongbuk Institute of Science and Technology (DGIST)
Keywords: Tendon/Wire Mechanism, Modeling, Control, and Learning for Soft Robots, Flexible Robotics
Abstract: Tendon-sheath mechanisms (TSMs) are widely used in minimally invasive surgical (MIS) applications, but their inherent hysteresis—caused by friction, backlash, and tendon elongation—leads to significant tracking errors. Conventional modeling and compensation methods struggle with these nonlinearities and require extensive parameter tuning. To address this, we propose a vibration-assisted hysteresis compensation approach, where controlled vibrational motion is applied along the tendon’s movement direction to mitigate friction and reduce dead zones. Experimental results demonstrate that the exerted vibration consistently reduces hysteresis across all tested frequencies, decreasing RMSE by up to 23.41% (from 2.2345 mm to 1.7113 mm) and improving correlation, leading to more accurate trajectory tracking. When combined with a Temporal Convolutional Network (TCN)-based compensation model, vibration further enhances performance, achieving an 85.2% reduction in MAE (from 1.334 mm to 0.1969 mm). Without vibration, the TCN-based approach still reduces MAE by 72.3% (from 1.334 mm to 0.370 mm) under the same parameter settings. These findings confirm that vibration effectively mitigates hysteresis, improving trajectory accuracy and enabling more efficient compensation models with fewer trainable parameters. This approach provides a scalable and practical solution for TSM-based robotic applications, particularly in MIS.
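The percentage improvements quoted above follow directly from the reported error values; a quick arithmetic check:

```python
def reduction(before: float, after: float) -> float:
    """Percent reduction from 'before' to 'after'."""
    return 100.0 * (before - after) / before

print(f"RMSE with vibration:   {reduction(2.2345, 1.7113):.2f}%")  # ~23.41%
print(f"MAE, TCN + vibration:  {reduction(1.334, 0.1969):.1f}%")   # ~85.2%
print(f"MAE, TCN only:         {reduction(1.334, 0.370):.1f}%")    # ~72.3%
```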
|
|
16:50-16:55, Paper ThDT5.3 | |
An RGB-D Camera-Based Multi-Small Flying Anchors Control for Wire-Driven Robots Connecting to the Environment |
|
Inoue, Shintaro | The University of Tokyo |
Kawaharazuka, Kento | The University of Tokyo |
Yoneda, Keita | The University of Tokyo |
Yuzaki, Sota | The University of Tokyo |
Sahara, Yuta | The University of Tokyo |
Suzuki, Temma | The University of Tokyo |
Okada, Kei | The University of Tokyo |
Keywords: Tendon/Wire Mechanism, Multi-Robot Systems, Software-Hardware Integration for Robot Systems
Abstract: In order to expand the operational range and payload capacity of robots, wire-driven robots that leverage the external environment have been proposed. Such robots can exert forces and operate in spaces far beyond those dictated by their own structural limits. However, for practical use, robots must autonomously attach multiple wires to the environment based on environmental recognition—an operation so difficult that many wire-driven robots remain restricted to specialized, pre-designed environments. In this study, we propose a robot that autonomously connects multiple wires to the environment by employing a multi-small flying anchor system, as well as an RGB-D camera-based control and environmental recognition method. Each flying anchor is a drone with an anchoring mechanism at the wire tip, allowing the robot to attach wires by flying into position. Using the robot's RGB-D camera to identify suitable attachment points and a flying anchor position, the system can connect wires in environments that are not specially prepared, and can also attach multiple wires simultaneously. Through this approach, a wire-driven robot can autonomously attach its wires to the environment, thereby realizing the benefits of wire-driven operation at any location.
|
|
16:55-17:00, Paper ThDT5.4 | |
Robotic Manipulation of a Rotating Chain with Bottom End Fixed |
|
Chen, Qi Jing | Nanyang Technological University |
Shan, Shilin | Nanyang Technological University |
Pham, Quang-Cuong | NTU Singapore |
Keywords: Tendon/Wire Mechanism, Manipulation Planning, Dynamics
Abstract: This paper studies the problem of using a robot arm to manipulate a uniformly rotating chain with its bottom end fixed. Existing studies have investigated ideal rotational shapes for practical applications, yet they do not discuss how these shapes can be consistently achieved through manipulation planning. Our work presents a manipulation strategy for stable and consistent shape transitions. We find that the configuration space of such a chain is homeomorphic to a three-dimensional cube. Using this property, we suggest a strategy to manipulate the chain into different configurations, specifically from one rotation mode to another, while taking stability and feasibility into consideration. We demonstrate the effectiveness of our strategy in physical experiments by successfully transitioning from rest to the first two rotation modes. The concepts explored in our work have critical applications in ensuring safety and efficiency of drill string and yarn spinning operations.
|
|
17:00-17:05, Paper ThDT5.5 | |
Accurate Simulation and Parameter Identification of Deformable Linear Objects Using Discrete Elastic Rods in Generalized Coordinates |
|
Chen, Qi Jing | Nanyang Technological University |
Bretl, Timothy | University of Illinois at Urbana-Champaign |
Pham, Quang-Cuong | NTU Singapore |
Keywords: Tendon/Wire Mechanism, Actuation and Joint Mechanisms, Flexible Robotics
Abstract: This paper presents a fast and accurate model of a deformable linear object (DLO) -- e.g., a rope, wire, or cable -- integrated into an established robot physics simulator, MuJoCo. Most accurate DLO models with low computational times exist in standalone numerical simulators, which either cannot handle external objects or require tedious work to do so. Based on an existing state-of-the-art DLO model -- Discrete Elastic Rods (DER) -- our implementation provides an improvement in accuracy over MuJoCo's own native cable model. To minimize computational load, our model utilizes force-lever analysis to adapt the Cartesian stiffness forces of the DER into its generalized coordinates. As a key contribution, we introduce a novel parameter identification pipeline designed for both simplicity and accuracy, which we utilize to determine the bending and twisting stiffness of three distinct DLOs. We then evaluate the performance of each model by simulating the DLOs and comparing them to their real-world counterparts and against theoretically proven validation tests.
|
|
17:05-17:10, Paper ThDT5.6 | |
Novel Cable Driven Fitness Gym Devices for Whole Body Weight Training |
|
Park, Junghoon | Samsung Electronics Co., Ltd |
Kim, Yongtae Giovanni | Samsung Research |
Kim, Dong Hyun | Samsung Research |
Shin, Gyowook | Samsung Research |
Kim, Sang-Hun | Samsung Research |
Hyung, SeungYong | Samsung Electronics Co., Ltd |
Keywords: Soft Robot Applications, Tendon/Wire Mechanism, Soft Robot Materials and Design
Abstract: We previously proposed cable-driven wearable devices for exercise and gait assistance. These were lightweight, suit-type devices with programmable resistance or assistance adjustment capabilities. In this paper, we introduce 1) a wearable gym device designed to focus on lower limb muscles and 2) a stationary gym device for comprehensive strength (weight) training. The actuation module can be used interchangeably for both devices. This actuation module includes a control board and a cable-driven actuator with a smaller size, improved strength, and greater speed compared to the previous version. Its compact size allows easy integration into our proposed devices. To evaluate the effectiveness of these devices, we conducted surface electromyography (sEMG) experiments during exercises, comparing the effects of the developed devices with traditional dumbbells to confirm their efficacy. In addition, the Borg rating of perceived exertion was also analyzed to confirm the effectiveness of our devices as weight training devices.
|
|
17:10-17:15, Paper ThDT5.7 | |
Braking Control in Clutched-Elastic Robots: Coordinating the Underactuation-To-Actuation Transition |
|
Rakcevic, Vasilije | Technical University of Munich |
Ossadnik, Dennis | Technical University of Munich |
Pozo Fortunić, Edmundo | Technical University of Munich |
Yildirim, Mehmet Can | Technical University of Munich |
Le Mesle, Valentin | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Underactuated Robots, Hardware-Software Integration in Robotics, Optimization and Optimal Control
Abstract: Robots with intrinsic joint elasticity can perform highly dynamic manoeuvres by leveraging energy storage and release, enabling explosive motions such as throwing. By augmenting elastic robots with clutch mechanisms, link decoupling can be used to fully exploit inertial coupling effects and gravitational acceleration in motion while effectively circumventing spring deflection limits. However, braking such systems in a decoupled state presents a challenge, as re-engaging the link risks damaging the joint. While optimal control strategies could be applied, they are not inherently safe due to model uncertainties. To address this, we propose a feedback-based two-stage method that coordinates the transition through the hybrid modes of the system. These modes are characterized by underactuated and actuated dynamics. First, a decoupled link is braked via inertial coupling until a safe velocity for clutching is reached, after which the link is re-coupled and actively braked. We demonstrate the effectiveness of this method through simulations comparing it with optimal control and validate it experimentally using a physical prototype.
|
|
17:15-17:20, Paper ThDT5.8 | |
Model Predictive Control for Cable-Driven Remote Actuation Systems with Friction and Compliance |
|
Forouhar, Moein | Technische Universität München |
Sadeghian, Hamid | Technical University of Munich |
Li, Yu | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Tendon/Wire Mechanism, Actuation and Joint Mechanisms
Abstract: In this work, a cable-driven remote-actuated system with friction and compliance in its dynamics is modeled and controlled using a Model Predictive Controller (MPC) for a regulation problem. In contrast to flexible joint systems, where there is no friction in the compliant elements, in the presented model friction exists in the two compliant cable-sheaths connecting the motor and the link. Three controllers are designed for the system based on the cascade control theory for flexible joint systems and the MPC framework. The performances of the controllers are compared in simulation as well as in experiments on a designed testbed. The results show that the MPC-Cascade control scheme exhibits the best performance with relatively fast convergence behaviour.
|
|
ThDT6 |
301 |
Data Sets for Robotics 2 |
Regular Session |
|
16:40-16:45, Paper ThDT6.1 | |
Sensing Differently: Unifying Vision, Language, Posture and Tactile in Robotic Perception |
|
Zhou, Yanmin | Tongji University |
Jin, Yiyang | Tongji University |
Jiang, Rong | Tongji University |
Li, Xin | Tongji University |
Sang, HongRui | Tongji University |
Jiang, Shuo | Tongji University |
Wang, Zhipeng | Tongji University |
He, Bin | Tongji University |
Keywords: Data Sets for Robot Learning, Force and Tactile Sensing, Grasping
Abstract: Multi-modal fusion perception enhances robotic performance in complex tasks by providing more comprehensive information than single modality. While tactile and proprioceptive sensing are effective for direct contact tasks like grasping, current research mainly focuses on vision-language fusion, neglecting other embodied modalities. The primary challenges of this limitation are the difficulty in generating natural language labels for embodied information like tactile and proprioception and aligning them with vision and language. To address this, we introduce VLaPT, a novel multi-modal grasping dataset that aligns vision and language (VL) with posture and tactile (PT), enabling robots to sense differently from environment to self. VLaPT includes 75 objects, 1,533 grasps, and over 78K synchronized vision-language-posture-tactile pairs. The dataset incorporates structured, rich-text descriptions generated using modality-level language annotation templates, ensuring effective cross-modality alignment. Leveraging this dataset, we trained a lightweight multi-modal alignment framework, CLIP-ME, which enhances the performance of several downstream tasks with only a 5% increase in parameters. The VLaPT is publicly available in https://huggingface.co/datasets/xsdfasfgsa/VLaPT.
|
|
16:45-16:50, Paper ThDT6.2 | |
WHALES: A Multi-Agent Scheduling Dataset for Enhanced Cooperation in Autonomous Driving |
|
Wang, Richard | Tsinghua University |
Chen, Siwei | Tsinghua University |
Ziyi, Song | Tsinghua University |
Zhou, Sheng | Tsinghua University |
Keywords: Data Sets for Robot Learning, Agent-Based Systems, Autonomous Agents
Abstract: Cooperative perception research is hindered by the limited availability of datasets that capture the complexity of real-world Vehicle-to-Everything (V2X) interactions, particularly under dynamic communication constraints. To address this gap, we introduce WHALES (Wireless enHanced Autonomous vehicles with Large number of Engaged agentS), the first large-scale V2X dataset explicitly designed to benchmark communication-aware agent scheduling and scalable cooperative perception. WHALES introduces a new benchmark that enables state-of-the-art (SOTA) research in communication-aware cooperative perception, featuring an average of 8.4 cooperative agents per scene and 2.01 million annotated 3D objects across diverse traffic scenarios. It incorporates detailed communication metadata to emulate real-world communication bottlenecks, enabling rigorous evaluation of scheduling strategies. To further advance the field, we propose the Coverage-Aware Historical Scheduler (CAHS), a novel scheduling baseline that selects agents based on historical viewpoint coverage, improving perception performance over existing SOTA methods. WHALES bridges the gap between simulated and real-world V2X challenges, providing a robust framework for exploring perception-scheduling co-design, cross-data generalization, and scalability limits. The WHALES dataset and code are available at https://github.com/chensiweiTHU/WHALES.
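As a hedged illustration of the coverage-aware scheduling idea behind CAHS (not the released implementation), the sketch below greedily selects agents whose historical viewpoints add the most new coverage under a fixed communication budget; representing historical viewpoint coverage as sets of observed map cells is an assumption.

```python
def greedy_coverage_schedule(candidate_coverage, budget):
    """Pick up to `budget` agents whose historical viewpoints add the most coverage.

    candidate_coverage: dict mapping agent id -> set of map cells that the agent
    has recently observed (an assumed representation of viewpoint coverage).
    Returns the selected agent ids, greedily maximizing the union of covered cells.
    """
    covered = set()
    selected = []
    for _ in range(budget):
        best_agent, best_gain = None, 0
        for agent, cells in candidate_coverage.items():
            if agent in selected:
                continue
            gain = len(cells - covered)      # newly covered cells this agent adds
            if gain > best_gain:
                best_agent, best_gain = agent, gain
        if best_agent is None:               # no remaining agent adds coverage
            break
        selected.append(best_agent)
        covered |= candidate_coverage[best_agent]
    return selected

# Example with three cooperating agents and a budget of two communication links.
coverage = {"cav_1": {1, 2, 3}, "cav_2": {3, 4}, "rsu_1": {4, 5, 6, 7}}
print(greedy_coverage_schedule(coverage, budget=2))   # ['rsu_1', 'cav_1']
```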
|
|
16:50-16:55, Paper ThDT6.3 | |
Extraction of Robotic Surface Processing Strategies from Human Demonstrations |
|
Eiband, Thomas | German Aerospace Center (DLR) |
Leimbach, Lars | TU Munich |
Nottensteiner, Korbinian | German Aerospace Center (DLR) |
Albu-Schäffer, Alin | DLR - German Aerospace Center |
Keywords: Data Sets for Robot Learning, Datasets for Human Motion, Human and Humanoid Motion Analysis and Synthesis
Abstract: Learning from Demonstration (LfD) is a widely used approach for teaching robot motion, but more sophisticated strategies are required to address complex tasks such as surface processing. Sanding is an example where comprehensive strategies are necessary to ensure complete and efficient coverage of the surface of a workpiece. In this paper, we present a system that captures human motions and contact forces during surface processing using a powered sanding tool. We provide a publicly available dataset that consists of demonstrations for various geometric shapes with the goal to extract robot execution strategies through LfD from a variety of users. This is in contrast to conventional LfD, which generates a policy directly from one or multiple trajectories provided by a single user. Further, we provide a data analysis that reveals key insights into how humans adapt their strategies to different surface geometries and extract robot execution strategies from it. Finally, we conduct two basic robotic experiments justifying the approach of strategy extraction. Our findings contribute to the understanding of human surface-processing behavior and lay the foundation for developing more effective robotic surface processing strategies.
|
|
16:55-17:00, Paper ThDT6.4 | |
DriveLMM-O1: A Step-By-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding |
|
Ishaq, Ayesha | Mohamed Bin Zayed University of Artificial Intelligence |
Lahoud, Jean | MBZUAI |
More, Ketan | Mohamed Bin Zayed University of Artificial Intelligence |
Thawakar, Omkar | Mohamed Bin Zayed University of Artificial Intelligence |
Thawkar, Ritesh | Mohamed Bin Zayed University of Artificial Intelligence |
Dissanayake, Dinura | Mohamed Bin Zayed University of Artificial Intelligence |
Ahsan, Noor | Mohamed Bin Zayed University of Artificial Intelligence |
Li, Yuhao | Mohamed Bin Zayed University of Artificial Intelligence |
Khan, Fahad | Linkoping University |
Cholakkal, Hisham | MBZUAI |
Laptev, Ivan | INRIA |
Anwer, Rao | MBZUAI |
Khan, Salman | CSIRO |
Keywords: Data Sets for Robot Learning, Computer Vision for Transportation, Big Data in Robotics and Automation
Abstract: While large multimodal models have demonstrated strong performance across various visual question answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common visual question answering benchmarks often focus on the accuracy of the final answer, while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our framework, dataset, and model will be made publicly available.
|
|
17:00-17:05, Paper ThDT6.5 | |
TBAP: Tapping-Based Auditory Perception for Identifying Container Materials |
|
Li, Zehao | Tsinghua University |
Guo, Hao | University of Science and Technology of China |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Ding, Wenbo | Tsinghua University |
Keywords: Data Sets for Robot Learning, Robot Audition
Abstract: In this study, in order to address the robotic auditory perception problem, we propose a novel framework for object material recognition of common containers, which combines deep learning with active auditory perception to achieve breakthrough results. We developed a modular robotic system for acoustic data acquisition that employs a hybrid mechanism of vertical translation and horizontal rotation that is capable of performing full-scale tapping in three dimensions. The system is capable of creating an acoustic dataset consisting of 50 containers made of five materials, which improves the data acquisition efficiency by 93.9% compared to manual operations. In addition, we propose an end-to-end transfer learning model, TBAP, which is trained on a crawler-generated pre-training dataset and 50 real scene samples, and achieves a recognition accuracy of 91.0% for unseen materials. To improve reliability, we design a dynamic confidence assessment mechanism that generates confidence indices through probability distribution analysis and feature stability assessment to support robust robot decision-making. Experimental results show that the framework greatly improves data acquisition efficiency while maintaining high recognition accuracy, providing a valuable tool for advancing acoustic perception research.
|
|
17:05-17:10, Paper ThDT6.6 | |
Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks |
|
Gajdošech, Lukáš | Comenius University |
Ali, Hassan | University of Hamburg |
Habekost, Jan-Gerrit | University of Hamburg |
Madaras, Martin | Comenius University Bratislava |
Kerzel, Matthias | Uni Hamburg |
Wermter, Stefan | University of Hamburg |
Keywords: Data Sets for Robotic Vision, Human-Robot Collaboration, RGB-D Perception
Abstract: Datasets for object detection often do not account for enough variety of glasses, due to their transparent and reflective properties. Specifically, open-vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This scientific gap poses an issue for robotic applications that suffer from accumulating errors between detection, planning, and action execution. This paper introduces a novel method for acquiring real-world data from RGB-D sensors that minimizes human effort. We propose an auto-labeling pipeline that generates labels for all the acquired frames based on the depth measurements. We provide a novel real-world glass object dataset (GlassNICOLDataset) that was collected on the Neuro-Inspired COLlaborator (NICOL), a humanoid robot platform. The dataset consists of 7850 images recorded from five different cameras. We show that our trained baseline model outperforms state-of-the-art open-vocabulary approaches. In addition, we deploy our baseline model in an embodied agent approach to the NICOL platform, on which it achieves a success rate of 81% in a human-robot bartending scenario.
|
|
17:10-17:15, Paper ThDT6.7 | |
TartanGround: A Large-Scale Dataset for Ground Robot Perception and Navigation |
|
Patel, Manthan | ETH Zurich |
Yang, Fan | ETH Zurich |
Qiu, Yuheng | Carnegie Mellon University |
Cadena, Cesar | ETH Zurich |
Scherer, Sebastian | Carnegie Mellon University |
Hutter, Marco | ETH Zurich |
Wang, Wenshan | Carnegie Mellon University |
Keywords: Data Sets for Robotic Vision, Data Sets for SLAM
Abstract: We present TartanGround, a large-scale, multi-modal dataset to advance the perception and autonomy of ground robots operating in diverse environments. This dataset, collected in various photorealistic simulation environments, includes multiple RGB stereo cameras for 360-degree coverage, along with depth, optical flow, stereo disparity, LiDAR point clouds, ground truth poses, semantically segmented images, and occupancy maps with semantic labels. Data is collected using an integrated automatic pipeline, which generates trajectories mimicking the motion patterns of various ground robot platforms, including wheeled and legged robots. We collect 806 trajectories across 60 environments, resulting in 1.4 million samples. Evaluations on occupancy prediction and SLAM tasks reveal that state-of-the-art methods trained on existing datasets struggle to generalize across diverse scenes. TartanGround can serve as a testbed for training and evaluation of a broad range of learning-based tasks, including occupancy prediction, SLAM, neural scene representation, perception-based navigation, and more, enabling advancements in robotic perception and autonomy towards achieving robust models generalizable to more diverse scenarios. The dataset and codebase are available on the webpage: https://tartanair.org/tartanground
|
|
17:15-17:20, Paper ThDT6.8 | |
RH20T-P: A Primitive-Level Robotic Manipulation Dataset towards Composable Generalization Agents in Real-World Scenarios |
|
Chen, Zeren | Beihang University |
Shi, Zhelun | Beihang University |
Lu, Xiaoya | Shanghai Jiao Tong University |
He, Lehan | Beihang University |
Qian, Sucheng | Shanghai Jiao Tong University |
Zhou, Enshen | BeiHang University |
Yin, Zhenfei | The University of Sydney |
Ouyang, Wanli | The University of Sydney |
Shao, Jing | Shanghai AI Laboratory |
Qiao, Yu | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
Lu, Cewu | Shanghai Jiao Tong University |
Sheng, Lu | Beihang University (BUAA) |
Keywords: Data Sets for Robotic Vision, Task and Motion Planning, Deep Learning Methods
Abstract: Achieving generalizability in solving out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress of Vision-Language Models (VLMs) has shown that VLM-based task planners can alleviate the difficulty of solving novel tasks, by decomposing compounded tasks into a plan of sequentially executing primitive-level skills that have already been mastered. It is also promising for robotic manipulation to adapt such composable generalization ability, in the form of composable generalization agents (CGAs). However, the community lacks a reliable design of primitive skills and a sufficient amount of primitive-level data annotations. Therefore, we propose RH20T-P, a primitive-level robotic manipulation dataset, which contains about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. Each clip is manually annotated according to a set of meticulously designed primitive skills that are common in robotic manipulation. Furthermore, we standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on our RH20T-P, whose positive performance on solving unseen tasks validates that the proposed dataset can offer composable generalization ability to robotic manipulation agents. Project homepage: https://sites.google.com/view/rh20t-primitive/main.
|
|
ThDT7 |
307 |
Human Detection and Tracking |
Regular Session |
|
16:40-16:45, Paper ThDT7.1 | |
IMM-MOT: A Novel 3D Multi-Object Tracking Framework with Interacting Multiple Model Filter |
|
Liu, Xiaohong | Xidian University |
Zhao, Xulong | Xidian University |
Liu, Gang | School of Computer Science and Technology, Xidian University |
Wu, Zili | School of Computer Science and Technology, Xidian University |
Wang, Tao | School of Computer Science and Technology, Xidian University |
Meng, Lei | School of Computer Science and Technology, Xidian University |
Wang, Yuhan | School of Computer Science and Technology, Xidian University |
Keywords: Human Detection and Tracking, Visual Tracking
Abstract: 3D Multi-Object Tracking (MOT) provides the trajectories of surrounding objects, assisting robots or vehicles in smarter path planning and obstacle avoidance. Existing 3D MOT methods based on the Tracking-by-Detection framework typically use a single motion model to track an object throughout its entire tracking process. However, objects may change their motion patterns due to variations in the surrounding environment. In this paper, we introduce the Interacting Multiple Model filter in IMM-MOT, which accurately fits the complex motion patterns of individual objects, overcoming the limitation of single-model tracking in existing approaches. In addition, we incorporate a Damping Window mechanism into the trajectory lifecycle management, leveraging the continuous association status of trajectories to control their creation and termination, reducing the occurrence of overlooked low-confidence true targets. Furthermore, we propose the Distance-Based Score Enhancement module, which enhances the differentiation between false positives and true positives by adjusting detection scores, thereby improving the effectiveness of the Score Filter. On the NuScenes Val dataset, IMM-MOT outperforms most other single-modal models using 3D point clouds, achieving an AMOTA of 73.8%. Our project is available at https://github.com/Ap01lo/IMM-MOT.
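For readers unfamiliar with the Interacting Multiple Model filter at the core of IMM-MOT, the following is a minimal single-cycle sketch of the standard IMM recursion (mixing, model-matched Kalman filtering, mode-probability update); the motion models, Markov transition matrix, and measurement model are placeholders, and this is not the authors' implementation.

```python
import numpy as np

def imm_step(states, covs, mu, z, models, PI, H, R):
    """One cycle of a standard IMM filter.

    states/covs: lists of per-model state vectors and covariances.
    mu: mode probabilities. z: measurement. models: list of (F, Q) pairs.
    PI[i, j]: probability of switching from model i to model j.
    """
    r = len(models)
    # 1) Mixing: blend the model-conditioned estimates from the previous step.
    c = PI.T @ mu                                   # predicted mode probabilities
    mixed_x, mixed_P = [], []
    for j in range(r):
        w = PI[:, j] * mu / c[j]
        xj = sum(w[i] * states[i] for i in range(r))
        Pj = sum(w[i] * (covs[i] + np.outer(states[i] - xj, states[i] - xj))
                 for i in range(r))
        mixed_x.append(xj)
        mixed_P.append(Pj)
    # 2) Model-matched Kalman predict/update and measurement likelihoods.
    likelihoods = np.zeros(r)
    for j, (F, Q) in enumerate(models):
        x_pred = F @ mixed_x[j]
        P_pred = F @ mixed_P[j] @ F.T + Q
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        innov = z - H @ x_pred
        states[j] = x_pred + K @ innov
        covs[j] = (np.eye(len(x_pred)) - K @ H) @ P_pred
        likelihoods[j] = np.exp(-0.5 * innov @ np.linalg.solve(S, innov)) \
            / np.sqrt(np.linalg.det(2 * np.pi * S))
    # 3) Mode-probability update and combined output estimate.
    mu = c * likelihoods
    mu /= mu.sum()
    x_out = sum(mu[j] * states[j] for j in range(r))
    return states, covs, mu, x_out
```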
|
|
16:45-16:50, Paper ThDT7.2 | |
ELPTNet: An Efficient LiDAR-Based 3D Pedestrian Tracking Network for Autonomous Navigation Social Robots |
|
Guang, Jinzheng | Nankai University |
Cao, Zhenzhong | Nankai University |
Song, Yinuo | Nankai University
Liu, Jingtai | Nankai University |
Keywords: Human Detection and Tracking, Service Robotics, Human-Centered Robotics
Abstract: Autonomous navigation social robots need to track pedestrian movements in real-time with high precision to optimize path planning and avoid collisions. However, the main challenge of pedestrian tracking lies in the significant variations in human posture, which differ from rigid-body structures like vehicles. In this paper, we propose an Efficient LiDAR-based 3D Pedestrian Tracking Network (ELPTNet). First, our ELPTNet employs a 3D object detector to extract directional 3D pedestrian bounding boxes from LiDAR point clouds. Then, our ELPTNet employs a Constant Acceleration (CA) model and prediction confidence for target trajectory prediction. During the data association process, it integrates geometric, appearance, and motion features to enhance the robustness and real-time performance of 3D MOT when targets are temporarily occluded. Experimental results demonstrate that our ELPTNet achieves the highest ranking on the large-scale JRDB dataset for the 3D tracking task, outperforming previous state-of-the-art (SOTA) methods with improvements of 8.4% in MOTA and 6.6% in HOTA. Additionally, our ELPTNet attains an inference speed of 61 frames per second (FPS) on a single CPU. Therefore, our method enables accurate and real-time tracking of multiple pedestrians. The code is publicly available at https://github.com/jinzhengguang/ELPTNet.
|
|
16:50-16:55, Paper ThDT7.3 | |
Dynamicity Adaptation for Multi-Object Tracking and Segmentation: Toward Improved Association Correction |
|
Chen, Longtao | Huaqiao University |
Liao, Guoxing | Huaqiao University |
Lou, Jing | Changzhou Vocational Institute of Mechatronic Technology |
Xu, Fenglei | Suzhou University of Science and Technology |
Hu, Bingwen | Anhui University of Technology |
Chen, Lineng | School of Electronic and Information Engineering Guangxi Normal |
Zeng, Huanqiang | Huaqiao University |
Keywords: Human Detection and Tracking, Computer Vision for Transportation, Visual Tracking
Abstract: Dynamicity is a critical and highly challenging aspect in Multi-Object Tracking and Segmentation (MOTS), significantly impeding the effective integration of diverse association cues. High dynamicity, such as severe occlusion or deformation, can distort appearance cues, leading to inaccurate inter-object relationships and misleading results. Conversely, in low dynamicity states, spatiotemporal consistency of appearance cues aids in recovering object states. To address this issue, we propose a straightforward, effective, and versatile Dynamicity Adaptation for Multi-object Tracking and Segmentation, named HD-Track. First, we leverage the sensitivity of appearance cues to dynamicity through pre-association, capturing dynamic behavior in objects. Second, Dynamicity Adaptation incorporates Dynamicity Selection to identify reliable appearance cues based on pre-association results and Occlusion Dynamicity Fusing to adaptively integrate appearance and motion cues by analyzing historical mask variations. Experiments on MOTS20 and KITTI MOTS datasets demonstrate HD-Track’s robust and reliable performance across diverse scenarios, including varying motion speeds, object categories, and camera perspectives.
|
|
16:55-17:00, Paper ThDT7.4 | |
Opt-In Camera: Person Identification in Video Via UWB Localization and Its Application to Opt-In Systems |
|
Ishige, Matthew | CyberAgent, Inc |
Yoshimura, Yasuhiro | CyberAgent |
Yonetani, Ryo | CyberAgent |
Keywords: Human Detection and Tracking, Surveillance Robotic Systems, Localization
Abstract: This paper presents opt-in camera, a concept of privacy-preserving camera systems capable of recording only specific individuals in a crowd who explicitly consent to be recorded. Our system utilizes a mobile wireless communication tag attached to personal belongings as proof of opt-in and as a means of localizing tag carriers in video footage. Specifically, the on-ground positions of the wireless tag are first tracked over time using the unscented Kalman filter (UKF). The tag trajectory is then matched against visual tracking results for pedestrians found in videos to identify the tag carrier. Technically, we devise a dedicated trajectory matching technique based on constrained linear optimization, as well as a novel calibration technique that handles wireless tag-camera calibration and hyperparameter tuning for the UKF, which mitigates the non-line-of-sight (NLoS) issue in wireless localization. We implemented the proposed opt-in camera system using ultra-wideband (UWB) devices and an off-the-shelf webcam. Experimental results demonstrate that our system can perform opt-in recording of individuals in real time at 10 fps, with reliable identification accuracy in crowds of 8-23 people in a confined space.
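The paper formulates tag-to-pedestrian identification as constrained linear optimization; the hedged sketch below illustrates only the simpler core idea of assigning UWB tag trajectories to visual pedestrian tracks by minimum time-aligned trajectory distance via a Hungarian assignment. The distance cost and array layout are assumptions, not the authors' formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tags_to_tracks(tag_trajs, person_trajs):
    """Assign each UWB tag trajectory to the visual pedestrian track it matches best.

    tag_trajs / person_trajs: arrays of shape (num_items, T, 2) holding
    time-synchronized on-ground (x, y) positions. Returns (tag_index, track_index)
    pairs minimizing the total mean trajectory distance. This only illustrates
    the matching idea; the paper instead solves a constrained linear program
    together with a dedicated tag-camera calibration.
    """
    n_tags, n_tracks = len(tag_trajs), len(person_trajs)
    cost = np.zeros((n_tags, n_tracks))
    for i in range(n_tags):
        for j in range(n_tracks):
            # Mean Euclidean distance between the two time-aligned trajectories.
            cost[i, j] = np.linalg.norm(tag_trajs[i] - person_trajs[j], axis=-1).mean()
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: two tags, three tracked pedestrians, 50 synchronized frames.
rng = np.random.default_rng(0)
tracks = rng.uniform(0, 10, size=(3, 50, 2))
tags = tracks[[2, 0]] + rng.normal(0, 0.1, size=(2, 50, 2))   # noisy UWB positions
print(match_tags_to_tracks(tags, tracks))   # expected: [(0, 2), (1, 0)]
```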
|
|
17:00-17:05, Paper ThDT7.5 | |
Human Arm Pose Estimation with a Shoulder-Worn Force-Myography Device for Human-Robot Interaction |
|
Atari, Rotem | Tel-Aviv University |
Bamani Beeri, Eran | Tel Aviv University |
Sintov, Avishai | Tel-Aviv University |
Keywords: Human Detection and Tracking, Human-Robot Collaboration
Abstract: Accurate human pose estimation is essential for effective Human-Robot Interaction (HRI). By observing a user's arm movements, robots can respond appropriately, whether it's providing assistance or avoiding collisions. While visual perception offers potential for human pose estimation, it can be hindered by factors like poor lighting or occlusions. Additionally, wearable inertial sensors, though useful, require frequent calibration as they do not provide absolute position information. Force-myography (FMG) is an alternative approach where muscle perturbations are externally measured. It has been used to observe finger movements, but its application to full arm state estimation is unexplored. In this letter, we investigate the use of a wearable FMG device that can observe the state of the human arm for real-time applications of HRI. We propose a Transformer-based model to map FMG measurements from the shoulder of the user to the physical pose of the arm. The model is also shown to be transferable to other users with limited decline in accuracy. Through real-world experiments with a robotic arm, we demonstrate collision avoidance without relying on visual perception.
|
|
17:05-17:10, Paper ThDT7.6 | |
PhysioSense: An Open-Source Multi-Modal Monitoring Framework for Human Movement and Behavior Analysis |
|
El Makrini, Ilias | Vrije Universiteit Brussel |
Turcksin, Tom | Vrije Universiteit Brussel |
Incirci, Taner | Kalealtinay Robotics and Automation Inc |
Thiery, Elias | Vrije Universiteit Brussel |
Kindt, Stijn | Vrije Universiteit Brussel (VUB) |
Lovecchio, Rossana | Micro Medical Instruments |
Cao, Hoang-Long | Vrije Universiteit Brussel |
Denayer, Menthy | Vrije Universiteit Brussel |
Lamine, Erard | Flanders Make |
Huysentruyt, Stijn | Flanders Make |
Verstraten, Tom | Vrije Universiteit Brussel |
Vanderborght, Bram | Vrije Universiteit Brussel |
Keywords: Human Detection and Tracking, Human Factors and Human-in-the-Loop, Wearable Robotics
Abstract: Accurate assessment of human movement and behavior is essential in fields such as ergonomics, rehabilitation, and human-robot interaction. This paper presents PhysioSense, an open-source framework for synchronized multi-modal data acquisition and management. Built on the Lab Streaming Layer (LSL), PhysioSense integrates heterogeneous data streams from kinematic, dynamic, and physiological sensors in real time, ensuring millisecond-level synchronization. Unlike general-purpose tools such as LabVIEW, OpenSignals, or ROS, PhysioSense is specifically tailored to human-centric research, offering a streamlined interface for sensor configuration, recording, visualization, and data export. The framework’s modular design supports extensibility and reproducibility, making it suitable for a range of experimental setups. Two case studies—an ergonomics analysis and a drilling task assessment—demonstrate the framework’s capabilities in real-world scenarios. PhysioSense addresses key challenges in multi-sensor integration and paves the way for more accessible and scalable movement analysis in both research and applied settings.
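Since PhysioSense builds on the Lab Streaming Layer, a minimal pylsl round trip illustrates the underlying synchronization mechanism: a producer publishes a timestamped stream and a consumer pulls samples against a shared clock. The stream name, rate, and channel count are placeholders.

```python
# Minimal LSL producer/consumer round trip (illustrative; not the PhysioSense code).
import time
from pylsl import StreamInfo, StreamOutlet, StreamInlet, resolve_stream, local_clock

# Producer: an 8-channel EMG-like stream at 1000 Hz
info = StreamInfo(name='EMG', type='EMG', channel_count=8,
                  nominal_srate=1000, channel_format='float32', source_id='emg01')
outlet = StreamOutlet(info)

# Consumer: resolve the stream and pull timestamped samples
streams = resolve_stream('type', 'EMG')
inlet = StreamInlet(streams[0])

for _ in range(10):
    outlet.push_sample([0.0] * 8)              # would be real sensor data
    sample, ts = inlet.pull_sample(timeout=1.0)
    if sample is not None:
        # offset between the sample timestamp and this process's clock
        print(f'clock offset: {local_clock() - ts:.6f} s', sample)
    time.sleep(0.001)
```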
|
|
17:10-17:15, Paper ThDT7.7 | |
The Impact of Autonomy Levels and System Errors on Cognitive Load and Trust in Human-Robot Collaborative Tasks |
|
García Cárdenas, Juan José | ENSTA - Institute Polytechinique De Paris |
Tapus, Adriana | ENSTA Paris, Institut Polytechnique De Paris |
Keywords: Acceptability and Trust, Telerobotics and Teleoperation, Human Factors and Human-in-the-Loop
Abstract: Trust plays a crucial role in user performance during collaborative human-robot interaction. This study examines how varying levels of autonomy and system errors affect user trust and cognitive load in collaborative tasks between robots and humans. Participants performed a collaborative task using a UR5 robotic arm to place four bottles of different shapes into a box within a three-minute time frame under three conditions: (C1) full manual control by the user, (C2) autonomous operation with few errors—where the robot fails to correctly place one out of four bottles and the user can intervene upon detecting failures, and (C3) autonomous operation with frequent errors—where the robot fails to correctly place three out of four bottles, with user intervention allowed upon failure detection. Physiological indicators such as blink rate, galvanic skin response (GSR), and facial temperature, along with task performance metrics such as success rate and completion time, were tracked. The results showed that participants experienced the highest cognitive load in Condition 1, as indicated by higher NASA-TLX scores, increased blink rates (average of 65 blinks per minute), elevated facial temperatures, and higher GSR readings. Trust levels were lowest in Condition 3, with 74% of participants reporting low trust, highlighting the significant impact of robot reliability on users' trust. A strong negative correlation was found between cognitive load and trust in Condition 3, suggesting that increased cognitive load due to frequent robot errors leads to decreased trust. These findings contribute to understanding how system errors and autonomy levels influence cognitive load and trust in collaborative human-robot tasks. The insights gained can inform the design of collaborative robotic systems that balance autonomy and reliability, enhancing user experience and performance.
|
|
17:15-17:20, Paper ThDT7.8 | |
Training People to Reward Robots |
|
Sun, Endong | King's College London |
Zhu, Yuqing | King's College London |
Howard, Matthew | King's College London |
Keywords: Human Factors and Human-in-the-Loop, Human-Robot Collaboration, Reinforcement Learning
Abstract: Learning from demonstration (LfD) is a technique that allows expert teachers to teach task-oriented skills to robotic systems. However, the most effective way of guiding novice teachers to approach expert-level demonstrations quantitatively for specific teaching tasks remains an open question. To this end, this paper investigates the use of machine teaching (MT) to guide novice teachers to improve their teaching skills based on reinforcement learning from demonstrations (RLfD). The paper reports an experiment in which novices receive MT-derived guidance to train their ability to teach a given motor skill with only 8 demonstrations and generalise this to previously unseen skills. Results indicate that the MT guidance not only enhances robot learning performance by 89% on the training skills but also causes a 70% improvement in robot learning performance on skills not seen by subjects during training. These findings highlight the effectiveness of MT guidance in upskilling human teaching behaviours, ultimately improving demonstration quality in RLfD.
|
|
ThDT8 |
308 |
Human-Centered Robotics 2 |
Regular Session |
|
16:40-16:45, Paper ThDT8.1 | |
Beware of the Tablet: A Dominant Distractor in Human-Robot Interaction |
|
Cheng, Linlin | Vrije Universiteit Amsterdam |
Belopolsky, Artem V. | Vrije Universiteit Amsterdam |
de Bruijn, Mark | University of Massachusetts Lowell |
Hindriks, Koen | Vrije Universiteit Amsterdam |
Keywords: Multi-Modal Perception for HRI, Social HRI, Design and Human Factors
Abstract: The present study aims at investigating how humans engage with common communication modalities—speech, tablet, and gesture—when interacting with a humanoid robot. To explore this, we designed a live interaction experiment using a congruence paradigm, where participants engaged with a robot presenting two out of three modalities simultaneously: one as the primary cue and the other as a distracting cue. We measured participants’ task performance (response time, error rate) and fixation distribution (fixation count and duration proportions) across different roles (primary, distracting, neither) and areas of interest (face, tablet, gesture). Additionally, we compared fixation patterns between the performance and baseline phases. Our findings reveal that while the tablet is the most effective modality for task engagement, it also serves as a strong attentional distractor, dominating gaze allocation regardless of its informational value. This underscores the importance of carefully balancing tablet integration in HRI design. Notably, our results demonstrate that gaze patterns alone do not fully reveal attentional focus, emphasizing the need to consider both overt and covert cognitive processes in multimodal HRI. These insights provide valuable guidelines for designing more effective and engaging human-robot interactions.
|
|
16:45-16:50, Paper ThDT8.2 | |
Can Real-Time Lipreading Improve Speech Recognition? A Systematic Exploration Using Human-Robot Interaction Data |
|
Goetzee, Sander | Vrije Universiteit Van Amsterdam |
Li, Yue | Vrije Universiteit Amsterdam |
Hindriks, Koen | Vrije Universiteit Amsterdam |
Keywords: Multi-Modal Perception for HRI, Recognition, Human-Centered Robotics
Abstract: Speech recognition in Human-Robot Interaction (HRI) typically relies on audio-based Automatic Speech Recognition. However, speech recognition that relies solely on audio faces significant challenges and may perform poorly in noisy environments. One approach to address this is to also use lipreading in combination with traditional speech recognition. Recent work has shown that audiovisual speech recognition (AVSR) can achieve a Word Error Rate (WER) of only 0.9% on the LRS3 dataset. In this paper, we assess the potential of combining audio with lipreading on a social robot platform, Pepper, which has not yet been widely tested for AVSR. Given that prior research has focused on non-robotic domains, it remains unclear whether such models can generalize well to social robot environments. We systematically evaluate and compare the performance of established offline and real-time audiovisual models with their audio-only counterparts. The experiments were conducted in both a controlled laboratory setting and a dynamic and noisy public environment. We evaluated the data using WER and also measured the inference latency of real-time models via Real-Time Factor and Words Per Second rates. The results demonstrate real-time performance for audio-only speech recognition across all metrics and near real-time performance for models that feature lipreading, based on the latency metrics. We also explored factors that might influence the inference performance of these models, to understand how much the video modality contributes beyond audio alone. These factors relate to (1) environmental and temporal variations, (2) model behavior, and (3) implementation choices. Our findings indicate that, for now, the audio-only models outperform the audiovisual models on a social robot platform, in contrast to what has been reported in the benchmarked literature. We conclude that more work is still needed to benefit from lipreading in HRI.
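For reference, the two quantities reported above can be computed as follows; this sketch uses the jiwer package for word error rate and treats the recognizer call as a placeholder, so the strings and timings are illustrative only.

```python
# Illustrative WER and real-time-factor computation (placeholder data, not the paper's code).
import time
import jiwer

reference = "please bring me the red cup from the kitchen"
hypothesis = "please bring me the red cup from kitchen"
wer = jiwer.wer(reference, hypothesis)            # fraction of word errors

audio_duration_s = 3.2
start = time.perf_counter()
# ... run the (audio-only or audiovisual) recognizer on the 3.2 s clip here ...
processing_time_s = time.perf_counter() - start
rtf = processing_time_s / audio_duration_s        # RTF < 1.0 means real-time capable
print(f"WER={wer:.3f}, RTF={rtf:.3f}")
```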
|
|
16:50-16:55, Paper ThDT8.3 | |
Diff-MSM: Differentiable MusculoSkeletal Model for Simultaneous Identification of Human Muscle and Bone Parameters |
|
Zhou, Yingfan | University of Southern Denmark |
Sanderink, Philip | University of Southern Denmark |
Lemming, Sigurd Jager | University of Southern Denmark |
Fang, Cheng | University of Southern Denmark |
Keywords: Modeling and Simulating Humans, Calibration and Identification, Human-Centered Robotics
Abstract: High-fidelity personalized human musculoskeletal models are crucial for simulating realistic behavior of physically coupled human-robot interactive systems and verifying their safety-critical applications in simulations before actual deployment, such as human-robot co-transportation and rehabilitation through robotic exoskeletons. Identifying subject-specific Hill-type muscle model parameters and bone dynamic parameters is essential for a personalized musculoskeletal model, but very challenging due to the difficulty of measuring the internal biomechanical variables in vivo directly, especially the joint torques. In this paper, we propose using a Differentiable MusculoSkeletal Model (Diff-MSM) to simultaneously identify its muscle and bone parameters with an end-to-end automatic differentiation technique, differentiating from the measurable muscle activation, through the joint torque, to the resulting observable motion, without the need to measure the internal joint torques. Through extensive comparative simulations, the results show that our proposed method significantly outperformed the state-of-the-art baseline methods, especially in terms of accurate estimation of the muscle parameters (e.g., with initial guesses sampled from a normal distribution whose mean is the ground truth and whose standard deviation is 10% of the ground truth, the estimated values reach an average percentage error as low as 0.05%). In addition to human musculoskeletal modeling and simulation, the new parameter identification technique with the Diff-MSM has great potential to enable new applications in muscle health monitoring, rehabilitation, and sports science.
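A toy sketch of the general idea (a drastic simplification, not the authors' model): make the forward musculoskeletal simulation differentiable and recover an unknown muscle parameter by gradient descent on the motion error, without ever measuring joint torque. The single-joint dynamics, moment arm, and optimizer settings are assumptions.

```python
# Differentiable single-joint toy model: identify maximal muscle force from motion only.
import torch

def forward_motion(activation, f_max, inertia=0.05, moment_arm=0.03, dt=0.01):
    q, dq = torch.zeros(()), torch.zeros(())
    traj = []
    for a in activation:
        torque = f_max * a * moment_arm     # activation -> (unmeasured) joint torque
        dq = dq + (torque / inertia) * dt
        q = q + dq * dt
        traj.append(q)
    return torch.stack(traj)                # observable joint motion

activation = torch.rand(100)                                    # "measured" activation
q_obs = forward_motion(activation, f_max=torch.tensor(800.0))   # simulated observed motion

f_max_hat = torch.tensor(300.0, requires_grad=True)             # unknown muscle parameter
opt = torch.optim.Adam([f_max_hat], lr=5.0)
for _ in range(300):
    opt.zero_grad()
    loss = ((forward_motion(activation, f_max_hat) - q_obs) ** 2).mean()
    loss.backward()                          # gradient flows through the whole simulation
    opt.step()
print(f_max_hat.item())                      # converges toward the true 800 N
```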
|
|
16:55-17:00, Paper ThDT8.4 | |
WebRTC and 5G Based Remote Control System for a Vascular Intervention Robot |
|
Cao, Sheng | Beijing Institute of Technology |
Guo, Shuxiang | Southern University of Science and Technology |
Guo, Jian | Shenzhen Institute of Advanced Biomedical Robot Co., Ltd |
Wang, Jian | Shenzhen Institute of Advanced Biomedical Robot Co., Ltd |
Junbo, Ge | Zhongshan Hospital |
Keywords: Medical Robots and Systems, Telerobotics and Teleoperation, Control Architectures and Programming
Abstract: Cardiovascular and cerebrovascular diseases are significant health issues that threaten human life. They typically develop insidiously and progress gradually, but when an event occurs, the consequences can be severe. These conditions often manifest suddenly and acutely, necessitating prompt treatment due to a very short therapeutic window. Vascular interventional surgery is the preferred treatment because of its rapid efficacy and brief recovery time. However, the procedure demands a high level of expertise and exposes physicians to radiation, and there is an uneven distribution of skilled doctors across different regions. The advent of vascular interventional surgical robots not only protects physicians from radiation exposure and improves procedural accuracy and safety but also enables remote interventions. Based on a robotic platform for vascular interventions, our team has developed a remote vascular interventional system that leverages WebRTC and 5G networks. This system ensures that the transmission latency for control commands and imaging meets clinical requirements, and our remote clinical experiments have demonstrated its feasibility and safety.
|
|
17:00-17:05, Paper ThDT8.5 | |
A Multi-Modal Hand Imitation Dataset for Dexterous Hand |
|
Wang, Shaochen | University of Science and Technology of China |
Wu, Qilin | Jiang Xi Normal University |
Chen, Kang | University of Science and Technology of China |
Huang, Qing | Jiangxi Normal University |
Cheng, Zhuo | Jiangxi Normal University |
Xia, Beihao | Huazhong University of Science and Technology |
Keywords: Datasets for Human Motion, Data Sets for Robotic Vision, Multifingered Hands
Abstract: Multimodal data is indispensable for advancing imitation learning, particularly in the context of dexterous hands. However, existing datasets predominantly rely on single-modality inputs, such as RGB images, which inherently lack the capacity to capture the spatial and temporal dynamics essential for achieving human-like dexterity. To address this limitation, we introduce Multi-Modal Dex, a dataset that integrates multimodal sensory data to enable the effective learning of dexterous skills from human demonstrations. By combining visual, point cloud, and kinematic modalities, our dataset provides a richer representation of hand interactions, thereby facilitating a more nuanced understanding of dexterous imitation. Our framework leverages neural rendering and kinematic optimization to align human and robotic hand poses in a shared canonical space, enabling geometrically consistent skill transfer. Finally, we analyze the dataset’s potential to advance dexterous robots in perception, imitation learning, and real-world dexterous skill transfer.
|
|
ThDT9 |
309 |
Visual Learning |
Regular Session |
|
16:40-16:45, Paper ThDT9.1 | |
ETA: Learning Optical Flow with Efficient Temporal Attention |
|
Wang, Bo | National University of Defense Technology |
Sun, Zhenping | National University of Defense Technology |
Yu, Yang | National University of Defense Technology |
Liu, Li | National University of Defense Technology |
Li, Jian | National University of Defense Technology |
Hu, Dewen | National University of Defense Technology |
Keywords: Visual Tracking, Deep Learning for Visual Perception
Abstract: Considering the potential of using multi-frame information to solve the occlusion problem, we introduce a novel idea of multi-frame information integration, which uses the attention mechanism to fuse the temporal information from the previous frame. The idea can effectively improve the estimation accuracy in occluded regions and optimize the inference speed under multi-frame settings. Meanwhile, we suggest the concept of attention confidence to provide an explicit value criterion for the model to utilize useful attention information more efficiently. Furthermore, we propose an Efficient Temporal Attention network (ETA), which achieves promising results on Sintel and KITTI benchmarks, especially with a 9.4% error reduction compared to the baseline method GMA on Sintel (test) Clean.
|
|
16:45-16:50, Paper ThDT9.2 | |
Improved 2D Hand Trajectory Prediction with Multi-View Consistency |
|
Ma, Junyi | Beijing Institute of Technology |
Zhang, Erhang | SHANDONG University |
Xu, Jingyi | Shanghai Jiao Tong University |
Chen, Xieyuanli | National University of Defense Technology |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Visual Learning, Deep Learning Methods
Abstract: Forecasting how human hands would move around target objects on egocentric videos can provide prior knowledge to enhance the path planning capabilities of service robots and assistive wearable devices. During the hand-object interaction process, head movements always occur concurrently to provide observations for the interaction scene from different egocentric views. Although some prior works have successfully integrated head motion information into hand trajectory prediction (HTP), they basically overlook the multi-view consistency (MVC) inherent in headset camera egomotion. We argue that multi-view consistency reveals geometric and semantic relationships during hand-object interaction, and can be regarded as additional supervision signals for predicting more realistic hand trajectories. Therefore, in this work, we propose a novel learning scheme dubbed EER to improve diffusion-based 2D hand trajectory prediction methods, which involves exploiting the geometric consistency, enhancing the multi-canvas consistency, and reconstructing the semantic consistency inherent in MVC. The experimental results show that our proposed EER scheme significantly improves the prediction accuracy of existing diffusion-based 2D HTP methods on the publicly available datasets. We will release the code as open-source at https://github.com/IRMVLab/EER-HTP.
|
|
16:50-16:55, Paper ThDT9.3 | |
Self-Supervised Geometry-Guided Initialization for Robust Monocular Visual Odometry |
|
Kanai, Takayuki | Toyota Motor Corporation |
Vasiljevic, Igor | Toyota Research Institute |
Guizilini, Vitor | Toyota Research Institute |
Shintani, Kazuhiro | Toyota Motor Corporation |
Keywords: Visual Learning, Deep Learning for Visual Perception, SLAM
Abstract: Monocular visual odometry is a key technology in various autonomous systems. Traditional feature-based methods suffer from failures due to poor lighting, insufficient texture, and large motions. In contrast, recent learning-based dense SLAM methods exploit iterative dense bundle adjustment to address such failure cases, and achieve robust and accurate localization in a wide variety of real environments, without depending on domain-specific supervision. However, despite its potential, the methods still struggle with scenarios involving large motion and object dynamics. In this study, we diagnose key weaknesses in a popular learning-based dense SLAM model (DROID-SLAM) by analyzing major failure cases on outdoor benchmarks and exposing various shortcomings of its optimization process. We then propose the use of self-supervised priors leveraging a frozen large-scale pre-trained monocular depth estimator to initialize the dense bundle adjustment process, leading to robust visual odometry without the need to fine-tune the SLAM backbone. Despite its simplicity, the proposed method demonstrates significant improvements on KITTI odometry, as well as the challenging DDAD benchmark.
|
|
16:55-17:00, Paper ThDT9.4 | |
DriveBLIP2: Attention-Guided Explanation Generation for Complex Driving Scenarios |
|
Ling, Shihong | University of Pittsburgh |
Wan, Yue | University of Pittsburgh |
Jia, Xiaowei | University of Pittsburgh |
Du, Na | University of Pittsburgh |
Keywords: Visual Learning, Multi-Modal Perception for HRI, Human-Robot Collaboration
Abstract: This paper introduces a new framework, DriveBLIP2, built upon the BLIP2-OPT architecture, to generate accurate and contextually relevant explanations for emerging driving scenarios. While existing vision-language models perform well in general tasks, they encounter difficulties in understanding complex, multi-object environments, particularly in real-time applications such as autonomous driving, where the rapid identification of key objects is crucial. To address this limitation, an Attention Map Generator is proposed to highlight significant objects relevant to driving decisions within critical video frames. By directing the model's focus to these key regions, the generated attention map helps produce clear and relevant explanations, enabling drivers to better understand the vehicle's decision-making process in critical situations. Evaluations on the DRAMA dataset reveal significant improvements in explanation quality, as indicated by higher BLEU, ROUGE, CIDEr, and SPICE scores compared to baseline models. These findings underscore the potential of targeted attention mechanisms in vision-language models for enhancing explainability in real-time autonomous driving.
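A minimal sketch of querying the underlying BLIP2-OPT backbone for a driving-frame explanation via HuggingFace Transformers. This shows only base-model usage; the paper's Attention Map Generator is not included, and the image path and prompt are placeholders.

```python
# Base BLIP2-OPT usage on a driving frame (the attention-map guidance from the paper is omitted).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to("cuda")

frame = Image.open("driving_frame.jpg")            # a critical video frame (placeholder path)
prompt = "Question: what should the driver pay attention to? Answer:"
inputs = processor(images=frame, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))
```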
|
|
17:00-17:05, Paper ThDT9.5 | |
AlignCAPE: Support and Query Feature Aligning for Category-Agnostic Pose Estimation |
|
Chen, Zhuoran | Beijing University of Posts and Telecommunications |
Tang, Jin | Beiing University of Posts and Telecommunications |
Xu, Guoliang | Beijing University of Posts and Telecommunications |
Zhang, Shaojie | Beijing University of Posts and Telecommunications |
Zhang, Zhicheng | Beijing University of Posts and Telecommunications |
Yin, Jianqin | Beijing University of Posts and Telecommunications |
Keywords: Visual Learning, Recognition, Deep Learning Methods
Abstract: Recent advancements in category-agnostic pose estimation have focused on developing a unified model capable of localizing keypoint coordinates across arbitrary categories, which enables robots to accurately interact with diverse objects by understanding their poses. While existing methods predominantly concentrate on local features surrounding the keypoints of the support image, they often overlook the importance of global features, leading to potential misalignment between the support and query images. To address the inherent conflicts between the two images, we propose AlignCAPE, a novel approach designed to mitigate such misalignment and enhance model performance. Our method formulates a two-stage pipeline, generating initial proposals in the first stage, followed by a second stage that refines them iteratively. Specifically, we introduce two modules, a Feature Alignment Module (FAM) and a Keypoint Perception Module (KPM). FAM uses a bidirectional cross-attention operation to align the support and query image features, thereby compensating for the limitations of previous methods. KPM employs a self-attention mechanism to capture the interactions among keypoints, facilitating keypoint localization in the query image. Experiments on the MP-100 benchmark demonstrate that our method outperforms the widely used baseline model in CAPE by 0.68% in the PCK@0.2 metric under the 1-shot setting.
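An illustrative sketch (not the authors' code) of a bidirectional cross-attention block aligning support and query feature tokens, in the spirit of the Feature Alignment Module described above; dimensions and token counts are assumptions.

```python
# Toy bidirectional cross-attention between support and query token features.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.s2q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q2s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, support_feat, query_feat):        # (B, N, dim) token sequences
        # support tokens attend to query tokens, and vice versa
        support_aligned, _ = self.s2q(support_feat, query_feat, query_feat)
        query_aligned, _ = self.q2s(query_feat, support_feat, support_feat)
        return support_feat + support_aligned, query_feat + query_aligned

fam = BidirectionalCrossAttention()
s, q = torch.randn(2, 100, 256), torch.randn(2, 196, 256)
s_out, q_out = fam(s, q)                                 # aligned support/query features
```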
|
|
17:05-17:10, Paper ThDT9.6 | |
Efficient and Accurate Low-Resolution Transformer Tracking |
|
Dong, Shaohua | University of North Texas |
Feng, Yunhe | University of North Texas |
Liang, James | Rochester Institute of Technology |
Yang, Qing | University of North Texas |
Lin, Yuewei | Brookhaven National Laboratory |
Fan, Heng | University of North Texas |
Keywords: Visual Tracking, Deep Learning for Visual Perception
Abstract: High-performance Transformer trackers have exhibited excellent results, yet they often bear a heavy computational load. Observing that a smaller input can immediately and conveniently reduce computations without changing the model, an easy solution is to adopt a low-resolution input for efficient Transformer tracking. Albeit faster, this substantially hurts tracking accuracy due to the information loss in low-resolution tracking. In this paper, we aim to mitigate such information loss to boost the performance of low-resolution Transformer tracking via dual knowledge distillation from a frozen high-resolution (but not larger) Transformer tracker. The core lies in two simple yet effective distillation modules, including query-key-value knowledge distillation (QKV-KD) and discrimination knowledge distillation (Disc-KD), across resolutions. The former, from the global view, allows the low-resolution tracker to inherit features and interactions from the high-resolution tracker, while the latter, from the target-aware view, enhances the target-background distinguishing capacity via imitating discriminative regions from its high-resolution counterpart. With dual knowledge distillation, our Low-Resolution Transformer Tracker, dubbed LoReTrack, enjoys not only high efficiency owing to reduced computation but also enhanced accuracy by distilling knowledge from the high-resolution tracker. In extensive experiments, LoReTrack with a 256x256 resolution consistently improves the baseline with the same resolution, and shows competitive or better results compared to the 384x384 high-resolution Transformer tracker, while running 52% faster and saving 56% MACs. Moreover, LoReTrack is resolution-scalable. With a 128x128 resolution, it runs at 25 fps on a CPU with SUC scores of 64.9%/46.4% on LaSOT/LaSOText, surpassing other CPU real-time trackers. Code is released at https://github.com/ShaohuaDong2021/LoReTrack.
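An illustrative sketch of cross-resolution feature distillation in the spirit of QKV-KD: the low-resolution student imitates the frozen high-resolution teacher's token features after resizing the teacher's token map. The shapes and the plain MSE objective are assumptions, not the paper's exact formulation.

```python
# Toy cross-resolution token-feature distillation loss (teacher is frozen).
import torch
import torch.nn.functional as F

def qkv_distill_loss(student_tokens, teacher_tokens, student_hw, teacher_hw):
    """student_tokens: (B, Hs*Ws, C); teacher_tokens: (B, Ht*Wt, C)."""
    B, _, C = teacher_tokens.shape
    t = teacher_tokens.transpose(1, 2).reshape(B, C, *teacher_hw)       # to (B, C, Ht, Wt)
    t = F.interpolate(t, size=student_hw, mode='bilinear', align_corners=False)
    t = t.flatten(2).transpose(1, 2)                                    # back to (B, Hs*Ws, C)
    return F.mse_loss(student_tokens, t.detach())                       # no gradient to teacher

loss = qkv_distill_loss(torch.randn(2, 16 * 16, 256), torch.randn(2, 24 * 24, 256),
                        student_hw=(16, 16), teacher_hw=(24, 24))
```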
|
|
17:15-17:20, Paper ThDT9.8 | |
LLplace: Embodied 3D Indoor Layout Synthesis Framework with Large Language Model |
|
Yang, Yixuan | SUStech |
Lu, Junru | University of Warwick |
Zhao, Zixiang | Xi'an Jiaotong University |
Luo, Zhen | Southern University of Science and Technology |
Dong, Wanxi | Southern University of Science and Technology |
Sanchez, Victor | University of Warwick |
Zheng, Feng | SUSTech |
Keywords: Visual Learning, AI-Based Methods, Big Data in Robotics and Automation
Abstract: Designing 3D indoor layouts is a crucial task with significant applications in embodied robot intelligence, virtual reality, and interior design. Existing methods for 3D layout design either rely on diffusion models, which utilize spatial relationship priors, or heavily leverage the inferential capabilities of proprietary Large Language Models (LLMs), which require extensive prompt engineering and in-context exemplars via black-box trials. These methods often face limitations in generalization and dynamic scene editing. In this paper, we introduce LLplace, a novel 3D indoor scene layout designer based on a lightweight, fine-tuned, open-source LLM (Llama3). LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation based solely on user inputs specifying the room type and desired objects. We curated a new dialogue dataset based on the 3D-Front dataset, expanding the original data volume and incorporating dialogue data for adding and removing objects. This dataset can enhance the LLM’s spatial understanding. Furthermore, through dialogue, LLplace activates the LLM's capability to understand 3D layouts and perform dynamic scene editing, enabling the addition and removal of objects. Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions.
|
|
ThDT10 |
310 |
Visual Servoing and Application |
Regular Session |
|
16:40-16:45, Paper ThDT10.1 | |
Pathfinder for Low-Altitude Aircraft with Binary Neural Network |
|
Yin, Kaijie | University of Macau |
Gao, Tian | Nanjing University of Science and Technology |
Kong, Hui | University of Macau |
Keywords: Visual Learning, Computer Vision for Automation, AI-Enabled Robotics
Abstract: A prior global topological map (e.g., OpenStreetMap, OSM) can boost the performance of autonomous mapping by a ground mobile robot. However, the prior map is usually incomplete because some paths lack labels. To solve this problem, this paper proposes an OSM maker using airborne sensors carried by low-altitude aircraft, where the core of the OSM maker is a novel efficient pathfinder approach based on LiDAR and camera data, i.e., a binary dual-stream road segmentation model. Specifically, multi-scale feature extraction based on the UNet architecture is implemented for images and point clouds. To reduce the effect caused by the sparsity of the point cloud, an attention-guided gated block is designed to integrate image and point-cloud features. To optimize the model for edge deployment and significantly reduce its storage footprint and computational demands, we propose a binarization pipeline for each model component, including a variant of the vision transformer (ViT) architecture as the encoder of the image branch, and new focal and perception losses to optimize model training. The experimental results on two datasets demonstrate that our pathfinder method achieves SOTA accuracy with high efficiency in finding paths from low-altitude airborne sensors, and we can create complete OSM prior maps based on the segmented road skeletons. Code and data are available at: https://github.com/IMRL/Pathfinder.
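A minimal sketch of the standard building block behind such binary networks: weight and activation binarization with a straight-through estimator. This is generic PyTorch, not the released code at the repository above.

```python
# Generic binarized convolution with a straight-through estimator (STE).
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)                    # +1 / -1 forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # pass gradients only where |x| <= 1 (hard-tanh clipping)
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        x_bin = BinarizeSTE.apply(x)
        return nn.functional.conv2d(x_bin, w_bin, self.bias, self.stride,
                                    self.padding, self.dilation, self.groups)

layer = BinaryConv2d(3, 16, kernel_size=3, padding=1)
out = layer(torch.randn(1, 3, 64, 64))          # binarized forward pass, full-precision gradients
```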
|
|
16:45-16:50, Paper ThDT10.2 | |
Multi-View Normal and Distance Guidance Gaussian Splatting for Surface Reconstruction |
|
Jia, Bo | Beijing Information Science and Technology University |
Guo, Yanan | Beijing Information Science and Technology University |
Chang, Ying | Aerospace Information Research Institute, CAS |
Zhang, Benkui | Aerospace Information Research Institute, CAS |
Xie, Ying | Beijing Information Science and Technology University |
Du, Kangning | Beijing Information Science and Technology University |
Cao, Lin | Beijing Information Science and Technology University |
Keywords: Visual Learning, Computer Vision for Manufacturing, View Planning for SLAM
Abstract: 3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS.
|
|
16:50-16:55, Paper ThDT10.3 | |
Clevis and Tenon Assembly Using Visual Guiding Fields |
|
Chabert, Gilles | IRT Jules Verne |
Chaumette, Francois | Inria Center at University of Rennes |
Suarez Roos, Adolfo | IRT Jules Verne |
Keywords: Visual Servoing
Abstract: This paper presents a visual servoing strategy for a clevis and tenon joint assembly as typically performed in aircraft manufacturing. The desired velocity in the image acquired by the cameras observing the parts is extracted from a vector field that guides the motion of the moving part so that it follows an accurate 3D straight line trajectory. The vector field is designed so that it is continuous and globally exponentially stable whatever the initial configuration. The strategy has been successfully implemented and validated on a full-scale demonstrator, with coarse camera localization, while ensuring an assembly without any collision thanks to a maximal error less than 1 mm along a 10 cm trajectory.
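A toy sketch of a guiding vector field that drives a point exponentially onto a straight line and feeds it along that line, which is the geometric idea behind the strategy above; the gains, line definition, and Euclidean (rather than image-space) formulation are illustrative assumptions.

```python
# Toy guiding field toward and along a 3D straight line (illustrative, not the paper's field).
import numpy as np

p0 = np.array([0.0, 0.0, 0.0])           # a point on the desired insertion line
d = np.array([1.0, 0.0, 0.0])            # unit direction of the line
k_perp, v_along = 2.0, 0.05              # convergence gain, feed speed [m/s]

def guiding_field(p):
    e = p - p0
    e_perp = e - np.dot(e, d) * d         # error component orthogonal to the line
    return -k_perp * e_perp + v_along * d # converge to the line, then move along it

# integrate the field from an arbitrary start pose of the moving part
p, dt = np.array([0.05, 0.03, -0.02]), 0.01
for _ in range(1000):
    p = p + guiding_field(p) * dt         # distance to the line decays exponentially
```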
|
|
16:55-17:00, Paper ThDT10.4 | |
Optimization Based Human-Guided Variable-Stiffness Visual Impedance Control for Contact-Rich Tasks |
|
Jiang, Jiao | Hunan University |
Wang, Yaonan | Hunan University |
Jiang, Yiming | Hunan University |
Zeng, Danping | Hunan University |
Zeng, Chao | University of Liverpool |
Yang, Chenguang | University of Liverpool |
Zhang, Hui | Hunan University |
Keywords: Visual Servoing, Compliance and Impedance Control, Force Control
Abstract: In contact-rich tasks such as polishing and drilling, inevitable physical interactions often lead to task deviations due to interference, typically resulting in excessive contact forces and eventual task failure. To tackle these challenges, we propose an innovative human-guided visual-impedance control framework. Specifically, we first introduce an interaction model in image feature space, which models the dynamics of human-robot-environment interactions. Subsequently, human operation skills are characterized as human-guided wrenches, which act on visual features through a projection matrix, thus integrating human-guided wrenches with the visual-impedance interaction dynamics. Finally, leveraging this framework, we develop a novel variable-stiffness visual-impedance control strategy. The impedance parameters are optimized online via a Quadratic Program, ensuring that the end-tool contact force converges to the desired value while adhering to safety constraints. The validity of the proposed framework was established through polishing experiments.
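An illustrative sketch of the online stiffness optimization step: a small quadratic program (solved here with cvxpy) that keeps the commanded spring force close to a desired contact force while respecting stiffness bounds. The cost, constraints, and numbers are assumptions, not the paper's exact program.

```python
# Toy QP: choose diagonal impedance stiffness so the spring force tracks a desired contact force.
import numpy as np
import cvxpy as cp

e = np.array([0.004, -0.002, 0.006])      # current pose error [m]
f_des = np.array([0.0, 0.0, 15.0])        # desired contact force [N]
k = cp.Variable(3, nonneg=True)           # diagonal stiffness to optimize [N/m]

f_cmd = cp.multiply(k, e)                 # spring force produced by the impedance law
objective = cp.Minimize(cp.sum_squares(f_cmd - f_des) + 1e-4 * cp.sum_squares(k))
constraints = [k >= 100, k <= 5000]       # safety bounds on stiffness
cp.Problem(objective, constraints).solve()
print(k.value)                            # stiffness to apply at this control step
```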
|
|
17:00-17:05, Paper ThDT10.5 | |
A No-Code Approach for Intuitive Robot Programming for Process-Aligned Surface Processing |
|
Halim, Jayanto | Fraunhofer Institute for Machine Tools and Forming Technology |
Bdiwi, Mohamad | Fraunhofer Institute for Machine Tools and Forming Technology IW |
Ihlenfeldt, Steffen | Fraunhofer Institute for Machine Tools and Forming Technology IW |
Keywords: Visual Learning, Computer Vision for Manufacturing, Intelligent and Flexible Manufacturing
Abstract: This work introduces a novel method for the generation of process-aligned robotic pathways specifically designed for surface processing applications. The proposed approach integrates the interpretation of sensor data, computer vision algorithms, and process knowledge modeling to address the complexities inherent in robotic programming. To mitigate programming challenges, the method incorporates intuitive interaction techniques, including hand gestures and human-computer interaction (HCI), thereby facilitating the efficient generation of robotic paths. Additionally, it augments the user's teaching experience by enabling the seamless deployment of the methodology in serial production settings, accommodating the variability of both workpieces and environmental conditions. The proposed framework ensures smooth integration of robotic systems into complex workflows by aligning robotic paths with the unique requirements of surface processing tasks.
|
|
17:05-17:10, Paper ThDT10.6 | |
Human-Robot Shared Visual Servoing Based on Game Theory |
|
Fang, Zitai | Shanghai Jiao Tong University |
Cao, Chong | Shanghai Jiao Tong University |
Han, Lijun | Shanghai Jiao Tong University |
Keywords: Visual Servoing, Physical Human-Robot Interaction, Human-Robot Collaboration
Abstract: Human-robot shared visual servoing systems can combine the precise control ability of the robot and the human decision-making ability. However, integrating human input into such systems remains a challenging endeavor. This work studies the robot visual servoing control system in the human-robot shared environment and proposes a human-robot shared visual servoing framework based on game theory. Game theory is used to model the relationship between humans and robots. According to the observation of human input, the human intention is adaptively estimated using a radial basis function neural network (RBFNN), and the robot control objective is dynamically adjusted to realize human-robot coordination. The Lyapunov theory is used to prove the stability of the system. Experiments are conducted to verify the effectiveness of the proposed method.
|
|
17:10-17:15, Paper ThDT10.7 | |
DR-MPC: Disturbance-Resilient Model Predictive Visual Servoing Control for Quadrotor UAV Pipeline Inspection |
|
Li, Wen | Southeast University |
Su, Jinya | Southeast University |
Liu, Cunjia | Loughborough University |
Chen, Wen-Hua | Loughborough University |
Li, Shihua | Southeast University |
Keywords: Visual Servoing, Optimization and Optimal Control, Robust/Adaptive Control
Abstract: Unmanned Aerial Vehicles (UAVs) are gaining attention for inspections due to their improved safety, efficiency, and accuracy, alongside reduced costs and environmental risks. Visual servoing is crucial for autonomous UAV flight in GPS-degraded environments, guiding the UAV by minimizing errors between observed and desired visual features. This study focuses on Image-Based Visual Servoing (IBVS) control for quadrotor UAVs under complex dynamics and environmental disturbances. A nonlinear model predictive control (MPC) framework is first integrated with visual servoing to handle dynamics nonlinearity, control optimality, and constraints. To address uncertainties and disturbances, a Generalized Extended State Observer (GESO) is incorporated into the MPC, forming the Disturbance-Resilient (DR-) MPC. The GESO estimates the lumped disturbance to improve model predictions within the MPC horizon. The proposed algorithm is validated in a realistic Gazebo environment for UAV pipeline inspection in 3D scenarios, showing better control accuracy and reduced inspection time compared to three baseline methods: IBVS, IBVS-MPC(K) with kinematics, and IBVS-MPC(D) with dynamics.
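A toy sketch of an extended state observer estimating a lumped disturbance acceleration on one translational axis, the role the GESO plays before feeding the MPC prediction model; the gains, mass, and simulated disturbance are illustrative assumptions.

```python
# Toy extended state observer: [position, velocity, lumped disturbance acceleration].
import numpy as np

dt, m = 0.01, 1.2                          # step [s], assumed UAV mass [kg]
x_hat = np.zeros(3)
L = np.array([30.0, 300.0, 1000.0])        # gains placing observer poles at s = -10 (triple)

def eso_step(x_hat, p_meas, u_thrust):
    err = p_meas - x_hat[0]
    dx = np.array([x_hat[1],
                   u_thrust / m + x_hat[2],
                   0.0]) + L * err
    return x_hat + dx * dt

# simulate a constant wind-like disturbance of -0.8 m/s^2 acting on the UAV
p, v = 0.0, 0.0
for _ in range(2000):
    u = 0.5                                 # thrust command [N], held constant
    a = u / m - 0.8                         # true acceleration including disturbance
    v += a * dt
    p += v * dt
    x_hat = eso_step(x_hat, p, u)
print(x_hat[2])                             # converges near the true -0.8 m/s^2
```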
|
|
17:15-17:20, Paper ThDT10.8 | |
Data-Driven Visual Servoing of Flexible Continuum Robots in Constrained Environments |
|
Chen, Wei | The Chinese University of Hong Kong |
Wu, Haiwen | The Chinese University of Hong Kong |
Dong, Xiyue | Hong Kong Center for Logistics Robotics |
Yang, Bohan | The Chinese University of Hong Kong |
Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Visual Servoing, Modeling, Control, and Learning for Soft Robots, Model Learning for Control
Abstract: Flexible continuum robots operating in constrained and dynamic environments face significant challenges, especially when interacting with uncertain and potentially unknown conditions. Traditional model-based methods face significant difficulties due to the inherent nonlinearities and uncertainties in robot dynamics, as well as the complexities introduced by environmental interactions. This work presents a new data-driven, model-free control strategy for flexible continuum robots operating in constrained environments, leveraging Lie bracket approximations to achieve effective regulation. The method enables effective visual servoing without requiring explicit kinematic or dynamic models, making it highly adaptable to diverse scenarios where environmental constraints and robot deformation impact system performance. Additionally, it does not rely on initial state estimation, further enhancing its suitability for dynamic, uncertain environments. The effectiveness of the proposed method is validated through simulations and experiments, showing enhanced robustness and adaptability in real-time control scenarios.
|
|
ThDT11 |
311A |
Physically Assistive Devices |
Regular Session |
Chair: Ang, Wei Tech | Nanyang Technological University |
|
16:40-16:45, Paper ThDT11.1 | |
Assisting Gait Stability in Walking Aid Users Exploiting Biomechanical Variables Correlation |
|
Fortuna, Andrea | Politecnico Di Milano |
Lorenzini, Marta | Istituto Italiano Di Tecnologia |
Cho, Younggeol | Istituto Italiano Di Tecnologia (IIT) |
Arbaud, Robin | HRI2 Lab., Istituto Italiano Di Tecnologia ; Dept. of Informatic |
Castiglia, Stefano Filippo | Department of Medico-Surgical Sciences and Biotechnologies, "Sap |
Serrao, Mariano | Sapienza University of Rome |
Ranavolo, Alberto | INAIL |
De Momi, Elena | Politecnico Di Milano |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Physically Assistive Devices, Human and Humanoid Motion Analysis and Synthesis, Sensor-based Control
Abstract: Walking aids for individuals with musculoskeletal frailty or motor disabilities must ensure adequate physical support and assistance to their users. To this end, sensor-enabled human state monitoring and estimation are crucial. This paper proposes an innovative approach to assessing users' stability while walking with WANDER, a novel gait assistive device, by exploiting the correlation between the eXtrapolated Center of Mass (XCoM) and the Base of Support (BoS) edges. First, the soundness of this metric in monitoring gait stability is proven. Experiments on 25 healthy individuals show that the median value of Pearson's correlation coefficient (p-value < 0.05) remained high during the forward walk for all subjects. Next, a correlation-based variable admittance (CVA) controller is implemented, whose parameters are tuned to physically support users when a gait perturbation is detected (i.e., low values of Pearson's correlation coefficient). To validate this approach, 13 healthy subjects were asked to compare our controller with a force threshold-based (FVA) one. The CVA controller's performance in discriminating stable and perturbed gait conditions showed a high sensitivity value, comparable to FVA, and improved performance in terms of specificity. The number of false and missed detections of gait perturbation was considerably reduced, independently of walking speed, exhibiting a higher level of safety and smoothness compared to the FVA controller. Overall, the outcome of this study gives promising evidence of the proposed metric's capability in identifying user stability and triggering WANDER's assistance.
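A minimal sketch of the stability cue used above: the extrapolated centre of mass XCoM = CoM + v_CoM / omega0 with omega0 = sqrt(g / l), correlated over a sliding window with the anterior base-of-support boundary. The signals, window length, and leg length below are synthetic placeholders.

```python
# Toy XCoM computation and sliding-window Pearson correlation with the BoS front edge.
import numpy as np
from scipy.stats import pearsonr

g, leg_length = 9.81, 0.95
omega0 = np.sqrt(g / leg_length)

t = np.linspace(0, 5, 500)
com = 1.1 * t + 0.02 * np.sin(2 * np.pi * 1.8 * t)                 # forward CoM position [m]
v_com = np.gradient(com, t)
xcom = com + v_com / omega0                                         # extrapolated centre of mass
bos_front = 1.1 * t + 0.15 + 0.05 * np.sin(2 * np.pi * 1.8 * t)     # anterior BoS edge [m]

window = 100                                                        # ~1 s sliding window
r, _ = pearsonr(xcom[-window:], bos_front[-window:])
print(f"Pearson r over last window: {r:.2f}")                       # a low r would trigger assistance
```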
|
|
16:45-16:50, Paper ThDT11.2 | |
SkinGrip: An Adaptive Soft Robotic Manipulator with Capacitive Sensing for Whole-Limb Bed Bathing Assistance |
|
Liu, Fukang | Georgia Institute of Technology |
Puthuveetil, Kavya | Carnegie Mellon University |
Padmanabha, Akhil | Carnegie Mellon University |
Khokar, Karan | Carnegie Mellon University |
Temel, Zeynep | Carnegie Mellon University |
Erickson, Zackory | Carnegie Mellon University |
Keywords: Physically Assistive Devices, Physical Human-Robot Interaction, Soft Robot Applications
Abstract: Robotics presents a promising opportunity for enhancing bathing assistance, potentially to alleviate labor shortages and reduce care costs, while offering consistent and gentle care for individuals with physical disabilities. However, ensuring flexible and efficient cleaning of the human body poses challenges as it involves direct physical contact between the human and the robot, and necessitates simple, safe, and effective control. In this paper, we introduce a soft, expandable robotic manipulator with embedded capacitive proximity sensing arrays, designed for safe and efficient bed bathing assistance. We conduct a thorough evaluation of our soft manipulator, comparing it with a baseline rigid end effector in a human study involving 12 participants across 96 bathing trials. Our soft manipulator achieves an average cleaning effectiveness of 88.8% on arms and 81.4% on legs, far exceeding the performance of the baseline. Participant feedback further validates the manipulator's ability to maintain safety, comfort, and thorough cleaning.
|
|
16:50-16:55, Paper ThDT11.3 | |
Personalized Robotic Achilles Tendon Utilizing a Semi-Passive Spring with Switching Stiffness |
|
Seong, Mingyu | Yeungnam University |
Heo, Hayeong | Yeungnam University |
Lee, Haseok | Yeungnam University |
Choi, Jungsu | Yeungnam University |
Keywords: Physically Assistive Devices, Wearable Robotics, Human-Centered Robotics
Abstract: Wearable robotic devices have been demonstrated to reduce muscle activation and metabolic cost during walking, but conventional motorized systems often impose significant weight and bulk, leading to user discomfort and limited portability. To address these limitations, the Robotic Achilles Tendon (RAT) was developed as a lightweight, semi-passive spring system that delivers ankle assistance exclusively during the stance phase. The RAT integrates a double-acting pneumatic cylinder and a solenoid valve to emulate spring behavior when the valve is closed and to permit unrestricted ankle motion when the valve is open. Gait-phase detection is achieved via a single inertial measurement unit mounted on the wrist, exploiting the conserved angular momentum that couples arm and leg movements. System architecture was optimized by eliminating motors and minimizing sensor count, resulting in a device weight of 0.45 kg per leg and a total weight of 1.4 kg. Performance evaluation involved surface electromyography and metabolic cost measurements in a cohort of healthy young adults. Compared to unassisted walking, the RAT reduced plantar-flexor muscle activation by 16.9% and decreased metabolic cost by 10.6%. These findings confirm that intent-based actuation of a semi-passive spring can provide effective ankle assistance with minimal hardware complexity. Future work will investigate alternative sensor locations that remain synchronized with lower-limb kinematics, simplify battery and processing modules to further reduce device mass, and extend validation to elderly and pediatric populations.
|
|
16:55-17:00, Paper ThDT11.4 | |
Instantaneous Walkability Determination Method for Almost Linear Passive Dynamic Walker with Nontrivial Limit Cycle Stability |
|
Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Sedoguchi, Taiki | Japan Advanced Institute of Science and Technology |
Keywords: Passive Walking, Dynamics, Legged Robots
Abstract: This paper proposes a novel passive dynamic walker with a body shape similar to an eight-legged rimless wheel that performs a natural swinging motion of the swing leg through storage and release of elastic energy. The generated motion is period-1 and asymptotically stable, but the inherent limit cycle stability is nontrivial because it does not achieve constraint on impact posture. Since it has almost linear dynamics, however, its walkability can be instantaneously determined using a linearized model without numerical integration. With the equations of linearized motion and exact collision, the step period and the state at the next collision can be obtained numerically and instantaneously using a bisection method based on the geometric constraint condition at impact. Then, by updating the state for each collision and repeating the same calculation, it is possible to instantaneously determine whether or not the walking motion continues stably for a long period of time. By comparing the results of this calculation with those of the numerical integration of the nonlinear and linearized models, the effectiveness of the proposed method is confirmed. Furthermore, using the proposed method, we analyze the period-doubling bifurcation phenomenon and the change in the singular values of the Poincaré map that occurs with the change in the elastic modulus.
|
|
17:00-17:05, Paper ThDT11.5 | |
SAVR: Scooping Adaptation for Variable Food Properties Via Reinforcement Learning |
|
Yow, J-Anne | Nanyang Technological University |
Ang, Wei Tech | Nanyang Technological University |
Keywords: Physically Assistive Devices, Dexterous Manipulation, Reinforcement Learning
Abstract: Personalizing bite sizes is crucial for robot-assisted feeding, as users have diverse dietary needs and preferences. However, precisely controlling the amount of food scooped remains a challenge due to variations in food properties, such as texture, granularity and cohesion. This work introduces SAVR (Scooping Adaptation for Variable food properties via Reinforcement learning), a learning-based framework that enables robots to scoop a targeted amount of food while adapting to different food characteristics. SAVR integrates Dynamic Motion Primitives (DMPs), with Reinforcement Learning (RL), where DMPs provide a structured motion representation, and RL refines execution by modifying the force term within the DMP formulation. This formulation enables efficient learning by allowing the RL agent to fine-tune the scooping trajectory rather than learning entire trajectories from scratch. Through ablation studies, we show that segmented spoon and food masks, combined with force-torque data, are essential for accurate scooping, significantly improving sim-to-real transfer. We validate SAVR on a real robotic system, demonstrating substantial improvements in accuracy and adaptability over baselines, without any additional fine-tuning.
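A toy sketch of a one-dimensional discrete dynamic movement primitive whose forcing-term weights are what an RL policy could refine for different food properties; the gains and basis-function setup are illustrative, not the paper's formulation.

```python
# Toy 1-D discrete DMP; an RL agent could adjust `weights` (the forcing term) online.
import numpy as np

alpha, beta, alpha_x, tau = 25.0, 25.0 / 4.0, 1.0, 1.0
g, y0 = 0.10, 0.0                          # goal and start scoop depth [m]
n_basis = 10
centers = np.exp(-alpha_x * np.linspace(0, 1, n_basis))
widths = n_basis ** 1.5 / centers          # common heuristic width choice
weights = np.zeros(n_basis)                # RL would refine these per food property

def forcing(x):
    psi = np.exp(-widths * (x - centers) ** 2)
    return (psi @ weights) / (psi.sum() + 1e-9) * x * (g - y0)

y, dy, x, dt = y0, 0.0, 1.0, 0.001
traj = []
for _ in range(1000):
    ddy = alpha * (beta * (g - y) - dy) + forcing(x)   # spring-damper plus learned forcing
    dy += ddy * dt / tau
    y += dy * dt / tau
    x += -alpha_x * x * dt / tau                        # canonical system decays to 0
    traj.append(y)                                      # scooping depth trajectory
```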
|
|
17:05-17:10, Paper ThDT11.6 | |
FRANC: Feeding Robot for Adaptive Needs and Personalized Care |
|
Yow, J-Anne | Nanyang Technological University |
Toh, Luke Thien Luk | Nanyang Technological University Singapore |
San, Yi Heng | Nanyang Technological University Singapore |
Ang, Wei Tech | Nanyang Technological University |
Keywords: Physically Assistive Devices, Human-Centered Robotics, Human Factors and Human-in-the-Loop
Abstract: Robot-assisted feeding systems have the potential to significantly enhance the independence and quality of life of individuals with mobility impairments. While prior work has focused on personalizing bite sequences based on user feedback provided only at the start of the feeding process, this approach assumes that users can fully articulate their preferences upfront. In reality, it is cognitively challenging for users to anticipate every detail, and their preferences may evolve during feeding. Thus, there is a need for an adaptive system that supports iterative corrections across all stages of the feeding process while maintaining context and feeding history to interpret inputs relative to earlier instructions. In this paper, we present FRANC, a novel framework for personalized RAF that leverages large language models (LLMs) with a decomposed prompting strategy to dynamically adjust bite sequence, acquisition and transfer parameters during feeding. Our approach allows iterative corrections without sacrificing consistency and accuracy. In our user studies, FRANC improved bite sequencing accuracy from 65% to 93% and enhanced user satisfaction, with participants reliably perceiving when their preferences were being integrated despite occasional execution failures. We also provide a detailed failure analysis and offer insights for developing more adaptive and effective robot-assisted feeding systems.
|
|
17:10-17:15, Paper ThDT11.7 | |
Incremental Learning for Robot Shared Autonomy |
|
Tao, Yiran | Carnegie Mellon University |
Qiao, Guixiu | National Institute of Standards and Technology |
Ding, Dan | University of Pittsburgh |
Erickson, Zackory | Carnegie Mellon University |
Keywords: Physically Assistive Devices, Long term Interaction, Learning from Demonstration
Abstract: Shared autonomy holds promise for improving the usability and accessibility of assistive robotic arms, but current methods often rely on costly expert demonstrations and remain static after pretraining, limiting their ability to handle real-world variations. Even with extensive training data, unforeseen challenges—especially those that fundamentally alter task dynamics, such as unexpected obstacles or spatial constraints—can cause assistive policies to break down, leading to ineffective or unreliable assistance. To address this, we propose ILSA, an Incrementally Learned Shared Autonomy framework that continuously refines its assistive policy through user interactions, adapting to real-world challenges beyond the scope of pre-collected data. At the core of ILSA is a structured fine-tuning mechanism that enables continual improvement with each interaction by effectively integrating limited new interaction data while preserving prior knowledge, ensuring a balance between adaptation and generalization. A user study with 20 participants demonstrates ILSA’s effectiveness, showing faster task completion and improved user experience compared to static alternatives. Code and videos are available at https://ilsa-robo.github.io/.
|
|
ThDT12 |
311B |
Vision-Based Navigation 4 |
Regular Session |
Chair: Guo, Shuxiang | Southern University of Science and Technology |
|
16:40-16:45, Paper ThDT12.1 | |
Context-Aware Graph Inference and Generative Adversarial Imitation Learning for Object-Goal Navigation in Unfamiliar Environment |
|
Meng, Yiyue | Wuhan University |
Guo, Chi | Wuhan University |
Li, Aolin | Wuhan University |
Luo, Yarong | Wuhan University |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Imitation Learning
Abstract: Object-goal navigation aims to guide an agent to find a specific target object in an unfamiliar environment based on first-person visual observations. It requires the agent to learn an informative visual representation and a robust navigation policy. To promote these two components, we propose two complementary techniques, context-aware graph inference (CGI) and generative adversarial imitation learning (GAIL). CGI improves visual representation learning by integrating object relationships, including category proximity and spatial correlation. It uses the translation on hyperplane (TransH) method to infer context-aware object relationships under the guidance of various contexts over navigation episodes, including image, action, and memory. Both CGI and GAIL aim to improve the robustness of the navigation policy, enabling the agent to escape from deadlock states, such as looping or getting stuck. GAIL is an imitation learning (IL) technique that enables the agent to learn from expert demonstrations. Specifically, we propose GAIL to address the non-discriminative reward problem that exists in object-goal navigation. GAIL designs a dynamic reward function and combines it with environment rewards, thus providing guidance for an effective navigation policy. Experiments in the AI2-Thor and RoboThor environments demonstrate that our method significantly improves the effectiveness and efficiency of navigation in unfamiliar environments.
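A small sketch of the TransH scoring used for context-aware relation inference: entity embeddings are projected onto a relation-specific hyperplane before the translation constraint h + d ≈ t is scored; dimensions and embeddings below are placeholders.

```python
# Toy TransH plausibility score for (head, relation, tail) object-relation triples.
import torch

def transh_score(h, t, w_r, d_r):
    """h, t: entity embeddings (B, dim); w_r: hyperplane normal; d_r: relation translation."""
    w_r = torch.nn.functional.normalize(w_r, dim=-1)
    h_perp = h - (h * w_r).sum(-1, keepdim=True) * w_r   # project head onto the hyperplane
    t_perp = t - (t * w_r).sum(-1, keepdim=True) * w_r   # project tail onto the hyperplane
    return -(h_perp + d_r - t_perp).norm(p=2, dim=-1)    # higher score = more plausible relation

h, t = torch.randn(4, 64), torch.randn(4, 64)
w_r, d_r = torch.randn(64), torch.randn(64)
print(transh_score(h, t, w_r, d_r))
```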
|
|
16:45-16:50, Paper ThDT12.2 | |
FrontierNet: Learning Visual Cues to Explore |
|
Sun, Boyang | ETH Zurich |
Chen, Hanzhi | Technical University of Munich |
Leutenegger, Stefan | Technical University of Munich |
Cadena, Cesar | ETH Zurich |
Pollefeys, Marc | ETH Zurich |
Blum, Hermann | Uni Bonn | Lamarr Institute |
Keywords: Vision-Based Navigation, Integrated Planning and Learning, Deep Learning for Visual Perception
Abstract: Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for tasks such as mapping, object discovery, and environmental assessment. Existing methods, such as frontier-based methods, rely heavily on 3D map operations, which are limited by map quality and often overlook valuable context from visual cues. This work aims at leveraging 2D visual cues for efficient autonomous exploration, addressing the limitations of extracting goal poses from a 3D map. We propose a frontier-based exploration system, with FrontierNet as a core component developed in this work. FrontierNet is a learning-based model that (i) detects frontiers, and (ii) predicts their information gain, from posed RGB images enhanced by monocular depth priors. Our approach provides an alternative to existing 3D-dependent exploration systems, achieving a 16% improvement in early-stage exploration efficiency, as validated through extensive simulations and real-world experiments. Source code will be released publicly.
|
|
16:50-16:55, Paper ThDT12.3 | |
Temporal Scene-Object Graph Learning for Object Navigation |
|
Chen, Lu | Tongji University |
He, Zongtao | Tongji University |
Wang, Liuyi | Tongji University |
Liu, Chengju | Tongji University |
Chen, Qijun | Tongji University |
Keywords: Vision-Based Navigation, Reinforcement Learning, Representation Learning
Abstract: Object navigation tasks require agents to locate target objects within unfamiliar indoor environments. However, the first-person perspective inherently imposes limited visibility, complicating global planning. Hence, it becomes imperative for the agent to cultivate an efficient visual representation from this restricted viewpoint. To address this, we introduce a temporal scene-object graph (TSOG) to construct an informative and efficient ego-centric visual representation. Firstly, we develop a holistic object feature descriptor (HOFD) to fully describe object features from different aspects, facilitating the learning of relationships between observed and unseen objects. Next, we propose a scene-object graph (SOG) to simultaneously learn local and global correlations between objects and agent observations, granting the agent a more comprehensive and flexible scene understanding ability. This facilitates the agent to perform target association and search more efficiently. Finally, we introduce a temporal graph aggregation (TGA) module to dynamically aggregate memory information across consecutive time steps. TGA offers the agent a dynamic perspective on historical steps, aiding in navigation towards the target in longer trajectories. Extensive experiments in AI2THOR and Gibson datasets demonstrate our method's effectiveness and efficiency for ObjectNav tasks in unseen environments.
|
|
16:55-17:00, Paper ThDT12.4 | |
Thinking before Decision: Efficient Interactive Visual Navigation Based on Local Accessibility Prediction |
|
Liu, Qinrui | Xiangtan University |
Luo, Biao | Central South University |
Zhang, Dongbo | Xiangtan University |
Chen, Renjie | Xiangtan University |
Keywords: Vision-Based Navigation, Deep Learning Methods
Abstract: Embodied AI has made prominent advances in interactive visual navigation tasks based on deep reinforcement learning. In the pursuit of higher success rates in navigation, previous work has typically focused on training embodied agents to push away interactable objects on the ground. However, such interactive visual navigation largely ignores the cost of interacting with the environment, and interactions are sometimes counterproductive (e.g., pushing an obstacle may block an existing path). Considering these scenarios, we develop an efficient interactive visual navigation method. We propose a Local Accessibility Prediction (LAP) module that enables the agent to reason about how the upcoming action will affect the environment and the navigation task before making a decision. In addition, we introduce an interaction penalty term to represent the cost of interacting with the environment. Different interaction penalties are imposed depending on the size of the obstacle pushed away. We introduce the average number of interactions as a new evaluation metric. A two-stage training pipeline is also employed to achieve better learning performance. Our experiments in the AI2-THOR environment show that our method outperforms the baseline in all evaluation metrics, achieving significant improvements in navigation performance.
|
|
17:00-17:05, Paper ThDT12.5 | |
Splat-Nav: Safe Real-Time Robot Navigation in Gaussian Splatting Maps |
|
Chen, Timothy | Stanford University |
Shorinwa, Ola | Stanford University |
Bruno, Joseph | Temple University |
Swann, Aiden | Stanford |
Yu, Javier | Stanford University |
Zeng, Weijia | University of California, San Diego |
Nagami, Keiko | Stanford University |
Dames, Philip | Temple University |
Schwager, Mac | Stanford University |
Keywords: Visual-Based Navigation, Motion and Path Planning, Collision Avoidance, Pose Estimation
Abstract: We present Splat-Nav, a real-time robot navigation pipeline for Gaussian Splatting (GSplat) scenes, a powerful new 3D scene representation. Splat-Nav consists of two components: 1) Splat-Plan, a safe planning module, and 2) Splat-Loc, a robust vision-based pose estimation module. Splat-Plan builds a safe-by-construction polytope corridor through the map based on mathematically rigorous collision constraints and then constructs a Bézier curve trajectory through this corridor. Splat-Loc provides real-time recursive state estimates given only an RGB feed from an on-board camera, leveraging the point-cloud representation inherent in GSplat scenes. Working together, these modules give robots the ability to recursively re-plan smooth and safe trajectories to goal locations. Goals can be specified with position coordinates, or with language commands by using a semantic GSplat. We demonstrate improved safety compared to point cloud-based methods in extensive simulation experiments. In a total of 126 hardware flights, we demonstrate equivalent safety and speed compared to motion capture and visual odometry, but without a manual frame alignment required by those methods. We show online re-planning at more than 2 Hz and pose estimation at about 25 Hz, an order of magnitude faster than Neural Radiance Field (NeRF)-based navigation methods, thereby enabling real-time navigation. We provide experiment videos on our project page at https://chengine.github.io/splatnav/. Our codebase and ROS nodes can be found at https://github.com/chengine/splatnav.
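As an illustration of the corridor-then-curve idea described above, the following minimal sketch (not the authors' implementation) evaluates a cubic Bézier segment from four control points with De Casteljau's algorithm; the control points, sample count, and units are hypothetical. Because a Bézier curve stays inside the convex hull of its control points, keeping the control points inside a corridor polytope keeps the sampled trajectory inside it as well.

```python
import numpy as np

def bezier_point(control_pts: np.ndarray, t: float) -> np.ndarray:
    """Evaluate a Bezier curve at parameter t in [0, 1] via De Casteljau."""
    pts = control_pts.astype(float).copy()
    while len(pts) > 1:
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

# Hypothetical control points of one corridor segment (x, y, z), in metres.
ctrl = np.array([[0.0, 0.0, 1.0],
                 [1.0, 0.5, 1.0],
                 [2.0, 0.5, 1.2],
                 [3.0, 0.0, 1.2]])

trajectory = np.array([bezier_point(ctrl, t) for t in np.linspace(0.0, 1.0, 20)])
print(trajectory[:3])
```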
|
|
17:05-17:10, Paper ThDT12.6 | |
Transferring Virtual Surgical Skills to Reality: AI Agents Mastering Surgical Decision-Making in Vascular Interventional Robotics (I) |
|
Mei, Ziyang | Xiamen University |
Wei, Jiayi | Xiamen University |
Pan, Si | Xiamen University |
Wang, Haoyun | Xiamen University |
Wu, Dezhi | Xiamen University |
Zhao, Yang | Xiamen University |
Liu, Gang | Xiamen University |
Shuxiang, Guo | Beijing Institute of Technology |
Keywords: Virtual Reality and Interfaces, Reinforcement Learning, Vision-Based Navigation
Abstract: Vascular interventional surgery offers advantages such as minimal invasiveness, quick recovery, and low side-effects. Performing automatic guidewire navigation on vascular surgical robots can effectively assist doctors in performing surgery. Deep learning and reinforcement learning methods have been widely used for guidewire navigation tasks. However, the challenge remains in making delivery decisions for complex and extended pathways, with real-time images being the only data source. The development of a network architecture, coupled with an effective training regimen for this network, is of significant importance for the advancement of autonomous systems in vascular surgical robots. Therefore, this research proposes a virtual training environment that incorporates real vascular projections. In this environment, the approach is enhanced by incorporating the guidewire tip-to-target distance into the reward function, using real-time images as input states. This article also employs a multiprocess proximal policy optimization algorithm to accelerate the training process and a multistage training approach to reduce the training difficulty. Results demonstrate the effectiveness of the approach in virtual automated guidewire navigation and show improved success rates. The proposed method generates effective inputs for the reinforcement learning agent and enables the pretrained agent to accomplish delivery tasks in real-world scenarios.
|
|
17:10-17:15, Paper ThDT12.7 | |
GSplatVNM: Point-Of-View Synthesis for Visual Navigation Models Using Gaussian Splatting |
|
Honda, Kohei | Nagoya University |
Ishita, Takeshi | CyberAgent |
Yoshimura, Yasuhiro | CyberAgent |
Yonetani, Ryo | CyberAgent |
Keywords: Vision-Based Navigation, AI-Enabled Robotics, Computer Vision for Automation
Abstract: This paper presents a novel approach to image-goal navigation by integrating 3D Gaussian Splatting (3DGS) with Visual Navigation Models (VNMs), a method we refer to as GSplatVNM. VNMs offer a promising paradigm for image-goal navigation by guiding a robot through a sequence of point-of-view images without requiring metrical localization or environment-specific training. However, constructing a dense and traversable sequence of target viewpoints from start to goal remains a central challenge, particularly when the available image database is sparse. To address these challenges, we propose a 3DGS-based viewpoint synthesis framework for VNMs that synthesizes intermediate viewpoints to seamlessly bridge gaps in sparse data while significantly reducing storage overhead. Experimental results in a photorealistic simulator demonstrate that our approach not only enhances navigation efficiency but also exhibits robustness under varying levels of image database sparsity.
|
|
ThDT14 |
311D |
Medical Robots and Systems 8 |
Regular Session |
Chair: Zhang, Dandan | Imperial College London |
|
16:40-16:45, Paper ThDT14.1 | |
Coarse-To-Fine Learning for Multi-Pipette Localisation in Robot-Assisted in Vivo Patch-Clamp |
|
Wei, Lan | Imperial College London |
Vera Gonzalez, Gema | Imperial College London |
Kgwarae, Phatsimo O | Imperial College London |
Timms, Alexander | Imperial College London |
Zahorovsky, Denis | Imperial College London |
Schultz, Simon | Imperial College London |
Zhang, Dandan | Imperial College London |
Keywords: Automation at Micro-Nano Scales, Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: In vivo image-guided multi-pipette patch-clamp is essential for studying cellular interactions and network dynamics in neuroscience. However, current procedures predominantly rely on manual expertise, which limits both accessibility and scalability. Robotic automation presents a promising solution, yet achieving precise real-time detection of multiple pipettes remains a significant challenge. Existing methods primarily focus on ex vivo experiments or single-pipette applications, rendering them inadequate for in vivo multi-pipette scenarios. To address these challenges, we propose a heatmap-augmented coarse-to-fine learning technique to enable real-time localisation of multiple pipettes for robot-assisted in vivo patch-clamp. Specifically, we introduce a Generative Adversarial Network (GAN)-based module to suppress background noise and enhance pipette visibility. This is followed by a two-stage Transformer model that begins by predicting a coarse heatmap of the pipette tips and subsequently refines this via a fine-grained coordinate regression module for precise tip localisation. To ensure robust and accurate training, we employ the Hungarian algorithm to achieve optimal matching between predicted and ground-truth tip locations. Experimental results demonstrate that our method achieves over 98% accuracy within 10 μm and over 89% accuracy within 5 μm for multi-pipette tip localisation. The average mean squared error (MSE) is 2.52 μm.
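The Hungarian matching step mentioned in the abstract can be sketched with SciPy's assignment solver; the tip coordinates below are hypothetical, and the cost is plain Euclidean distance, which may differ from the authors' exact cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical predicted and ground-truth pipette tip coordinates (micrometres).
pred = np.array([[12.0, 40.5], [55.2, 18.3], [80.1, 66.0]])
gt   = np.array([[54.0, 19.0], [11.5, 41.0], [79.0, 65.5]])

# Cost matrix of pairwise Euclidean distances; the Hungarian algorithm finds the
# assignment with minimum total distance.
cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
row_ind, col_ind = linear_sum_assignment(cost)

for r, c in zip(row_ind, col_ind):
    print(f"prediction {r} -> ground truth {c}, error {cost[r, c]:.2f} um")
```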
|
|
16:45-16:50, Paper ThDT14.2 | |
Three-Dimensional Anatomical Data Generation Based on Artificial Neural Networks |
|
Müller, Ann-Sophia | German Cancer Research Center (DKFZ) |
Jeong, Moonkwang | German Cancer Research Center (DKFZ) |
Zhang, Meng | German Cancer Research Center (DKFZ) |
Tian, Jiyuan | German Cancer Research Center |
Miernik, Arkadiusz | University of Freiburg |
Speidel, Stefanie | National Center for Tumor Diseases |
Qiu, Tian | German Cancer Research Center (DKFZ) |
Keywords: Computer Vision for Medical Robotics, Deep Learning Methods, Medical Robots and Systems
Abstract: Surgical planning and training based on machine learning requires a large number of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if even possible, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
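For reference, the intersection-over-union (IoU) metric used to compare the learned segmentation against conventional techniques can be computed as in the sketch below; the toy masks are hypothetical stand-ins for ultrasound segmentations.

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection over union of two boolean segmentation masks."""
    pred_mask = pred_mask.astype(bool)
    gt_mask = gt_mask.astype(bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred_mask, gt_mask).sum() / union

# Toy 4x4 masks standing in for predicted and ground-truth segmentations.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:4] = True
gt   = np.zeros((4, 4), dtype=bool); gt[1:3, 0:3] = True
print(f"IoU = {iou(pred, gt):.2f}")
```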
|
|
16:50-16:55, Paper ThDT14.3 | |
R2Nav: Robust, Real-Time Test Time Adaptation for Robot Assisted Endoluminal Navigation |
|
Wu, Junyang | Shanghai Jiao Tong University |
Chu, Yimin | Tongren Hospital, Shanghai Jiao Tong University School of Medici |
Peng, Haixia | Tongren Hospital, Shanghai Jiao Tong University School of Medici |
Gu, Yun | Shanghai Jiao Tong University |
Yang, Guang-Zhong | Shanghai Jiao Tong University |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems, Surgical Robotics: Planning
Abstract: Robot assisted endoluminal intervention is an emerging tool for treating luminal lesions. Vision-based endoluminal navigation, particularly through video-CT registration, is a tangible way of obtaining absolute camera position information. By using pre-operative CT data, accurate endoscope localization can be achieved, without the need of additional tracking hardware intraoperatively. However, aligning preoperative CT with intraoperative domain remains a challenge. Although approaches such as style transfer have been explored, patient-specific textures and intra-operative artifacts can significantly complicate the task. To overcome these challenges, we propose R2Nav, a robust, real-time test time adaptation method for endoluminal navigation. R2Nav constructs a confidence buffer during the testing phase, refining the model only for frames with high uncertainty. We introduce a registration-augmented model refinement strategy, which enhances both accuracy and efficiency of the system by selecting relevant training samples from the virtual gallery. Additionally, we propose a novel warm-up strategy for the registration encoder during the initial testing phase, enabling the extraction of more robust features when the model is suboptimal. Extensive validation demonstrates that R2Nav outperforms the current state-of-the-art methods, offering significant advantages for real-time, intra-operative endoluminal navigation.
|
|
16:55-17:00, Paper ThDT14.4 | |
Differentiable Rendering-Based Pose Estimation for Surgical Robotic Instruments |
|
Liang, Zekai | University of California, San Diego |
Chiu, Zih-Yun | University of California, San Diego |
Richter, Florian | University of California, San Diego |
Yip, Michael C. | University of California, San Diego |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy, Perception for Grasping and Manipulation
Abstract: Robot pose estimation is a challenging and crucial task for vision-based surgical robotic automation. Typical robotic calibration approaches, however, are not applicable to surgical robots, such as the da Vinci Research Kit (dVRK), due to joint angle measurement errors from cable-drives and the partially visible kinematic chain. Hence, previous works in surgical robotic automation used tracking algorithms to estimate the pose of the surgical tool in real-time and compensate for the joint angle errors. However, a major limitation of these previous tracking works is the initialization step, which relied only on keypoints and SolvePnP. In this work, we fully explore the potential of geometric primitives beyond keypoints, such as cylinders, with differentiable rendering, and construct a versatile pose-matching pipeline in a novel pose hypothesis space. We demonstrate the state-of-the-art performance of our single-shot calibration method with both calibration consistency and real surgical tasks. As a result, this marker-less calibration approach proves to be a robust and generalizable initialization step for surgical tool tracking.
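A toy analogue of the render-and-compare idea (not the paper's differentiable renderer or cylinder primitives) is sketched below: a planar pose is optimized with finite-difference gradient descent so that transformed model keypoints match observed points; all coordinates, step sizes, and iteration counts are hypothetical.

```python
import numpy as np

model = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])        # tool keypoints (model frame)
observed = np.array([[2.1, 1.0], [3.0, 1.55], [1.55, 1.9], [2.45, 2.45]])  # hypothetical observations

def project(pose):
    """Apply a planar rigid transform (tx, ty, theta) to the model keypoints."""
    tx, ty, th = pose
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    return model @ R.T + np.array([tx, ty])

def loss(pose):
    return np.mean(np.sum((project(pose) - observed) ** 2, axis=1))

pose = np.zeros(3)
for _ in range(500):
    # Central finite-difference gradient of the alignment residual.
    grad = np.array([(loss(pose + e) - loss(pose - e)) / (2e-4)
                     for e in np.eye(3) * 1e-4])
    pose -= 0.05 * grad
print("estimated pose (tx, ty, theta):", np.round(pose, 3))
```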
|
|
17:00-17:05, Paper ThDT14.5 | |
Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions |
|
Gafencu, Miruna-Alexandra | Technical University of Munich |
Shaban, Reem | Technical University of Munich |
Velikova, Yordanka | TU Munich |
Azampour, Mohammad Farid | Technical University of Munich |
Navab, Nassir | TU Munich |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems, AI-Based Methods
Abstract: Ultrasound (US) imaging is increasingly used in spinal procedures due to its real-time, radiation-free capabilities; however, its effectiveness is hindered by shadowing artifacts that obscure deeper tissue structures. Traditional approaches, such as CT-to-US registration, incorporate anatomical information from preoperative CT scans to guide interventions, but they are limited by complex registration requirements, differences in spine curvature, and the need for recent CT imaging. Recent shape completion methods can offer an alternative by reconstructing spinal structures in US data, while being pretrained on a large set of publicly available CT scans. However, these approaches are typically offline and have limited reproducibility. In this work, we introduce a novel integrated system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization. Our robotic platform autonomously acquires US sweeps of the lumbar spine, extracts vertebral surfaces from ultrasound, and reconstructs the complete anatomy using a deep learning-based shape completion network. This framework provides interactive, real-time visualization with the capability to autonomously repeat scans and can enable navigation to target locations. This can contribute to better consistency, reproducibility, and understanding of the underlying anatomy. We validate our approach through quantitative experiments assessing shape completion accuracy and evaluations of multiple spine acquisition protocols on a phantom setup. Additionally, we present qualitative results of the visualization on a volunteer scan.
|
|
17:05-17:10, Paper ThDT14.6 | |
SurgiPose: Estimating Surgical Tool Kinematics from Monocular Video for Surgical Robot Learning |
|
Chen, Juo-Tung | Johns Hopkins University |
Chen, Xinhao | Johns Hopkins University |
Kim, Ji Woong | Johns Hopkins University |
Scheikl, Paul Maria | None |
Cha, Jaepyeong | Children's National Hospital |
Krieger, Axel | Johns Hopkins University |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy, Imitation Learning
Abstract: Imitation learning (IL) has shown immense promise in enabling autonomous dexterous manipulations, including in learning surgical tasks. To fully unlock the potential of IL for surgery, access to clinical datasets is needed, which unfortunately lack the kinematic data required for current IL approaches. A promising source of large-scale surgical demonstrations is monocular surgical videos available online, making monocular pose estimation a crucial step toward enabling large-scale robot learning. Towards this end, we propose SurgiPose, a differentiable rendering-based approach to estimate kinematic information from monocular surgical videos, eliminating the need for direct access to ground-truth kinematics. Our method infers tool trajectories and joint angles by optimizing tool pose parameters to minimize the discrepancy between rendered and real images. To evaluate the effectiveness of our approach, we conduct experiments on two robotic surgical tasks—tissue lifting and needle pickup—using the da Vinci Research Kit Si (dVRK Si). We train imitation learning policies with both ground-truth measured kinematics and with estimated kinematics from video and compare their performance. Our results show that policies trained on estimated kinematics achieve comparable success rates to those trained on ground-truth data, demonstrating the feasibility of using monocular video-based kinematic estimation for surgical robot learning. By enabling kinematic estimation from monocular surgical videos, our work lays the foundation for large-scale learning of autonomous surgical policies from online surgical data.
|
|
17:10-17:15, Paper ThDT14.7 | |
Portable and Versatile Catheter Robot for Image-Guided Cardiovascular Interventions (I) |
|
Kantu, Nikhil Tej | North Carolina State University |
Gao, Weibo | North Carolina State University |
Srinivasan, Nitin | North Carolina State University |
Buckner, Gregory | North Carolina State University |
Su, Hao | New York University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Telerobotics and Teleoperation
Abstract: Cardiovascular disease remains the primary cause of death worldwide, necessitating the development of advanced endovascular instruments and procedures. These interventional procedures typically involve the use of guide catheters and guidewires, which are navigated through the vasculature under X-ray guidance. However, these procedures expose clinicians to prolonged radiation, posing potential health risks. Recent advances in endovascular catheter robots can mitigate the aforementioned risks by allowing teleoperation, but procedural efficacy has been hindered by their bulky designs that require designated facilities. In addition, these robots have limited compatibility with a wide range of instrument types and diameters, restricting their applicability to specific clinical interventions. To address these unmet needs, we have designed, fabricated, and experimentally validated a portable and versatile 4-DoF catheter robot that can manipulate commercially available cardiovascular instruments with diameters ranging from 1F to 9F. Furthermore, we analytically modeled the drive mechanism of the catheter robot and evaluated its tracking and insertion force and torque performance through experiments. Our portable robot (250 mm × 350 mm × 250 mm) is approximately 90% smaller than most state-of-the-art systems, e.g., the Siemens Corindus system (1780 mm × 690 mm × 1170 mm), thanks to our highly integrated direct drive motors, mechatronics design, and modular instrument routing. Experimental evaluations confirm that the robot can actuate guidewires and guide catheters at clinically relevant force and torque amplitudes, speeds, and bandwidths without risking damage to delicate vascular tissues. The clinical potential of our catheter robot is demonstrated by performing a simulated percutaneous coronary intervention (PCI) using a 3D-printed model of the human heart. The portability and versatility of this catheter robot make it applicable to a wide range of cardiovascular procedures to potentially facilitate effective treatments.
|
|
17:15-17:20, Paper ThDT14.8 | |
Learning Autonomous Surgical Irrigation and Suction with the Da Vinci Research Kit Using Reinforcement Learning (I) |
|
Ou, Yafei | University of Alberta |
Tavakoli, Mahdi | University of Alberta |
Keywords: Surgical Robotics: Laparoscopy, Reinforcement Learning, Simulation and Animation
Abstract: The irrigation-suction process is a common procedure to rinse and clean up the surgical field in minimally invasive surgery (MIS). In this process, surgeons first irrigate liquid, typically saline, into the surgical scene for rinsing and diluting the contaminant, and then suction the liquid out of the surgical field. While recent advances have shown promising results in the application of reinforcement learning (RL) for automating surgical subtasks, fewer studies have explored the automation of fluid-related tasks. In this work, we explore the automation of both steps in the irrigation-suction procedure and train two vision-based RL agents to complete irrigation and suction autonomously. To achieve this, a platform is developed for creating simulated surgical robot learning environments and for training agents, and two simulated learning environments are built for irrigation and suction with visually plausible fluid rendering capabilities. With techniques such as domain randomization (DR) and imitation learning, two agents are trained in the simulator and transferred to the real world. Individual evaluations of both agents show satisfactory real-world results. With an initial amount of around 5 grams of contaminants, the irrigation agent ultimately achieved an average of 2.21 grams remaining after a manual suction. As a comparison, fully manual operation by a human results in 1.90 grams remaining. The suction agent achieved 2.64 and 2.24 grams of liquid remaining across two trial groups with more than 20 and 30 grams of initial liquid in the container. Fully autonomous irrigation-suction trials reduce the contaminant in the container from around 5 grams to an average of 2.42 grams, although yielding a higher total weight remaining (4.40 grams) due to residual liquid not suctioned. Further information about the project is available at https://tbs-ualberta.github.io/CRESSim/.
|
|
ThDT15 |
206 |
Perception for Grasping and Manipulation 5 |
Regular Session |
|
16:40-16:45, Paper ThDT15.1 | |
A Flexible Bending Sensor Based on C-Shaped FBG Array for Curvature and Gesture Recognition |
|
Mao, Baijin | Tsinghua University |
Xiang, Yuyaocen | Tsinghua Shenzhen International Graduate School |
Huang, Yedong | Tsinghua University |
Yuan, Qiangjing | Tsinghua University |
Zhang, Yuzhu | Tsinghua University |
Tang, Zhiwei | Tsinghua University |
Qu, Juntian | Tsinghua University |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, In-Hand Manipulation
Abstract: Human joints enable precise bending for fine manipulation and complex movements. Similarly, robotic flexibility relies on bending structures, where accurate bending perception is crucial for precise control and enhanced human-robot interaction. This paper proposes a C-shaped fiber optic array, embedding a fiber Bragg Grating sensor array into a 2 mm thick silicone layer, successfully achieving a highly sensitive (300 pm/N) and electromagnetic interference-resistant bending sensor. The flexible sensor can sensitively detect external stimuli, such as the touch of a 1 g weight or a feather, and exhibits a good linear relationship with curvature, facilitating accurate curvature classification. Additionally, leveraging the wearable nature of the sensor, we achieved the detection of finger bending angles. Finally, by attaching the sensor to the wrist and combining it with deep learning algorithms, we achieved 100% gesture recognition accuracy. This sensor holds significant potential for applications in fields such as fruit size classification, rehabilitation healthcare, and human-robot interaction.
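The reported linear relationship between curvature and Bragg wavelength shift suggests a simple calibration step along the lines of the sketch below; the calibration pairs and fitted sensitivity are hypothetical, not the paper's measurements.

```python
import numpy as np

# Hypothetical calibration pairs: curvature (1/m) vs. Bragg wavelength shift (pm).
curvature = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
shift_pm  = np.array([1.0, 61.0, 119.0, 182.0, 240.0, 301.0])

# Least-squares linear fit: shift ~= sensitivity * curvature + offset.
sensitivity, offset = np.polyfit(curvature, shift_pm, deg=1)
print(f"sensitivity = {sensitivity:.1f} pm per (1/m), offset = {offset:.1f} pm")

# Invert the fit to estimate curvature from a new wavelength reading.
new_shift = 150.0
print(f"estimated curvature = {(new_shift - offset) / sensitivity:.2f} 1/m")
```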
|
|
16:45-16:50, Paper ThDT15.2 | |
Learning Generalizable Feature Fields for Mobile Manipulation |
|
Qiu, Ri-Zhao | University of California, San Diego |
Hu, Yafei | Carnegie Mellon University |
Song, Yuchen | UC San Diego |
Yang, Ge | Massachusetts Institute of Technology |
Fu, Yang | University of California San Diego |
Ye, Jianglong | UC San Diego |
Mu, Jiteng | University of San Diego |
Yang, Ruihan | UC San Diego |
Atanasov, Nikolay | University of California, San Diego |
Scherer, Sebastian | Carnegie Mellon University |
Wang, Xiaolong | UC San Diego |
Keywords: Perception for Grasping and Manipulation, Mobile Manipulation, RGB-D Perception
Abstract: An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use both for navigation and manipulation. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherent at an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation that performs in real-time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We quantitatively evaluate GeFF's ability for open-vocabulary object-/part-level manipulation and show that GeFF outperforms point-based baselines in runtime and storage-accuracy trade-offs, with qualitative examples of semantics-aware navigation and articulated object manipulation.
|
|
16:50-16:55, Paper ThDT15.3 | |
BookBot: A Robotic Manipulation Benchmark for Voice-Driven Book Recognition and Grasping in Cluttered Environments |
|
Wang, Huaqiang | Tsinghua University |
Wang, Yuan | Tsinghua University |
Li, Xiang | Tsinghua University |
Li, Yali | Tsinghua University |
Wang, Shengjin | Tsinghua University |
Keywords: Perception for Grasping and Manipulation, Data Sets for Robotic Vision, Service Robotics
Abstract: Books, as enduring repositories of cultural heritage as well as knowledge, play a fundamental role in human development. Although advances in embodied AI and robotics revolutionize automation in domains such as manufacturing and logistics, robotic book manipulation remains an underexplored frontier. Two primary bottlenecks impede progress: (1) scarcity of fine-grained annotated datasets for benchmarking robotic book manipulation, and (2) lack of unified perception-action frameworks capable of dynamically coupling multi-modal sensing and manipulation in real-world scenarios. To address these issues, we present THU-Book, the first open-access benchmark featuring 643 3D scene captures, encompassing 11,298 high-fidelity book instances with rich annotations to support tasks from book recognition and localization to grasping and re-positioning. Building upon this foundation, we develop BookBot, a novel voice-interactive book manipulation pipeline to support cross-environmental, multilingual, and multi-categorical book manipulation. First, we utilize Large Language Models (LLMs) to parse and comprehend ambiguity in user instructions. We further propose an instance segmentation module combined with an OCR tool to link language to visual instances. Finally, we introduce a PCA-based manipulation policy to refine the robotic grasp pose, utilizing the principal components of the books' geometry, improving the precision and efficiency of grasping. Experiments conducted on the THU-Book benchmark validate the effectiveness of our BookBot. The dataset is available at https://github.com/wanghq-public/BookBot.
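The PCA-based grasp-pose idea can be illustrated with a short sketch, under the assumptions that the book is represented as a raw 3D point cloud and that the thinnest principal axis serves as the approach direction; this is not the authors' exact policy, and the point cloud below is synthetic.

```python
import numpy as np

def principal_axes(points: np.ndarray):
    """Return centroid and principal axes (rows of vt) of a 3D point cloud via PCA."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt  # vt[0]: longest extent (e.g. along the cover), vt[2]: thinnest

# Hypothetical points sampled from a book-like box (long in x, thin in z), in metres.
rng = np.random.default_rng(0)
cloud = rng.uniform([-0.12, -0.09, -0.01], [0.12, 0.09, 0.01], size=(500, 3))

center, axes = principal_axes(cloud)
print("grasp centre:", np.round(center, 3))
print("approach axis (thinnest dimension):", np.round(axes[2], 3))
```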
|
|
16:55-17:00, Paper ThDT15.4 | |
Object Extrinsic Contact Surface Reconstruction through Extrinsic Contact Sensing from Visuo-Tactile Measurements |
|
Kim, Yoonjin | Korea Advanced Institute of Science and Technology(KAIST) |
Kim, Won Dong | Samsung Electronics Co., Ltd |
Kim, Jung | KAIST |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, Contact Modeling
Abstract: When manipulating an object, a robot must recognize not only the parts it directly grasps but also the surfaces in contact with the environment, which we refer to as extrinsic contact surfaces. These surfaces directly affect how the object interacts with its environment, and accurate surface estimation is critical for precise robotic manipulation. This study presents a novel framework for extrinsic contact surface reconstruction using vision-based tactile sensing. By leveraging marker-based tracking and analyzing kinematic constraints, we classify contact types and estimate the locations of both point and line contacts. To reconstruct the extrinsic contact surface, we compare three data integration methods: Mixed Vector Approach (MVA), Orthogonal Distance Regression (ODR), and Random Sample Consensus (RANSAC). Experimental results demonstrate that MVA achieves the highest accuracy in most cases by effectively integrating contact data while minimizing randomness. Experiments conducted on various object geometries validated the robustness of the proposed method, achieving an average positional error of 4.15 mm and an angular deviation of 4.58°. The results confirm that extrinsic contact sensing enables more efficient and precise object shape estimation, providing a promising approach for robotic manipulation.
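Of the three integration methods compared, RANSAC is the easiest to illustrate in isolation; the sketch below fits a plane to noisy, outlier-contaminated contact points. The point data, inlier threshold, and iteration count are hypothetical.

```python
import numpy as np

def ransac_plane(points, n_iter=200, threshold=0.002, rng=None):
    """Fit a plane (n, d) with n.p + d = 0 to noisy contact points via RANSAC."""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_model = None
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(n) < 1e-9:
            continue  # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n @ sample[0]
        inliers = np.abs(points @ n + d) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (n, d)
    return best_model, best_inliers

# Hypothetical contact points: a planar patch at z = 0.01 m plus a few outliers.
rng = np.random.default_rng(1)
plane_pts = np.column_stack([rng.uniform(-0.02, 0.02, 80),
                             rng.uniform(-0.02, 0.02, 80),
                             0.01 + rng.normal(0, 0.0005, 80)])
outliers = rng.uniform(-0.05, 0.05, (10, 3))
points = np.vstack([plane_pts, outliers])

(normal, dist), inliers = ransac_plane(points)
print("estimated normal:", np.round(normal, 3), "| inliers:", int(inliers.sum()))
```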
|
|
17:00-17:05, Paper ThDT15.5 | |
ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation? |
|
Kim, Taewhan | Peking University |
Bae, Hojin | Peking University |
Li, Zeming | Peking University |
Li, Xiaoqi | Peking University |
Ponomarenko, Iaroslav | Peking University |
Wu, Ruihai | Peking University |
Dong, Hao | Peking University |
Keywords: Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization
Abstract: Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We create a dataset of 9.9k simulated and real images to bridge the visual sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improve part-level affordance segmentation, adapting the model’s in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems. Our project page is available at: https://lxkim814.github.io/ManipGPT_website/
|
|
17:05-17:10, Paper ThDT15.6 | |
KineDepth: Utilizing Robot Kinematics for Online Metric Depth Estimation |
|
Atar, Soofiyan | University of California San Diego |
Zhi, Yuheng | University of California, San Diego |
Richter, Florian | University of California, San Diego |
Yip, Michael C. | University of California, San Diego |
Keywords: Perception for Grasping and Manipulation, Visual Servoing, Perception-Action Coupling
Abstract: Depth perception is essential for a robot's spatial and geometric understanding of its environment, with many tasks traditionally relying on hardware-based depth sensors like RGB-D or stereo cameras. However, these sensors face practical limitations, including issues with transparent and reflective objects, high costs, calibration complexity, spatial and energy constraints, and increased failure rates in compound systems. While monocular depth estimation methods offer a cost-effective and simpler alternative, their adoption in robotics is limited due to their output of relative rather than metric depth, which is crucial for robotics applications. In this paper, we propose a method that utilizes a single calibrated camera, enabling the robot to act as a "measuring stick" to convert relative depth estimates into metric depth in real-time as tasks are performed. Our approach employs an LSTM-based metric depth regressor, trained online and refined through probabilistic filtering, to accurately restore the metric depth across the monocular depth map, particularly in areas proximal to the robot's motion. Experiments with real robots demonstrate that our method significantly outperforms current state-of-the-art monocular metric depth estimation techniques, achieving a 22.1% reduction in depth error and a 52% increase in success rate for a downstream task.
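A heavily simplified analogue of the "measuring stick" idea (replacing the paper's LSTM regressor and probabilistic filtering with a per-frame affine fit) is shown below: relative depths at pixels covered by the robot are mapped to metric depths known from forward kinematics, and the fit is then applied to the whole depth map. All numbers are hypothetical.

```python
import numpy as np

# Relative (unitless) depths from a monocular network at pixels where the robot
# body is visible, and the corresponding metric depths from forward kinematics.
rel_depth_at_robot = np.array([0.31, 0.42, 0.55, 0.63, 0.78])
metric_from_kinematics = np.array([0.52, 0.68, 0.91, 1.02, 1.27])  # metres

# Least-squares affine fit: metric ~= a * relative + b.
A = np.column_stack([rel_depth_at_robot, np.ones_like(rel_depth_at_robot)])
(a, b), *_ = np.linalg.lstsq(A, metric_from_kinematics, rcond=None)

# Apply the recovered scale/shift to the full relative depth map.
rel_depth_map = np.array([[0.30, 0.50], [0.70, 0.90]])
metric_depth_map = a * rel_depth_map + b
print(np.round(metric_depth_map, 2))
```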
|
|
17:10-17:15, Paper ThDT15.7 | |
Eye-In-Finger: Smart Fingers for Delicate Assembly and Disassembly of LEGO |
|
Tang, Zhenran | Carnegie Mellon University |
Liu, Ruixuan | Carnegie Mellon University |
Liu, Changliu | Carnegie Mellon University |
Keywords: Perception for Grasping and Manipulation, Assembly, Computer Vision for Manufacturing
Abstract: Manipulation and insertion of small and tight-toleranced objects in robotic assembly remain a critical challenge for vision-based robotics systems due to the required precision and cluttered environment. Conventional global or wrist-mounted cameras often suffer from occlusions when either assembling or disassembling from an existing structure. To address the challenge, this paper introduces “Eye-in-Finger”, a novel tool design approach that enhances robotic manipulation by embedding low-cost, high-resolution perception directly at the tool tip. We validate our approach using LEGO assembly and disassembly tasks, which require the robot to manipulate in a cluttered environment and achieve sub-millimeter accuracy and robust error correction due to the tight tolerances. Experimental results demonstrate that our proposed system enables real-time, fine corrections to alignment error, increasing the tolerance of calibration error from 0.4 mm to up to 2.0 mm for the LEGO manipulation robot.
|
|
ThDT16 |
207 |
Task Planning: AI-Based Methods |
Regular Session |
Chair: Atanasov, Nikolay | University of California, San Diego |
|
16:40-16:45, Paper ThDT16.1 | |
Safe and Efficient Target Singulation with Multi-Fingered Gripper Using Collision-Free Push-Stack Synergy |
|
Kim, Hyojeong | Korea Institute of Science and Technology (KIST) |
Park, Younghoon | KIST (Korea Institute of Science and Technology) |
Yu, DoHyeon | Korea Institute of Science and Technology |
Lim, Myo-Taeg | Korea University |
Kim, ChangHwan | Korea Institute of Science and Technology |
Keywords: Task and Motion Planning, Manipulation Planning
Abstract: Target singulation involves a rearrangement of surrounding obstacles to create space for grasping the target. However, when objects are tightly packed in a confined workspace (i.e., a limited table boundary), it is not easy to relocate obstacles. Non-prehensile manipulation, such as pushing, is suitable for creating space when objects are closely placed. However, it can be risky as collisions between objects might unintentionally push them beyond the table boundary. On the other hand, prehensile motion ensures safe relocation of objects. However, the objects that can be safely grasped are limited when the packing density is too high. Thus, it may not find a rearrangement plan for target singulation. To complement both methods, we suggest using collision-free push-stack synergy for rearrangement. Collision-free push prevents objects from moving out of the boundary while efficiently relocating them, and stacking creates space in advance for safe pushing. Furthermore, we propose a modified algorithm of Local Obstacle-based Backward Search (LOBS), which generates a global rearrangement plan using only pick-and-place actions. To evaluate our method, we set up challenging scenarios with a packing density of 50% and up to 70 objects. Compared to LOBS, the success rate increased significantly with no meaningful increase in planning time. Additionally, our method outperformed other baselines as well.
|
|
16:45-16:50, Paper ThDT16.2 | |
Least Commitment Planning for the Object Scouting Problem |
|
Merlin, Max | Brown University |
Yang, Ziyi | Brown University |
Konidaris, George | Brown University |
Paulius, David | Brown University |
Keywords: Task Planning, Mobile Manipulation
Abstract: State uncertainty is a primary obstacle to effective long-horizon robot task planning. State uncertainty can be decomposed into spatial uncertainty—resolved using SLAM—and uncertainty about the objects in the environment, formalized as the object scouting problem and modeled using the Locally Observable Markov Decision Process (LOMDP). We introduce a new planning framework specifically designed for object scouting with LOMDPs called the Scouting Partial-Order Planner (SPOP), which exploits the characteristics of partial order and regression planning to plan around gaps in knowledge the robot may have about the existence, location, and state of relevant objects in its environment. Our results highlight the benefits of partial-order planning, demonstrating its suitability for object scouting due to its ability to identify absent but task-relevant objects, and show that it outperforms comparable planners in plan length, computation time, and plan execution time.
|
|
16:50-16:55, Paper ThDT16.3 | |
DRP: A Decomposition-Reflection-Prediction Framework for Long-Horizon Robot Task Planning Using Large Language Models |
|
Zheng, Zhaowen | North China University of Technology |
Zhao, Zhuofeng | North China University of Technology |
Wang, Haocen | North China University of Technology |
Wang, Jing | North China University of Technology |
Keywords: Task Planning, AI-Based Methods, Semantic Scene Understanding
Abstract: Large language models have demonstrated powerful reasoning capabilities, and their integration with robotics has revolutionized human-computer interaction and automated task planning. However, LLMs are unaware of environmental knowledge and possible state changes in the environment during planning, which can make the generated task plans unexecutable, particularly when dealing with complex long-horizon tasks involving crowded objects and dynamic relations. In this paper, we propose an LLM-based robot task planning framework with support for environmental knowledge injection, called DRP (Decomposition-Reflection-Prediction). The DRP framework combines LLMs with rule-based task decomposition, multi-perspective reflection, and environmental prediction to generate admissible actions for complex long-horizon tasks. We only leverage few-shot prompting to implement our framework, which avoids the need for additional model training work. Experiments on the VirtualHome household task dataset show that the task plans generated by our method improve executability by 25.23%, the subgoal success rate by 64.29%, and the success rate by 58.06%, in comparison to state-of-the-art baseline methods.
|
|
16:55-17:00, Paper ThDT16.4 | |
Large Language Model-Based Robot Task Planning from Voice Command Transcriptions |
|
Certo, Afonso | Instituto Superior Técnico |
Martins, Bruno | University of Lisbon, IST and INESC-ID |
Azevedo, Carlos | Instituto Superior Técnico - Institute for Systems and Robotics |
Lima, Pedro U. | Instituto Superior Técnico - Institute for Systems and Robotics |
Keywords: Task Planning, Service Robotics, AI-Enabled Robotics
Abstract: One of the primary challenges in building a General Purpose Service Robot (GPSR), a robot capable of executing generic human commands, lies in understanding natural language instructions. These instructions often contain speech recognition errors and incomplete information, complicating the extraction of clear goals and the formulation of an efficient and effective action plan. This work presents an end-to-end pipeline that leverages a Large Language Model to directly translate instruction transcripts into coherent action plans. Furthermore, the pipeline integrates environmental context into the model’s input, allowing for the generation of more efficient and context-aware plans. The system’s performance was evaluated using a simulator based on Generalized Stochastic Petri Nets, achieving a success rate of around 55% on the ALFRED dataset, even in unseen environments. The entire pipeline was also successfully deployed at RoboCup 2024 in Eindhoven, where it secured second place in the GPSR task. The code, dataset and models are available at https://github.com/socrob/llm_gpsr.
|
|
17:00-17:05, Paper ThDT16.5 | |
Safety Aware Task Planning Via Large Language Models in Robotics |
|
Khan, Azal Ahmad | University of Minnesota |
Andrev, Michael | University of Minnesota |
Murtaza, Muhammad Ali | Georgia Institute of Technology |
Aguilera, Sergio | Pontificia Universidad Catolica De Chile |
Zhang, Rui | University of Minnesota |
Ding, Jie | Harvard University |
Hutchinson, Seth | Georgia Institute of Technology |
Anwar, Ali | University of Minnesota |
Keywords: Task Planning, Robot Safety, AI-Based Methods
Abstract: The integration of large language models (LLMs) into robotic task planning has unlocked better reasoning capabilities for complex, long-horizon workflows. However, ensuring safety in LLM-driven plans remains a critical challenge, as these models often prioritize task completion over risk mitigation. This paper introduces SAFER (Safety-Aware Framework for Execution in Robotics), a multi-LLM framework designed to embed safety awareness into robotic task planning. SAFER employs a Safety Agent that operates alongside the primary task planner, providing safety feedback. Additionally, we introduce LLM-as-a-Judge, a novel metric leveraging LLMs as evaluators to quantify safety violations within generated task plans. Our framework integrates safety feedback at multiple stages of execution, enabling real-time risk assessment, proactive error correction, and transparent safety evaluation. We also integrate a control framework using Control Barrier Functions (CBFs) to ensure safety guarantees within SAFER’s task planning. We evaluate SAFER against state-of-the-art LLM planners on complex, long-horizon tasks involving heterogeneous robotic agents, demonstrating its effectiveness in reducing safety violations while maintaining task efficiency. We also verify the task planner and safety planner through actual hardware experiments involving multiple robots and a human.
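The Control Barrier Function component can be illustrated with a minimal single-integrator example; the safety filter below projects a nominal command onto the CBF constraint in closed form. The obstacle geometry, gains, and dynamics are hypothetical and far simpler than the heterogeneous robots discussed in the paper.

```python
import numpy as np

def cbf_filter(x, u_nom, x_obs, radius, alpha=1.0):
    """Minimally modify u_nom so a single integrator keeps h(x) >= 0,
    where h(x) = ||x - x_obs||^2 - radius^2 is a simple control barrier function."""
    h = np.sum((x - x_obs) ** 2) - radius ** 2
    grad_h = 2.0 * (x - x_obs)
    # Safety constraint: grad_h . u >= -alpha * h. Project u_nom onto that half-space.
    violation = grad_h @ u_nom + alpha * h
    if violation >= 0:
        return u_nom  # nominal command already satisfies the CBF constraint
    return u_nom - (violation / (grad_h @ grad_h)) * grad_h

# Toy rollout: drive toward a goal while the CBF keeps the robot off an obstacle.
x, goal = np.array([0.0, 0.0]), np.array([4.0, 0.0])
x_obs, r, dt = np.array([2.0, 0.1]), 0.5, 0.05
for _ in range(200):
    u = cbf_filter(x, goal - x, x_obs, r)
    x = x + dt * u
print("final position:", np.round(x, 2))
```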
|
|
17:05-17:10, Paper ThDT16.6 | |
LATMOS: Latent Automaton Task Model from Observation Sequences |
|
Zhan, Weixiao | University of California, San Diego |
Dong, Qiyue | University of California, San Diego |
Sebastián, Eduardo | University of Cambridge |
Atanasov, Nikolay | University of California, San Diego |
Keywords: Task Planning, Learning from Demonstration, Representation Learning
Abstract: Robot task planning from high-level instructions is an important step towards deploying fully autonomous robot systems in the service sector. Three key aspects of robot task planning present challenges yet to be resolved simultaneously, namely, (i) factorization of complex tasks specifications into simpler executable subtasks, (ii) understanding of the current task state from raw observations, and (iii) planning and verification of task executions. To address these challenges, we propose LATMOS, an automata-inspired task model that, given observations from correct task executions, is able to factorize the task, while supporting verification and planning operations. LATMOS combines an observation encoder to extract the features from potentially high-dimensional observations with automata theory to learn a sequential model that encapsulates an automaton with symbols in the latent feature space. We conduct extensive evaluations in three task model learning setups: (i) abstract tasks described by logical formulas, (ii) real-world human tasks described by videos and natural language prompts and (iii) a robot task described by image and state observations. The results demonstrate the improved plan generation and verification capabilities of LATMOS across observation modalities and tasks.
|
|
17:10-17:15, Paper ThDT16.7 | |
VLIN-RL: A Unified Vision-Language Interpreter and Reinforcement Learning Motion Planner Framework for Robot Dynamic Tasks |
|
Jiang, Zewu | Tianjin University |
Zhang, Junnan | Tianjin University |
Wang, Ke | Tianjin University |
Si, Chenyi | Tianjin University |
Keywords: Task and Motion Planning, AI-Based Methods, Reinforcement Learning
Abstract: Recently, with the development of Large Language Models (LLMs), Embodied AI represented by Vision-Language-Action Models (VLAs) has played a significant role in realizing natural language interaction between humans and robots. Current VLA models can process and understand visual information and language instructions, guiding robots to complete interactive tasks with the environment based on human language instructions. However, when tackling real-time, dynamic tasks, VLA models show poor robustness and limited real-time planning and adjustment ability against changes in target objects, instructions, and environments. To handle these limitations, we propose VLIN-RL, a unified framework that consists of a Vision-Language Interpreter (VLIN), which provides strong vision-language understanding and high-level task planning abilities, and a reinforcement learning (RL)-based motion planner with enhanced flexibility and broader applicability. If the environmental state changes during task execution, the RL planning module in VLIN-RL directly makes dynamic adjustments at the subtask level based on visual feedback to achieve the task goals, without the need for time-consuming information processing from VLIN. Experiments demonstrate that our model can complete multi-robot manipulation tasks more efficiently and stably. Finally, our work is verified through pick-and-grasp tasks and experiments on real manipulators. The test video is available at https://github.com/jzw-soulferryman/VLIN-RL.git.
|
|
17:15-17:20, Paper ThDT16.8 | |
CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments |
|
Lei, Mingcong | The Chinese University of Hong Kong, Shenzhen |
Wang, Ge | Future Network of Intelligence Institution, the Chinese Universi |
Zhao, Yiming | Harbin Engineering University |
Mai, Zhixin | The Chinese University of Hong Kong, Shenzhen |
Zhao, Qing | Room 4603, Changfu Jinmao Building, Hetao, Futian District, Shen |
Guo, Yao | Shanghai Jiao Tong University |
Li, Zhen | Shenzhen Research Institute of Big Data |
Cui, Shuguang | The Chinese University of Hong Kong, Shenzhen |
Han, Yatong | Chinese University of Hongkong Shenzhen |
Ren, Jinke | The Chinese University of Hong Kong, Shenzhen |
Keywords: Task Planning, Agent-Based Systems, AI-Based Methods
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities in the hierarchical decomposition of complex tasks through semantic reasoning. However, their application in embodied systems faces challenges in ensuring reliable execution of subtask sequences and achieving one-shot success in long-term task completion. To address these limitations in dynamic environments, we propose the Closed-Loop Embodied Agent (CLEA), a novel architecture incorporating four specialized open-source LLMs with functional decoupling for closed-loop task management. The framework features two core innovations: (1) an interactive task planner that dynamically generates executable subtasks based on environmental memory, and (2) a multimodal execution critic that employs an evaluation framework to conduct a probabilistic assessment of action feasibility, triggering hierarchical re-planning mechanisms when environmental perturbations exceed preset thresholds. To validate CLEA's effectiveness, we conduct experiments in a real environment with manipulable objects, using two heterogeneous robots for object search, manipulation, and search-manipulation integration tasks. Across 12 task trials, CLEA outperforms the baseline model, achieving a 67.3% improvement in success rate and a 52.8% increase in task completion rate. These results demonstrate that CLEA significantly enhances the robustness of task planning and execution in dynamic environments. Our code is available at https://sp4595.github.io/CLEA/.
|
|
ThDT17 |
210A |
Field Robots 4 |
Regular Session |
Chair: Nikolakopoulos, George | Luleå University of Technology |
|
16:40-16:45, Paper ThDT17.1 | |
Enhancing UAV Energy Efficiency and Versatility through Trimodal Ground, Hovering, and Fixed-Wing Locomotion Modes |
|
Vale, Afonso | Instituto Superior Técnico |
Basiri, Meysam | Instituto Superior Técnico |
Afonso, Frederico | Instituto Superior Técnico |
Keywords: Field Robots, Energy and Environment-Aware Automation, Motion Control
Abstract: Multimodal Unmanned Aerial Vehicles (UAVs), capable of operating in different locomotion modes, offer greater versatility and optimized energy usage. This paper presents a novel trimodal UAV that integrates ground locomotion, hovering, and fixed-wing flight using a shared actuator system. The design features a quadcopter frame with modular components, including passive wheels for ground mobility and fixed wings for forward flight. A control system for ground locomotion was implemented within the ArduPilot framework, enabling autonomous waypoint navigation across all modes. The prototype was extensively tested, with a comprehensive energy efficiency evaluation conducted through wind tunnel experiments and flight trials. In forward flight, the vehicle’s range increased, although its endurance decreased. Ground mode saw significant gains in both. Wing incidence tuning in hover improved endurance and range but reduced controllability. Additionally, the vehicle was shown to be capable of climbing inclined surfaces, such as walls.
|
|
16:45-16:50, Paper ThDT17.2 | |
MARSCalib: Multi-Robot, Automatic, Robust, Spherical Target-Based Extrinsic Calibration in Field and Extraterrestrial Environments |
|
Jeong, Seokhwan | Inha University |
Kim, Hogyun | Inha University |
Cho, Younggun | Inha University |
Keywords: Field Robots, Space Robotics and Automation, Robotics in Hazardous Fields
Abstract: This paper presents a novel spherical target-based LiDAR-camera extrinsic calibration method designed for outdoor environments with multi-robot systems, considering both target and sensor corruption. The method extracts the 2D ellipse center from the image and the 3D sphere center from the pointcloud, which are then paired to compute the transformation matrix. Specifically, the image is first decomposed using the Segment Anything Model (SAM). Then, a novel algorithm extracts an ellipse from a potentially corrupted sphere, and the extracted ellipse’s center is corrected for errors caused by the perspective projection model. For the LiDAR pointcloud, points on the sphere tend to be highly noisy due to the absence of flat regions. To accurately extract the sphere from these noisy measurements, we apply a hierarchical weighted sum to the accumulated pointcloud. Through experiments, we demonstrated that the sphere can be robustly detected even under both types of corruption, outperforming other targets. We evaluated our method using three different types of LiDARs (spinning, solid-state, and non-repetitive) with cameras positioned in three different locations. Furthermore, we validated the robustness of our method to target corruption by experimenting with spheres subjected to various types of degradation. These experiments were conducted in both a planetary test and a field environment. Our code is available at https://github.com/sparolab/MARSCalib.
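Extracting the 3D sphere centre from LiDAR points is the classical part of this pipeline; a linear least-squares sphere fit (not the paper's hierarchical weighted-sum scheme) can serve as a baseline sketch. The target radius, pose, and noise level below are hypothetical.

```python
import numpy as np

def fit_sphere(points: np.ndarray):
    """Algebraic least-squares sphere fit; returns (centre, radius)."""
    # From ||p - c||^2 = r^2:  2 c.p + (r^2 - ||c||^2) = ||p||^2, linear in (c, k).
    A = np.column_stack([2.0 * points, np.ones(len(points))])
    b = np.sum(points ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    centre, k = sol[:3], sol[3]
    radius = np.sqrt(k + centre @ centre)
    return centre, radius

# Hypothetical noisy LiDAR hits on a 0.15 m calibration sphere centred at (1, 2, 0.5).
rng = np.random.default_rng(2)
dirs = rng.normal(size=(300, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
pts = np.array([1.0, 2.0, 0.5]) + 0.15 * dirs + rng.normal(0, 0.003, (300, 3))

centre, radius = fit_sphere(pts)
print("centre:", np.round(centre, 3), "radius:", round(float(radius), 3))
```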
|
|
16:50-16:55, Paper ThDT17.3 | |
A Novel Perspective for Source Localization in Underwater Active Electrosense Robots Based on Sparse Signal Reconstruction |
|
Jiang, Guangyu | Xi'an Jiaotong University |
Hu, Qiao | Xi'an Jiaotong University |
Fu, Tongqiang | Xi'an Jiaotong University |
Rong, Yi | Xi'an Jiaotong University |
Li, Shuo | Xi'an Jiaotong University |
Wang, Pengtao | Xi'an Jiaotong University |
Han, Binya | Xi'an Jiaotong University |
Keywords: Biomimetics, Marine Robotics
Abstract: Weakly electric fish can detect and localize objects in dark and turbid environments by sensing the perturbations induced by objects in their self-generated electric field. Massive efforts have been made to develop active electrosense systems for underwater robots that can rival those of fish. However, a well-performing localization method remains a challenge for underwater active electrosense robots. In this paper, we investigated the underwater active electrolocation problem from the novel perspective of sparse signal reconstruction. The source localization problem is formulated as a sparse signal recovery problem. A weighting matrix is designed to enhance sparsity, and a multi-resolution grid search strategy is introduced to reduce the computational complexity. Experiments were conducted on a carefully designed underwater robot prototype. The results validate the proposed method’s effectiveness and superiority over existing works. Our work provides a promising and robust method to meet the need for precise localization of small underwater objects by robots in turbid and cramped environments.
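The sparse-recovery view of electrolocation can be sketched with a generic (weighted) LASSO solved by iterative shrinkage-thresholding over a grid dictionary; the dictionary, electrode count, and regularization weight below are hypothetical and do not reproduce the paper's forward model.

```python
import numpy as np

def ista(A, y, lam=0.5, weights=None, n_iter=300):
    """Iterative shrinkage-thresholding for min 0.5*||A s - y||^2 + lam*||W s||_1."""
    weights = np.ones(A.shape[1]) if weights is None else weights
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth term
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ s - y)
        z = s - grad / L
        thr = lam * weights / L
        s = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft threshold
    return s

# Toy example: 32 electrode measurements, 100 candidate source positions on a grid,
# with the true perturbation concentrated on a single grid cell.
rng = np.random.default_rng(3)
A = rng.normal(size=(32, 100))
s_true = np.zeros(100); s_true[37] = 1.0
y = A @ s_true + rng.normal(0, 0.01, 32)

s_hat = ista(A, y, lam=0.5)
print("strongest recovered grid cell:", int(np.argmax(np.abs(s_hat))))
```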
|
|
16:55-17:00, Paper ThDT17.4 | |
Compact Posture Control System for Jumping Robot Using an Air Reaction Wheel (I) |
|
Kim, Myeong-Jin | Daegu Gyeongbuk Institute of Science and Technology |
Kim, Jisu | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Yun, Dongwon | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Biologically-Inspired Robots, Biomimetics, Legged Robots
Abstract: In this article, we propose a novel balance control method called air reaction wheel (ARW) for a small-scale legged jumping robot, which can generate high torque while being compact and lightweight. The ARW generates torque in the same direction through the combination of the torque induced by pushing the air, and the moment of inertia of ARW with the angular acceleration of the motor, resulting in high torque performance while being lighter and more compact than conventional balance control mechanisms. To validate the torque performance of the ARW, we conduct dynamic analysis and computational fluid dynamics simulations of the ARW and utilize a central composite face model to find the optimal shape for generating high torque. Furthermore, we verify that the proposed method generates high torque while being compact and lightweight compared to the conventional methods through theoretical analysis and comparative experiments on the ARW and existing mechanisms regarding torque performance. Finally, we conducted jump and landing experiments by attaching the optimized ARW model to a jumping robot, and through experimental results, we verified that the proposed mechanism contributes to stable jumping and landing.
|
|
17:00-17:05, Paper ThDT17.5 | |
A Meniscus-Like Structure in Anthropomorphic Joints to Attenuate Impacts |
|
Yang, Lianxin | Beihang University |
Zhao, Zhihua | Tsinghua University |
Keywords: Biomimetics, Compliant Joint/Mechanism, Physical Human-Robot Interaction, Impact Attenuation
Abstract: During robotic locomotion, shock forces from ground impact propagate through the leg and may cause fatigue or damage to joints and sensitive hardware. To attenuate impacts in diverse aspects and directions, multiple approaches, including active control strategies and passive compliant joints, are essential for providing a comprehensive solution. Here, inspired by human knees, a meniscus-like structure was developed for a compliant anthropomorphic joint to provide a complementary way of shock absorption, especially along the axial direction. The proposed meniscus-like structure comprises a pair of curved arms wrapped with preloaded elastic bands whose elongations produce restoring forces against the axial load. This structure simultaneously realized impact attenuation in three aspects: decreasing contact stress by designing consistently conformal contact interfaces under axial movement; reducing peak impact forces by tuning load-displacement curves to obtain a high-static-low-dynamic nonlinear stiffness; and dissipating energy by hysteresis due to sliding friction. The effectiveness in attenuating impacts on robotic legs was further verified by analytical analyses and impact experiments, which showed that it outperforms regular elastic buffers at multiple leg configurations. Inserting meniscus-like structures into anthropomorphic joints efficiently utilizes the joint space to attenuate axial impacts, complementing existing interaction-safety approaches for the robot community.
|
|
17:05-17:10, Paper ThDT17.6 | |
Development of Variable Chain Motor with Shape and Speed-Torque Characteristics Variability and Its Application to a Humanoid |
|
Tada, Hiromi | The University of Tokyo |
Hirai, Jin | The University of Tokyo |
Hiraoka, Takuma | The University of Tokyo |
Konishi, Masanori | The University of Tokyo |
Himeno, Tomoya | The University of Tokyo |
Kojima, Kunio | The University of Tokyo |
Okada, Kei | The University of Tokyo |
Keywords: Actuation and Joint Mechanisms, Humanoid Robot Systems
Abstract: Various methods have been proposed to achieve high output torque and a wide output range for fast and high-load robotic motions. However, in robots composed of slender frames, such as humanoid robots, the limited space available for actuators and transmission components restricts the application of conventional methods. In this paper, we propose a Variable Chain Motor (VC Motor), an electric actuator that features both shape variability and speed-torque characteristics variability. Shape variability refers to the ability of the actuator to change its form during operation. This property enhances output torque by enabling a dense motor arrangement even under spatial constraints imposed by the frame structure. For example, the actuator can be placed across adjacent frames and deform according to joint rotation. Speed-torque characteristics variability allows switching output characteristics during operation using a dedicated electrical circuit. This enables an expanded range of output speed and torque without significantly increasing size or weight. We evaluated the performance of the developed VC Motor by measuring output torque and efficiency. Furthermore, by applying the VC Motor to the elbow joint of a humanoid robot, we demonstrated its capability for high-speed and high-load operations.
|
|
ThDT18 |
210B |
Mapping 4 |
Regular Session |
Chair: Vidal-Calleja, Teresa A. | University of Technology Sydney |
|
16:40-16:45, Paper ThDT18.1 | |
VDB-GPDF: Online Gaussian Process Distance Field with VDB Structure |
|
Wu, Lan | University of Technology Sydney |
Le Gentil, Cedric | University of Toronto |
Vidal-Calleja, Teresa A. | University of Technology Sydney |
Keywords: Mapping, RGB-D Perception
Abstract: Robots reason about the environment through dedicated representations. Popular choices for dense representations exploit Truncated Signed Distance Functions (TSDF) and Octree data structures. However, TSDF provides a projective or non-projective signed distance obtained directly from depth measurements that overestimate the Euclidean distance. Octrees, despite being memory efficient, require tree traversal and can lead to increased runtime in large scenarios. Other representations based on the Gaussian Process (GP) distance fields are appealing due to their probabilistic and continuous nature, but the computational complexity is a concern. In this paper, we present an online efficient mapping framework that seamlessly couples GP distance fields and the fast-access OpenVDB data structure. This framework incrementally builds the Euclidean distance field and fuses other surface properties, like intensity or colour, into a global scene representation that can cater for large-scale scenarios. The key aspect is a latent Local GP Signed Distance Field (L-GPDF) contained in a local VDB structure that allows fast queries of the Euclidean distance, surface properties and their uncertainties for arbitrary points in the field of view. Probabilistic fusion is then performed by merging the inferred values of these points into a global VDB structure that is efficiently maintained over time. After fusion, the surface mesh is recovered, and a global GP Signed Distance Field (G-GPDF) is generated and made available for downstream applications to query accurate distance and gradients. A comparison with the state-of-the-art frameworks shows superior efficiency and accuracy of the inferred distance field and comparable reconstruction performance.
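As a rough illustration of a GP distance field, the sketch below fits a generic Gaussian-process regression to signed-distance samples and returns a distance mean and variance for arbitrary query points. The paper's latent L-GPDF formulation, OpenVDB storage, and probabilistic fusion are not reproduced; this only shows the query interface the abstract describes.

```python
import numpy as np

def rbf(X1, X2, length=0.2, sigma=1.0):
    # Squared-exponential kernel between two point sets.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma ** 2 * np.exp(-0.5 * d2 / length ** 2)

# Training data: 3D points near a surface with their signed distances (toy sphere).
X = np.random.uniform(-0.5, 0.5, size=(200, 3))
y = np.linalg.norm(X, axis=1) - 0.3

noise = 1e-3
K = rbf(X, X) + noise * np.eye(len(X))
alpha = np.linalg.solve(K, y)

def query(points):
    """Return distance mean and variance for arbitrary query points."""
    Ks = rbf(points, X)
    mean = Ks @ alpha
    var = rbf(points, points).diagonal() - np.einsum(
        "ij,ij->i", Ks, np.linalg.solve(K, Ks.T).T)
    return mean, var

q = np.array([[0.0, 0.0, 0.0], [0.4, 0.0, 0.0]])
print(query(q))   # roughly -0.3 inside the toy sphere, +0.1 outside
```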
|
|
16:45-16:50, Paper ThDT18.2 | |
Particle-Based Instance-Aware Semantic Occupancy Mapping in Dynamic Environments |
|
Chen, Gang | Delft University of Technology |
Wang, Zhaoying | Shanghai Jiao Tong University |
Dong, Wei | Shanghai Jiao Tong University |
Alonso-Mora, Javier | Delft University of Technology |
Keywords: Mapping, Semantic Scene Understanding, RGB-D Perception, Dynamic Environment Representation
Abstract: Representing the 3D environment with instance-aware semantic and geometric information is crucial for interaction-aware robots in dynamic environments. Nonetheless, creating such a representation poses challenges due to sensor noise, instance segmentation and tracking errors, and the objects' dynamic motion. This paper introduces a novel particle-based instance-aware semantic occupancy map to tackle these challenges. Particles with an augmented instance state are used to estimate the Probability Hypothesis Density (PHD) of the objects and implicitly model the environment. Utilizing a State-augmented Sequential Monte Carlo PHD (S2MC-PHD) filter, these particles are updated to jointly estimate occupancy status, semantic, and instance IDs, mitigating noise. Additionally, a memory module is adopted to enhance the map's responsiveness to previously observed objects. Experimental results on the Virtual KITTI 2 dataset demonstrate that the proposed approach surpasses state-of-the-art methods across multiple metrics under different noise conditions. Subsequent tests using real-world data further validate the effectiveness of the proposed approach.
|
|
16:50-16:55, Paper ThDT18.3 | |
OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding |
|
Yang, Dianyi | Beijing Institute of Technology |
Wang, Xihan | Beijing Institute of Technology |
Gao, Yu | Beijing Institute of Technology |
Shiyang, Liu | Beijing Institute of Technology |
Ren, Bohan | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Keywords: Mapping, Semantic Scene Understanding, RGB-D Perception
Abstract: Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications. Nevertheless, existing methods are hindered by rigid offline pipelines and the inability to provide precise 3D object-level understanding given open-ended queries. In this paper, we present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. OpenGS-Fusion combines 3D Gaussian representations with a Truncated Signed Distance Field to facilitate lossless fusion of semantic features on-the-fly. Furthermore, we introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds, achieving a 17% improvement in 3D mIoU compared to the fixed threshold strategy. Extensive experiments demonstrate that our method outperforms existing methods in 3D object understanding and scene reconstruction quality, as well as showcasing its effectiveness in language-guided scene interaction.
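The adaptive-thresholding idea can be sketched as follows: per-Gaussian semantic features are scored against a query embedding, and the segmentation threshold is derived from the score distribution rather than fixed. The adaptation rule below (an offset from the strongest response) is an assumption for illustration; the paper's MLLM-assisted adjustment is not modelled.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-9)

feats = np.random.randn(5000, 512)              # per-Gaussian semantic features
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
query = np.random.randn(512)
query /= np.linalg.norm(query)                  # text-query embedding

scores = cosine(feats, query)

fixed_mask = scores > 0.25                      # fixed-threshold baseline

# Adaptive rule (illustrative): threshold relative to the strongest response, so
# objects whose similarity is globally weaker are still segmented.
tau = scores.max() - 0.5 * scores.std()
adaptive_mask = scores > tau

print(fixed_mask.sum(), "points kept (fixed) vs", adaptive_mask.sum(), "(adaptive)")
```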
|
|
16:55-17:00, Paper ThDT18.4 | |
OpenMIGS: Multi-Granularity Information-Preserving Open-Vocabulary 3D Gaussian Splatting |
|
Zhao, Jingyu | Beijing Institute of Technology |
Wang, Jiahui | Beijing Institute of Technology |
Deng, Yinan | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Mapping, Semantic Scene Understanding
Abstract: Open-vocabulary scene understanding is critical for robotics, yet existing 3D Gaussian Splatting (3DGS) methods rely on compressed feature embeddings, compromising semantic fidelity and fine-grained interpretation. Although utilizing uncompressed high-dimensional features offers a potential solution, their direct integration imposes prohibitive memory and computational costs. To address this challenge, we propose OpenMIGS, a novel 3DGS-based framework for multi-granularity, information-preserving open-vocabulary understanding across both object and part levels. Specifically, OpenMIGS first constructs object-level Gaussian fields as structured carriers, where a two-stage clustering strategy ensures global consistency in object labeling, and a codebook subsequently associates these object labels with their uncompressed high-dimensional features. Building on this, a lightweight implicit field processes the geometric coordinates of object Gaussians to regress part-level high-dimensional features, enabling multi-granularity understanding. Experimental results on multiple datasets show that OpenMIGS outperforms existing methods in open-vocabulary understanding and retrieval tasks. It also supports multi-granularity scene editing for flexible semantic manipulation. The code is available at https://github.com/jingyuzhao1010/OpenMIGS.
|
|
17:00-17:05, Paper ThDT18.5 | |
Towards Autonomous Indoor Parking: A Globally Consistent Semantic SLAM System and a Semantic Localization Subsystem |
|
Sha, Yichen | Shanghai Jiao Tong University |
Zhu, Siting | Shanghai Jiao Tong University |
Guo, Hekui | DXR |
Wang, Zhong | Shanghai Jiao Tong University |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Mapping, SLAM, Localization
Abstract: We propose a globally consistent semantic SLAM system (GCSLAM) and a semantic-fusion localization subsystem (SF-Loc), which achieve accurate semantic mapping and robust localization in complex parking lots. Visual cameras (front-view and surround-view), an IMU, and a wheel encoder form the input sensor configuration of our system. The first part of our work is GCSLAM. GCSLAM introduces a semantic-constrained factor graph for the optimization of poses and the semantic map, which incorporates innovative error terms based on multi-sensor data and BEV (bird's-eye view) semantic information. Additionally, GCSLAM integrates a Global Slot Management module that stores and manages parking slot observations. SF-Loc is the second part of our work, which leverages the semantic map built by GCSLAM to conduct map-based localization. SF-Loc integrates registration results and odometry poses with a novel factor graph. Our system demonstrates superior performance over existing SLAM systems on two real-world datasets, showing excellent capabilities in robust global localization and precise semantic mapping.
|
|
17:05-17:10, Paper ThDT18.6 | |
II-NVM: Enhancing Map Accuracy and Consistency with Normal Vector-Assisted Mapping |
|
Zhao, Chengwei | Hangzhou Qisheng Intelligent Technology Company Limited |
Li, Yixuan | Xi'an Jiaotong University |
Jian, Yina | Columbia University in the City of New York |
Xu, Jie | Harbin Institute of Technology |
Wang, Linji | George Mason University |
Ma, Yongxin | Shandong University |
Jin, Xinglai | Hangzhou Qisheng Intelligent Technology Co. Ltd |
Keywords: Mapping, SLAM, Localization
Abstract: SLAM technology plays a crucial role in indoor mapping and localization. A common challenge in indoor environments is the "double-sided mapping issue," where closely positioned walls, doors, and other surfaces are mistakenly identified as a single plane, significantly hindering map accuracy and consistency. To address this issue, this paper introduces a SLAM approach that ensures accurate mapping using normal vector consistency. We enhance the voxel map structure to store both point cloud data and normal vector information, enabling the system to evaluate consistency during nearest neighbor searches and map updates. This process distinguishes between the front and back sides of surfaces, preventing incorrect point-to-plane constraints. Moreover, we implement an adaptive radius KD-tree search method that dynamically adjusts the search radius based on the local density of the point cloud, thereby enhancing the accuracy of normal vector calculations. To further improve real-time performance and storage efficiency, we incorporate a Least Recently Used (LRU) cache strategy, which facilitates efficient incremental updates of the voxel map. The code is released as open-source and validated in both simulated environments and real indoor scenarios. Experimental results demonstrate that this approach effectively resolves the "double-sided mapping issue" and significantly improves mapping precision. Additionally, we have developed and open-sourced the first simulation and real-world dataset specifically tailored for the "double-sided mapping issue."
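A minimal sketch of the two mechanisms named in this abstract, under assumed data structures: a hash-voxel map that stores a normal with every point and rejects correspondences whose normals point in opposite directions (the double-sided case), and an LRU policy that evicts long-unused voxels. The class and parameters are illustrative, not the released code.

```python
import numpy as np
from collections import OrderedDict

VOXEL = 0.1          # voxel edge length [m]
CAPACITY = 100_000   # maximum number of resident voxels

class VoxelMap:
    def __init__(self):
        self.voxels = OrderedDict()            # voxel key -> list of (point, normal)

    def _key(self, p):
        return tuple(np.floor(p / VOXEL).astype(int))

    def insert(self, point, normal):
        k = self._key(point)
        self.voxels.setdefault(k, []).append((point, normal))
        self.voxels.move_to_end(k)             # mark voxel as recently used
        if len(self.voxels) > CAPACITY:
            self.voxels.popitem(last=False)    # evict least recently used voxel

    def match(self, point, normal, cos_min=0.0):
        """Nearest stored point in the same voxel whose normal is consistent."""
        best, best_d = None, np.inf
        for p, n in self.voxels.get(self._key(point), []):
            if np.dot(n, normal) <= cos_min:   # opposite side of a thin wall: reject
                continue
            d = np.linalg.norm(p - point)
            if d < best_d:
                best, best_d = (p, n), d
        return best

m = VoxelMap()
m.insert(np.array([0.52, 0.11, 0.03]), np.array([0.0, 0.0, 1.0]))
print(m.match(np.array([0.50, 0.12, 0.02]), np.array([0.0, 0.0, -1.0])))  # None: back side
```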
|
|
17:10-17:15, Paper ThDT18.7 | |
LiV-GS: LiDAR-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments |
|
Xiao, Renxiang | Harbin Institute of Technology, Shenzhen |
Liu, Wei | Harbin Institute of Technology, Shenzhen |
Chen, YuShuai | Harbin Institute of Technology, Shenzhen |
Hu, Liang | Harbin Institute of Technology, Shenzhen |
Keywords: Mapping, SLAM, Range Sensing
Abstract: We present LiV-GS, a LiDAR-visual SLAM system for outdoor environments that leverages 3D Gaussians as a differentiable spatial representation. Notably, LiV-GS is the first method that directly aligns discrete and sparse LiDAR data with continuous differentiable Gaussian maps in large-scale outdoor scenes, overcoming the limitation of fixed resolution in traditional LiDAR mapping. The system aligns point clouds with Gaussian maps using shared covariance attributes for front-end tracking and integrates the normal orientation into the loss function to refine the Gaussian map. To reliably and stably update Gaussians outside the LiDAR field of view, we introduce a novel conditional Gaussian constraint that aligns these Gaussians closely with the nearest reliable ones. This targeted adjustment enables LiV-GS to achieve fast and accurate mapping with novel view synthesis at a rate of 7.98 FPS. Extensive comparative experiments demonstrate LiV-GS's superior performance in SLAM, image rendering, and mapping. The successful cross-modal radar-LiDAR localization highlights the potential of LiV-GS for applications in cross-modal semantic positioning and object segmentation with Gaussian maps.
|
|
17:15-17:20, Paper ThDT18.8 | |
SemGauss-SLAM: Dense Semantic Gaussian Splatting SLAM |
|
Zhu, Siting | Shanghai Jiao Tong University |
Qin, Renjie | Shanghai Jiao Tong University |
Wang, Guangming | University of Cambridge |
Liu, Jiuming | Shanghai Jiao Tong University |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Mapping, SLAM, Semantic Scene Understanding
Abstract: We propose SemGauss-SLAM, a dense semantic SLAM system utilizing 3D Gaussian representation, that enables accurate 3D semantic mapping, robust camera tracking, and high-quality rendering simultaneously. In this system, we incorporate semantic feature embedding into 3D Gaussian representation, which effectively encodes semantic information within the spatial layout of the environment for precise semantic scene representation. Furthermore, we propose feature-level loss for updating 3D Gaussian representation, enabling higher-level guidance for 3D Gaussian optimization. In addition, to reduce cumulative drift in tracking and improve semantic reconstruction accuracy, we introduce semantic-informed bundle adjustment. By leveraging multi-frame semantic associations, this strategy enables joint optimization of 3D Gaussian representation and camera poses, resulting in low-drift tracking and accurate semantic mapping. Our SemGauss-SLAM demonstrates superior performance over existing radiance field-based SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets, while also showing excellent capabilities in high-precision semantic segmentation and dense semantic mapping. Code will be available at https://github.com/IRMVLab/SemGauss-SLAM.
|
|
ThDT19 |
210C |
Aerial Systems: Applications 3 |
Regular Session |
|
16:40-16:45, Paper ThDT19.1 | |
Tendon-Driven Grasper Design for Aerial Robot Perching on Tree Branches |
|
Li, Haichuan | China University of Petroleum (East China) |
Zhao, Ziang | University of Bristol |
Wu, Ziniu | University of Bristol |
Potdar, Parth | University of Cambridge |
Tran, Ba Long | University of Bristol |
Karasahin, Ali Tahir | Necmettin Erbakan University |
Windsor, Shane | University of Bristol |
Burrow, Stephen | University of Bristol |
Kocer, Basaran Bahadir | Imperial College London |
Keywords: Aerial Systems: Applications
Abstract: Protecting and restoring forest ecosystems has become an important conservation issue. Although various robots have been used for field data collection to protect forest ecosystems, the complex terrain and dense canopy make data collection less efficient. To address this challenge, an aerial platform with bio-inspired behaviour facilitated by a bio-inspired mechanism is proposed. The platform spends minimal energy during data collection by perching on tree branches. A raptor-inspired vision algorithm is used to locate a tree trunk, and then a horizontal branch on which the platform can perch is identified. A tendon-driven mechanism inspired by bat claws, which requires energy only for actuation, secures the platform onto the branch using its passive compliance. Experimental results show that the mechanism can perform perching on branches ranging from 30 mm to 80 mm in diameter. The real-world tests validated the system’s ability to select and adapt to target points, and it is expected to be useful in complex forest ecosystems.
|
|
16:45-16:50, Paper ThDT19.2 | |
AnyTSR: Any-Scale Thermal Super-Resolution for UAV |
|
Li, Mengyuan | Tongji University |
Fu, Changhong | Tongji University |
Lu, Ziyu | Tongji University |
Zhang, Zijie | Tongji University |
Zuo, Haobo | University of Hong Kong |
Yao, Liangliang | Tongji University |
Keywords: Aerial Systems: Applications, Deep Learning for Visual Perception, Aerial Systems: Perception and Autonomy
Abstract: Thermal imaging can greatly enhance the application of intelligent unmanned aerial vehicles (UAV) in challenging environments. However, the inherent low resolution of thermal sensors leads to insufficient details and blurred boundaries. Super-resolution (SR) offers a promising solution to address this issue, while most existing SR methods are designed for fixed-scale SR. They are computationally expensive and inflexible in practical applications. To address the above issues, this work proposes a novel any-scale thermal SR method (AnyTSR) for UAV within a single model. Specifically, a new image encoder is proposed to explicitly assign specific feature code to enable more accurate and flexible representation. Additionally, by effectively embedding coordinate offset information into the local feature ensemble, an innovative any-scale upsampler is proposed to better understand spatial relationships and reduce artifacts. Moreover, a novel dataset (UAV-TSR), covering both land and water scenes, is constructed for thermal SR tasks. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods across all scaling factors and generates more accurate and detailed high-resolution images. The code is located at https://github.com/vision4robotics/AnyTSR.
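The coordinate-offset idea behind any-scale upsampling can be sketched with a toy decoder: every output pixel queries the nearest low-resolution feature together with its continuous offset from that feature's centre, so the same model serves arbitrary, non-integer scales. The untrained linear decoder below exists purely to show the interface and is not AnyTSR's architecture.

```python
import numpy as np

H, W, C = 32, 32, 64
feat = np.random.randn(H, W, C)                 # encoder features of a low-res thermal image
decoder = np.random.randn(C + 2)                # toy decoder: (feature, offset) -> intensity

def upsample(scale):
    Ho, Wo = int(H * scale), int(W * scale)
    out = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            # Continuous source coordinate of this output pixel.
            y = (i + 0.5) / scale - 0.5
            x = (j + 0.5) / scale - 0.5
            yi = int(np.clip(np.rint(y), 0, H - 1))
            xi = int(np.clip(np.rint(x), 0, W - 1))
            offset = np.array([y - yi, x - xi])  # offset from the nearest feature cell centre
            out[i, j] = np.concatenate([feat[yi, xi], offset]) @ decoder
    return out

print(upsample(1.7).shape, upsample(3.3).shape)  # works for arbitrary, non-integer scales
```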
|
|
16:50-16:55, Paper ThDT19.3 | |
Shape-Adaptive Planning and Control for a Deformable Quadrotor |
|
Wu, Yuze | Zhejiang University |
Han, Zhichao | Zhejiang University |
Wu, Xuankang | Northeastern University |
Zhou, Yuan | Zhejiang University |
Wang, Junjie | Zhejiang University |
Fang, Zheng | Northeastern University |
Gao, Fei | Zhejiang University |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Aerial Systems: Mechanics and Control
Abstract: Drones have become essential in various applications, but conventional quadrotors face limitations in confined spaces and complex tasks. Deformable drones, which can adapt their shape in real-time, offer a promising solution to overcome these challenges, while also enhancing maneuverability and enabling novel tasks like object grasping. This paper presents a novel approach to autonomous motion planning and control for deformable quadrotors. We introduce a shape-adaptive trajectory planner that incorporates deformation dynamics into path generation, using a scalable kinodynamic A* search to handle deformation parameters in complex environments. The backend spatio-temporal optimization is capable of generating optimally smooth trajectories that incorporate shape deformation. Additionally, we propose an enhanced control strategy that compensates for external forces and torque disturbances, achieving a 37.3% reduction in trajectory tracking error compared to our previous work. Our approach is validated through simulations and real-world experiments, demonstrating its effectiveness in narrow-gap traversal and multi-modal deformable tasks.
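A toy version of folding a deformation parameter into the search state, loosely in the spirit of the scalable kinodynamic search described above but far simpler: each node carries a discrete body width, narrow cells only admit sufficiently deformed states, and deforming is penalised in the edge cost. Grid sizes, widths, and costs are made-up illustration values.

```python
import heapq
import itertools
import numpy as np

grid_w = np.full((20, 20), 1.0)                   # max body width allowed in each cell
grid_w[8:12, 5:15] = 0.4                          # a narrow corridor
WIDTHS = [1.0, 0.7, 0.4]                          # discrete deformation levels
DEFORM_COST = {1.0: 0.0, 0.7: 0.5, 0.4: 1.5}      # flying deformed costs extra

def plan(start, goal):
    h = lambda s: abs(s[0] - goal[0]) + abs(s[1] - goal[1])
    tie = itertools.count()
    open_set = [(h(start), 0.0, next(tie), (*start, 1.0), None)]
    parents = {}
    while open_set:
        _, g, _, state, parent = heapq.heappop(open_set)
        if state in parents:
            continue
        parents[state] = parent
        if state[:2] == goal:                     # reconstruct the (x, y, width) sequence
            path = []
            while state is not None:
                path.append(state)
                state = parents[state]
            return path[::-1]
        x, y, w = state
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if not (0 <= nx < 20 and 0 <= ny < 20):
                continue
            for nw in WIDTHS:                     # optionally change shape on this step
                if nw > grid_w[nx, ny]:           # body too wide for this cell
                    continue
                ng = g + 1.0 + DEFORM_COST[nw]
                heapq.heappush(open_set, (ng + h((nx, ny)), ng, next(tie),
                                          (nx, ny, nw), state))
    return None

print(plan((0, 0), (19, 19))[:5], "...")
```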
|
|
16:55-17:00, Paper ThDT19.4 | |
A Bioinspired Framework for Person Detection and Tracking Using Events and Frames on Flapping-Wing Aerial Robots |
|
Tapia, Raul | University of Seville |
Ijjeh, Abdalraheem | University of Seville, GRVC Lab |
Martinez-de Dios, J.R. | University of Seville |
Ollero, Anibal | AICIA. G41099946 |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Human Detection and Tracking
Abstract: Flapping-wing aerial robots offer significant advantages over conventional multirotors, including lower noise signatures, higher energy efficiency, and enhanced maneuverability. Despite these benefits, their application in surveillance, particularly person detection and tracking, remains largely underexplored. This paper proposes a bioinspired framework for person detection and tracking, specifically designed for flapping-wing aerial robots. Drawing inspiration from the dual pathways in biological vision, our method integrates an event-by-event blob tracker with a more accurate but slower frame-based detector. The event-based tracker leverages the high temporal resolution and robustness to motion blur of event cameras, effectively compensating for the strong vibrations caused by the flapping strokes of these robots. The frame-based detection (implemented using a deep neural network) periodically corrects and enhances the event-based tracking estimates, globally achieving a balanced trade-off between accuracy, responsiveness, and computational cost. Evaluation with both multirotor and flapping-wing aerial robots validates the effectiveness and efficiency of the approach.
|
|
17:00-17:05, Paper ThDT19.5 | |
Partial Feedback Linearization Control of a Cable-Suspended Multirotor Platform for Stabilization of an Attached Load |
|
Das, Hemjyoti | Technical University of Vienna |
Ott, Christian | TU Wien |
Keywords: Aerial Systems: Applications, Robotics and Automation in Construction, Aerial Systems: Mechanics and Control
Abstract: In this work, we present a novel control approach based on partial feedback linearization (PFL) for the stabilization of a suspended aerial platform with an attached load. Such systems are envisioned for various applications in construction sites involving cranes, such as the holding and transportation of heavy objects. Our proposed control approach considers the underactuation of the whole system while utilizing its coupled dynamics for stabilization. We demonstrate using numerical stability analysis that these coupled terms are crucial for the stabilization of the complete system. We also carried out a robustness analysis of the proposed approach in the presence of external wind disturbances, sensor noise, and uncertainties in system dynamics. As our envisioned target application involves cranes in outdoor construction sites, our control approach relies only on onboard sensors, making it suitable for such applications. We carried out extensive simulation studies and experimental tests to validate our proposed control approach.
|
|
17:05-17:10, Paper ThDT19.6 | |
View-Aware Decomposition and Unification for Fast Ground-To-Aerial Person Search |
|
Wang, Qifei | Beijing Institute of Technology |
Zhang, Pengcheng | Beihang University |
Yu, Xiaohan | Macquarie University |
Bai, Xiao | Beihang University |
Gao, Yongsheng | Griffith University |
Keywords: Aerial Systems: Applications, Representation Learning, Search and Rescue Robots
Abstract: Ground-to-aerial person search leverages cooperative efforts between unmanned aerial vehicles (UAV) and ground surveillance cameras to locate individuals. Despite the progress made by recent works, the impact of the discrepancy between the two views is underestimated. This limits the overall person search performance when training the model in a view-agnostic way. To address this, we propose a view-aware decomposition and unification (VADU) framework for ground-to-aerial person search. Specifically, we decompose the person search model to learn view-oriented modules for image feature encoding and person proposal generation. The data sampling and retrieval feature learning are likewise adapted to the decomposed model. This decomposition improves both person detection and discriminative feature learning within each view. On top of the decomposition, we propose view-aware unification to produce unified cross-view person features. Cross-view prototypical contrastive learning is introduced to enhance the unification between different views, improving model robustness when retrieving a target person in cameras of a different view. As the decomposed parts of the model are deployed on different devices for inference, the overall framework adds no extra computation cost in real-world applications. Extensive experiments demonstrate that the proposed method achieves superior person search performance and guarantees the efficiency of inference. The source code is available at https://github.com/QFWang-11/vadu.
|
|
17:10-17:15, Paper ThDT19.7 | |
Vision-Based Cooperative MAV-Capturing-MAV |
|
Zheng, Canlun | Westlake University |
Mi, Yize | Westlake University |
Guo, Hanqing | Westlake University |
Chen, Huaben | Westlake University |
Zhao, Shiyu | Westlake University |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Multi-Robot Systems
Abstract: MAV-capturing-MAV (MCM) is one of the few effective ways to counter misused or malicious MAVs physically. In this paper, we developed a vision-based cooperative MCM system where multiple pursuer MAVs use onboard vision systems to detect, locate, and pursue a target MAV. To ensure robustness, a distributed state estimation and control framework is implemented, enabling the pursuers to coordinate their actions autonomously. Once the capture conditions (defined by the relative states of the pursuer and target) are met, the pursuit MAVs automatically deploy a flying net to intercept the target. Instead of solving the full net dynamics, which is computationally expensive, we introduce a real-time method to approximate the motion of the net, significantly reducing computational complexity. The pursuers’ trajectories are optimized using model predictive control (MPC) and executed via a low-level SO(3) controller. Both simulations and real-world experiments validate the proposed system. In real-world tests, our approach successfully captures a moving target traveling at 4 m/s with an acceleration of 1 m/s², achieving a success rate of 64.7%.
|
|
17:15-17:20, Paper ThDT19.8 | |
Automatic Generation of Aerobatic Flight in Complex Environments Via Diffusion Models |
|
Zhong, Yuhang | Zhejiang University |
Zhao, Anke | Zhejiang University |
Wu, Tianyue | Zhejiang University |
Zhang, Tingrui | Zhejiang University |
Gao, Fei | Zhejiang University |
Keywords: Aerial Systems: Applications, AI-Based Methods, Motion and Path Planning
Abstract: Performing striking aerobatic flight in complex environments demands manual designs of key maneuvers in advance, which is intricate and time-consuming as the horizon of the trajectory performed becomes long. This paper presents a novel framework that leverages diffusion models to automate and scale up aerobatic trajectory generation. Our key innovation is the decomposition of complex maneuvers into aerobatic primitives, which are short frame sequences that act as building blocks, featuring critical aerobatic behaviors for tractable trajectory synthesis. The model learns aerobatic primitives using historical trajectory observations as dynamic priors to ensure motion continuity, with additional conditional inputs (target waypoints and optional action constraints) integrated to enable user-editable trajectory generation. During model inference, classifier guidance is incorporated with batch sampling to achieve obstacle avoidance. Additionally, the generated outcomes are refined through post-processing with spatial-temporal trajectory optimization to ensure dynamical feasibility. Extensive simulations and real-world experiments have validated the key component designs of our method, demonstrating its feasibility for deploying on real drones to achieve long-horizon aerobatic flight.
|
|
ThDT20 |
210D |
Perception for Grasping and Manipulation 4 |
Regular Session |
|
16:45-16:50, Paper ThDT20.2 | |
One-Shot Affordance Grounding of Deformable Objects in Egocentric Organizing Scenes |
|
Jia, Wanjun | Hunan University |
Yang, Fan | Hunan University |
Duan, Mengfei | Hunan University |
Chen, XianChi | Hunan University |
Wang, Yinxi | Hunan University |
Jiang, Yiming | Hunan University |
Chen, Wenrui | Hunan University |
Yang, Kailun | Hunan University |
Li, Zhiyong | Hunan University |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Visual Learning
Abstract: Deformable object manipulation in robotics presents significant challenges due to uncertainties in component properties, diverse configurations, visual interference, and ambiguous prompts. These factors complicate both perception and control tasks. To address these challenges, we propose a novel method for One-Shot Affordance Grounding of Deformable Objects (OS-AGDO) in egocentric organizing scenes, enabling robots to recognize previously unseen deformable objects with varying colors and shapes using minimal samples. Specifically, we first introduce the Deformable Object Semantic Enhancement Module (DefoSEM), which enhances hierarchical understanding of the internal structure and improves the ability to accurately identify local features, even under conditions of weak component information. Next, we propose the ORB Enhanced Keypoint Fusion Module (OEKFM), which optimizes feature extraction of key components by leveraging geometric constraints and improves adaptability to diversity and visual interference. Additionally, we propose an instance-conditional prompt based on image data and task context, which effectively mitigates the issue of region ambiguity caused by prompt words. To validate these methods, we construct a diverse real-world dataset, AGDDO15, which includes 15 common types of deformable objects and their associated organizational actions. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods, achieving improvements of 6.2%, 3.2%, and 2.9% in KLD, SIM, and NSS metrics, respectively, while exhibiting high generalization performance. Source code and benchmark dataset are made publicly available at https://github.com/Dikay1/OS-AGDO.
|
|
16:50-16:55, Paper ThDT20.3 | |
Uni-Zipper: A Multi-Modal Perception Framework of Deformable Objects with Unpaired Data |
|
Xie, Qian | Tongji University |
Zhou, Yanmin | Tongji University |
Wang, Wei | Tongji University |
Jin, Yiyang | Tongji University |
Wang, Zhipeng | Tongji University |
Jiang, Rong | Tongji University |
Li, Xin | Tongji University |
Sang, HongRui | Tongji University |
He, Bin | Tongji University, Shanghai, China |
Keywords: Perception for Grasping and Manipulation, Haptics and Haptic Interfaces, Deep Learning in Grasping and Manipulation
Abstract: Multi-modal perception plays a crucial role in preventing deformation and damage during the robotic manipulation of deformable objects. However, integrating new heterogeneous modalities into existing robotic perception frameworks remains a significant challenge, primarily due to the need for massive amounts of paired data. In this paper, we propose Uni-Zipper, a scalable multi-modal fusion framework designed to expand new modalities with the help of semantic enhancement without relying on paired data. Uni-Zipper consists of a tokenizer that projects various modalities into a shared embedding space, a summary word embedding layer with a feature dictionary, a modality alignment space, and dynamic reconfigurable task heads. To facilitate efficient integration and extension of new modalities, the Zipper alignment mechanism is employed, effectively bridging the modality gap between different input types. Our experimental results demonstrate that Uni-Zipper successfully fuses four modalities and enhances performance in downstream tasks. Despite a 12% decrease in parameter count, Uni-Zipper maintains comparable performance.
|
|
16:55-17:00, Paper ThDT20.4 | |
Generalizable Category-Level Topological Structure Learning for Clothing Recognition in Robotic Grasping |
|
Zhu, Xingyu | Jilin University |
Wu, Yan | A*STAR Institute for Infocomm Research |
Tu, Zhiwen | Jilin University |
Zhong, Haifeng | Jilin University |
Gao, Yixing | Jilin University |
Keywords: Object Detection, Segmentation and Categorization, Perception for Grasping and Manipulation
Abstract: Recognizing various types of clothing is crucial for robotic clothing manipulation tasks, such as garment organization and robot-assisted dressing. Unlike rigid object recognition, clothing recognition remains a challenging task due to the diverse forms introduced by flexible deformations. However, existing classification models primarily focus on clothing color and texture while overlooking structural features, limiting their ability to distinguish between deformable clothing categories with similar color and texture. Moreover, due to the insufficient representation of structural features, these models heavily rely on manually annotated labels, making it difficult to accurately recognize unseen clothing items with new colors or textures. To address these challenges, we propose a novel topological structure representation and optimization strategy for category-level clothing structural feature learning. Additionally, we design a multi-clothing classification framework based on multiple mask generation to identify clothing regions within a scene. By leveraging our proposed structural feature learning strategy, our framework effectively generalizes to unseen clothing items. Finally, we introduce a fabric-specific grasping position estimation method and develop a corresponding robotic grasping system capable of selecting and grasping specified clothing items based on user instructions. Extensive real-world robotic experiments demonstrate the effectiveness of our system, and comprehensive comparisons with multiple baselines further validate the superiority of our approach.
|
|
17:00-17:05, Paper ThDT20.5 | |
Self-Supervised Complementary Learning between Vision and Tactility by Probing Action into an Open-Mouth Container |
|
Takamori, Daiki | Shizuoka University |
Hayakawa, Tomohiro | Shizuoka University |
Kobayashi, Yuichi | Shizuoka University |
Hara, Kosuke | Sumitomo Heavy Industries, Ltd |
Usui, Dotaro | Sumitomo Heavy Industries, Ltd |
Keywords: Perception-Action Coupling, Probabilistic Inference, Recognition
Abstract: Plastic bags are challenging objects for robot manipulation due to transparency and deformability. This paper proposes a learning approach for a robot to insert its hand into a container- or bag-shaped object based on visual and tactile sensing. The basic idea is to utilize probing actions that allow the robot to acquire rich information about the object even with a simple tactile sensor. The structure of the object is estimated by unsupervised learning with contact and reachability information. The result is transferred to visual recognition as self-supervised learning. Based on the unsupervised learning result, the robot can verify whether the hand has truly reached the interior of the bag through additional probing actions. The proposed method was evaluated experimentally using a robot hand with a simple tactile sensor.
|
|
17:05-17:10, Paper ThDT20.6 | |
PartGrasp: Generalizable Part-Level Grasping Via Semantic-Geometric Alignment |
|
Lu, Haoyang | Beijing Institute of Technology |
Yang, Chengcai | Beijing Institute of Technology |
Chen, Guangyan | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Perception-Action Coupling, Physical Human-Robot Interaction
Abstract: The ability to perform generalizable and precise grasping on functional object parts is a prerequisite for robotic manipulation in open environments. Recent foundation models have demonstrated promising semantic correspondence capabilities in guiding robots to grasp similar parts across objects with resembling shapes and poses. However, existing works struggle to generalize precise grasp poses when the target objects exhibit substantial geometric and positional variations. To tackle this challenge, we present PartGrasp, a method that achieves precise part grasping through hierarchical integration of highly generalizable semantic correspondence and precise geometric registration. Specifically, we first build a grasp knowledge bank by extracting grasp poses and object meshes from demonstrations. Upon retrieving a reference from this bank, we initially perform a coarse alignment using semantic correspondence, followed by a fine registration that adapts to geometric variations. This approach achieves fine-grained generalization of part grasping that is robust to both shape and pose variations. Extensive experiments demonstrate the efficacy of our method in terms of both generalization capability and accuracy. Videos and more details are available on our project site: https://part-grasp.github.io/partgrasp/.
|
|
17:10-17:15, Paper ThDT20.7 | |
RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation |
|
Yang, Liudi | University of Freiburg |
Bai, Yang | Ludwig Maximilian University of Munich |
Eskandar, George | University of Stuttgart |
Shen, Fengyi | Technical University of Munich |
Altillawi, Mohammad | Huawei, Autonomous University of Barcelona, |
Chen, Dong | Technische Universität München |
Majumder, Soumajit | Huawei |
Liu, Ziyuan | Huawei Group |
Kutyniok, Gitta | The Ludwig Maximilian University of Munich |
Valada, Abhinav | University of Freiburg |
Keywords: Perception for Grasping and Manipulation, Manipulation Planning, Visual Learning
Abstract: We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulations in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each pair of generated keyframes, achieving the long-horizon video. 2) We propose a semantics-preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.
|
|
ThDT21 |
101 |
Machine Learning for Robot Control 4 |
Regular Session |
|
16:40-16:45, Paper ThDT21.1 | |
Learning to Exploit Leg Odometry Enables Terrain-Aware Quadrupedal Locomotion |
|
Zhou, Yong | Wuhan University |
Jiang, Jiawei | Wuhan University |
Du, Bo | Wuhan University |
Wang, Zengmao | Wuhan University |
Keywords: Machine Learning for Robot Control, Legged Robots, Perception-Action Coupling
Abstract: The geometry of terrain is crucial for developing terrain-aware locomotion policies. Recent advancements in quadrupedal locomotion based on learning rely on depth information obtained from LiDARs and depth cameras. Despite the capabilities of these locomotion policies on terrains, they pose challenges in processing high-dimensional data in real time with onboard hardware. In this study, we develop a lightweight framework that utilizes only the intrinsic sensors of a quadrupedal robot to facilitate terrain-aware locomotion. We introduce a learning-based leg odometry integrated with a locomotion policy trained through reinforcement learning. Utilizing blind localization from leg odometry alongside a pre-constructed height map enables the robot to navigate steps and stairs without incident. We assess the efficacy of our framework through simulations, where our results indicate that the robot achieves up to a 17% improvement in successful traversal rates and requires fewer point samples. By compensating for slippage occurring during locomotion, our learning-based leg odometry surpasses traditional inertial-leg odometry. Lastly, we validate the practical applicability of our models on a real robot, confirming their effectiveness in real-world settings.
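The terrain-aware observation pipeline this abstract describes can be sketched as a height-map lookup driven by a blind pose estimate: leg odometry supplies an (imperfect) position, and a patch of the pre-constructed height map around that position is handed to the policy. Shapes, resolution, and the noise model below are assumptions for illustration.

```python
import numpy as np

RES = 0.05                                       # height-map resolution [m]
height_map = np.zeros((400, 400))                # pre-constructed global height map
height_map[200:, :] = 0.15                       # a 15 cm step in front of the robot

def terrain_observation(xy_estimate, patch=11):
    """Local grid of terrain heights centred on the estimated base position."""
    ci, cj = (xy_estimate / RES).astype(int) + 200
    half = patch // 2
    local = height_map[ci - half:ci + half + 1, cj - half:cj + half + 1]
    return (local - local.mean()).ravel()        # policy sees relative heights

# Leg odometry drifts and slips; training with noise on the estimate keeps the
# observation useful despite slippage (illustrative noise level).
pose_estimate = np.array([0.35, -0.10]) + np.random.normal(0, 0.02, 2)
obs = terrain_observation(pose_estimate)
print(obs.shape)                                  # (121,) terrain features for the policy
```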
|
|
16:45-16:50, Paper ThDT21.2 | |
Autotuning Bipedal Locomotion MPC with GRFM-Net for Efficient Sim-To-Real Transfer |
|
Chen, Qianzhong | Stanford University |
Li, Junheng | University of Southern California |
Cheng, Sheng | University of Illinois Urbana-Champaign |
Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
Nguyen, Quan | University of Southern California |
Keywords: Model Learning for Control, Humanoid and Bipedal Locomotion, Legged Robots
Abstract: Bipedal locomotion control is essential for humanoid robots to navigate complex, human-centric environments. While optimization-based control designs are popular for integrating sophisticated models of humanoid robots, they often require labor-intensive manual tuning. In this work, we address the challenges of parameter selection in bipedal locomotion control using DiffTune, a model-based autotuning method that leverages differential programming for efficient parameter learning. A major difficulty lies in balancing model fidelity with differentiability. We address this difficulty using a low-fidelity model for differentiability, enhanced by a Ground Reaction Force-and-Moment Network (GRFM-Net) to capture discrepancies between MPC commands and actual control effects. We validate the parameters learned by DiffTune with GRFM-Net in hardware experiments, which demonstrates the parameters' optimality in a multi-objective setting compared with baseline parameters, reducing the total loss by up to 40.5% compared with the expert-tuned parameters. The results confirm the GRFM-Net's effectiveness in mitigating the sim-to-real gap, improving the transferability of simulation-learned parameters to real hardware.
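A minimal sketch of the autotuning idea on a toy system, assuming a DiffTune-style loop: roll out a differentiable low-fidelity model under the current control parameters, measure tracking loss, and update the parameters by backpropagation. The GRFM-Net residual correction and the actual bipedal MPC are not modelled here.

```python
import torch

dt, T = 0.02, 200
log_gains = torch.tensor([1.0, 0.0], requires_grad=True)   # log kp, log kd
opt = torch.optim.Adam([log_gains], lr=0.05)
target = torch.tensor(1.0)

for step in range(100):
    kp, kd = torch.exp(log_gains)                 # keep gains positive
    x = torch.zeros(())
    v = torch.zeros(())
    loss = torch.zeros(())
    for _ in range(T):
        u = kp * (target - x) - kd * v            # parameterised controller
        v = v + dt * u                            # low-fidelity, differentiable model
        x = x + dt * v
        loss = loss + (target - x) ** 2 * dt      # tracking loss accumulated over the rollout
    opt.zero_grad()
    loss.backward()                               # gradients flow through the whole rollout
    opt.step()

print("tuned gains:", torch.exp(log_gains).detach().numpy())
```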
|
|
16:50-16:55, Paper ThDT21.3 | |
Probabilistic Motion Model Learning for Tendon Actuated Continuum Robots with Backlash |
|
Chaari, Mahdi | ICube Laboratory, CNRS, University of Strasbourg |
Zanne, Philippe | University of Strasbourg |
Nageotte, Florent | University of Strasbourg |
Keywords: Model Learning for Control, Modeling, Control, and Learning for Soft Robots, Medical Robots and Systems
Abstract: In this paper, we propose a probabilistic motion model for tendon actuated continuum robots that experience actuation transmission non-linearities due to cable slack and cable-sheath friction. The model is based on a Lie group formulation of the robot's end-effector pose that incorporates a new simple backlash model. Bayesian parameter estimation is then employed to learn a probability distribution over the model's parameters. This allows the uncertainty over the parameters to be propagated in the prediction of the end-effector's trajectory. The model's predictive capabilities are compared against the static Cosserat-rod-based model and the Kirchhoff model in simulation and are validated with experiments on a robotized medical endoscope.
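The backlash and uncertainty-propagation ingredients can be illustrated with a toy model: a deadband backlash element between the actuator command and the effective tendon displacement, and posterior samples over (backlash width, bending gain) pushed through the model to obtain a predictive band on the tip motion. This is a hand-made stand-in, not the paper's Lie-group formulation or its Bayesian estimator.

```python
import numpy as np

def backlash(cmd, width, state):
    """Deadband backlash: the output only follows the command after the gap closes."""
    out = np.empty_like(cmd)
    for i, c in enumerate(cmd):
        if c > state + width / 2:
            state = c - width / 2
        elif c < state - width / 2:
            state = c + width / 2
        out[i] = state
    return out

cmd = np.sin(np.linspace(0, 4 * np.pi, 400))          # actuator command profile

# Synthetic posterior samples of (backlash width, bending gain), e.g. from Bayesian fitting.
posterior = np.column_stack([np.random.normal(0.3, 0.05, 100),
                             np.random.normal(1.2, 0.10, 100)])

tips = np.array([gain * backlash(cmd, max(w, 0.0), 0.0)
                 for w, gain in posterior])           # predicted tip deflection per sample
mean, std = tips.mean(axis=0), tips.std(axis=0)       # predictive mean and uncertainty band
print(mean[-1], std[-1])
```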
|
|
16:55-17:00, Paper ThDT21.4 | |
Bounding Distributional Shifts in World Modeling through Novelty Detection |
|
Jing, Eric | Rutgers University |
Boularias, Abdeslam | Rutgers University |
Keywords: Model Learning for Control, Visual Learning
Abstract: Recent work on visual world models shows significant promise in latent state dynamics obtained from pre-trained image backbones. However, most of the current approaches are sensitive to training quality, requiring near-complete coverage of the action and state space during training to prevent divergence during inference. To make a model-based planning algorithm more robust to the quality of the learned world model, we propose in this work to use a variational autoencoder as a novelty detector to ensure that proposed action trajectories during planning do not cause the learned model to deviate from the training data distribution. To evaluate the effectiveness of this approach, a series of experiments in challenging simulated robot environments was carried out, with the proposed method incorporated into a model-predictive control policy loop extending the DINO-WM architecture. The results clearly show that the proposed method improves over state-of-the-art solutions in terms of data efficiency.
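The gating mechanism can be sketched as follows: each candidate action rollout's predicted states are scored by a VAE novelty measure (reconstruction error plus KL), and rollouts whose worst step exceeds a threshold calibrated on training data are discarded before the planner picks an action. The tiny untrained VAE and the threshold value below are placeholders, not the actual novelty detector.

```python
import torch
import torch.nn as nn

LATENT = 32

class TinyVAE(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * LATENT)
        self.dec = nn.Linear(LATENT, dim)

    def novelty(self, z_states):                       # (T, dim) predicted latent states
        mu, logvar = self.enc(z_states).chunk(2, dim=-1)
        recon = self.dec(mu)
        rec_err = ((recon - z_states) ** 2).mean(dim=-1)
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).mean(dim=-1)
        return (rec_err + kl).max()                    # worst step along the rollout

vae = TinyVAE()
threshold = 5.0                                        # would be calibrated on training rollouts

def filter_candidates(rollouts):
    """Drop candidate rollouts whose predicted states look out-of-distribution."""
    return [r for r in rollouts if vae.novelty(r) < threshold]

candidates = [torch.randn(15, 128) for _ in range(64)]   # world-model rollouts (dummy)
print(len(filter_candidates(candidates)), "of 64 candidates kept")
```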
|
|
17:00-17:05, Paper ThDT21.5 | |
Iterative Learning Motion Control of Continuum Robots Based on Neural Ordinary Differential Equations |
|
Liang, Zhenhan | Sun Yat-Sen University |
Yu, Peng | Sun Yat-Sen University |
Tan, Ning | Sun Yat-Sen University |
Keywords: Model Learning for Control, Redundant Robots, Neural and Fuzzy Control
Abstract: Traditional data-driven control methods often require large amounts of training data, posing significant challenges for continuum robots. Recently, neural ordinary differential equation (NODE) methods have demonstrated impressive capabilities for data-efficient modeling of continuum robots. However, existing NODE-based control methods still face limitations in terms of convergence and robustness. In this paper, we propose a data-driven iterative learning control system for continuum robots, leveraging NODE for modeling. Within this framework, by incorporating online parameter learning, the proposed control system continuously adapts to various uncertainties associated with continuum robots, resulting in improved convergence and robustness in repetitive tasks. The effectiveness of the proposed method is validated through simulations and physical experiments, and comparative analysis highlights its superior accuracy over existing approaches.
|
|
17:05-17:10, Paper ThDT21.6 | |
Dom, Cars Don't Fly!---Or Do They? In-Air Vehicle Maneuver for High-Speed Off-Road Navigation |
|
Pokhrel, Anuj | George Mason University |
Datar, Aniket | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Model Learning for Control, Dynamics, Motion and Path Planning
Abstract: When pushing the speed limit for aggressive off-road navigation on uneven terrain, it is inevitable that vehicles may become airborne from time to time. During time-sensitive tasks, being able to fly over challenging terrain can also save time, instead of cautiously circumventing or slowly negotiating through. However, most off-road autonomy systems operate under the assumption that the vehicles are always on the ground and therefore limit operational speed. In this paper, we present a novel approach for in-air vehicle maneuver during high-speed off-road navigation. Based on a hybrid forward kinodynamic model using both physics principles and machine learning, our fixed-horizon, sampling-based motion planner ensures accurate vehicle landing poses and their derivatives within a short airborne time window using vehicle throttle and steering commands. We test our approach in extensive in-air experiments both indoors and outdoors, compare it against an error-driven control method, and demonstrate that precise and timely in-air vehicle maneuver is possible through existing ground vehicle controls.
|
|
17:10-17:15, Paper ThDT21.7 | |
Transformer-Based Motion Model for Robust Target Tracking under Intermittent and Noisy Measurements |
|
Pulido, Andres Pulido | University of Florida |
Volle, Kyle | National Research Council Postdoctoral Program |
Bell, Zachary | Department of Defense |
Shin, Jaejeong | University of Florida |
Keywords: Model Learning for Control, Surveillance Robotic Systems, AI-Based Methods
Abstract: Target tracking under intermittent measurements is a fundamental challenge in autonomous systems. Traditional methods, including Kalman filters and deep learning-based models, often struggle when faced with sparse observations and high measurement noise. In this work, we present multiple transformer-based motion models designed to learn target dynamics from noisy sensor measurements, even across occluded portions of the trajectory. By leveraging self-attention mechanisms, these models effectively capture temporal dependencies and infer motion trajectories under uncertainty. We evaluate various architectural formulations, including time-encoded position inputs to better handle occlusions. These learned motion models are then integrated with a particle filter for target estimation and with an information-driven planner to guide the tracking agent. Since the models influence the guidance logic through their predictions, we assess their effectiveness based on overall target tracking performance. Extensive simulation and hardware experiments demonstrate that our approach improves tracking accuracy and robustness compared to existing methods.
|
|
17:15-17:20, Paper ThDT21.8 | |
Power Balance-Based Recursive Composite Learning Robot Control with Reduced Computational Burden |
|
Shi, Tian | Sun Yat-Sen University |
Zhu, Yuejiang | Sun Yat-Sen University |
Li, Weibing | Sun Yat-Sen University |
Pan, Yongping | Peng Cheng Laboratory |
Keywords: Model Learning for Control, Robust/Adaptive Control, Motion Control
Abstract: To enhance robustness against noise resulting from velocity measurement and acceleration estimation in robot online identification and adaptive control, the robot dynamics should be filtered and parameterized to generate a filtered regression matrix regarding identifiable parameters. However, generating a filtered regression matrix is complicated for robots with high degrees of freedom (DoFs). The power balance model (PBM) of robots with spatial notations stands out as an effective option for online applications owing to its simplicity in generating an easily computed and acceleration-free filtered regression vector. This paper proposes a PBM-based recursive composite learning robot control (RCLRC) method to enhance parameter convergence so as to boost tracking control. Based on the PBM, a filtered regressor with a computational complexity of O(n) (instead of O(n^2) to O(n^4) for its dynamic model-based counterpart) is employed to calculate an excitation matrix, and a generalized regression equation for composite parameter update is normalized to provide more uniform convergence rates across all parameter components. Experiments on a 7-DoF robot manipulator have shown that the proposed PBM-RCLRC outperforms state-of-the-art methods on parameter estimation and tracking control.
|
|
ThDT22 |
102A |
Collision Avoidance 2 |
Regular Session |
|
16:40-16:45, Paper ThDT22.1 | |
Energy-Efficient Obstacle Avoidance Via Iterative B-Spline Optimization for a Mobile Manipulator in Dynamic Environments |
|
Song, Kai-Tai | National Yang Ming Chiao Tung University |
Hsieh, Yuan-Shuo | National Yang Ming Chiao Tung University |
Keywords: Collision Avoidance, Motion and Path Planning, Energy and Environment-Aware Automation
Abstract: This paper proposes an energy-efficient collision avoidance system for an autonomous mobile manipulator (AMM) in dynamic environments. The system enables safe obstacle avoidance and energy-efficient path generation. A B-Spline-based path optimization algorithm is developed, incorporating an energy cost function for planning collision-free, energy-efficient paths. The Bi-RRT algorithm generates an initial path, segmenting it at curvature maxima for energy-efficient optimization. The local path planning creates a path with seven control points, applying the same optimization for safe, low-energy avoidance of dynamic obstacles. A Model Predictive Control (MPC) system ensures precise path following. Experiments with a TM5M-900 robot validate the method's effectiveness in avoiding obstacles. Compared to Bi-RRT and Hybrid-RRT systems, it improves motion smoothness by 29.35% and reduces energy consumption by 16.84%, providing a more efficient solution for an AMM in dynamic environments.
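A rough sketch of energy-aware B-spline path optimisation under simplifying assumptions: intermediate control points of a clamped cubic B-spline are optimised to minimise a smoothness/energy proxy (squared accelerations) plus a soft obstacle-clearance penalty, with the endpoints fixed. The cost weights and the energy proxy are illustrative, not the paper's energy model, and the Bi-RRT initialisation and MPC tracking are omitted.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

start, goal = np.array([0.0, 0.0]), np.array([2.0, 1.0])
obstacle, radius = np.array([1.0, 0.5]), 0.3
n_ctrl, degree = 7, 3
# Clamped knot vector for a cubic B-spline with 7 control points.
knots = np.concatenate([[0] * degree, np.linspace(0, 1, n_ctrl - degree + 1), [1] * degree])
ts = np.linspace(0, 1, 100)

def cost(flat):
    ctrl = np.vstack([start, flat.reshape(-1, 2), goal])     # endpoints stay fixed
    path = BSpline(knots, ctrl, degree)(ts)
    vel = np.diff(path, axis=0)
    acc = np.diff(vel, axis=0)
    energy = (acc ** 2).sum()                                 # smoothness / energy proxy
    clearance = np.linalg.norm(path - obstacle, axis=1)
    collision = np.maximum(radius + 0.1 - clearance, 0.0).sum()
    return energy + 50.0 * collision + 0.1 * (vel ** 2).sum()

x0 = np.linspace(start, goal, n_ctrl)[1:-1].ravel()          # initial guess: straight line
res = minimize(cost, x0, method="L-BFGS-B")
print("optimised control points:\n", res.x.reshape(-1, 2))
```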
|
|
16:45-16:50, Paper ThDT22.2 | |
Radar-Based NLoS Pedestrian Localization for Darting-Out Scenarios Near Parked Vehicles with Camera-Assisted Point Cloud Interpretation |
|
Kim, Hee-Yeun | Seoul National University |
Park, Byeonggyu | Seoul National University |
Choi, Byonghyok | Samsung Electro-Mechanics |
Cho, Hansang | Samsung Electro-Mechanics |
Kim, Byungkwan | Chungnam National University |
Lee, Soomok | Ajou University |
Jeon, Mingu | Seoul National University |
Seo, Seung-Woo | Seoul National University |
Kim, Seong-Woo | Seoul National University |
Keywords: Collision Avoidance, Intelligent Transportation Systems, Sensor Fusion
Abstract: The presence of Non-Line-of-Sight (NLoS) blind spots resulting from roadside parking in urban environments poses a significant challenge to road safety, particularly due to the sudden emergence of pedestrians. mmWave technology leverages diffraction and reflection to observe NLoS regions, and recent studies have demonstrated its potential for detecting obscured objects. However, existing approaches predominantly rely on predefined spatial information or assume simple wall reflections, thereby limiting their generalizability and practical applicability. A particular challenge arises in scenarios where pedestrians suddenly appear from between parked vehicles, as these parked vehicles act as temporary spatial obstructions. Furthermore, since parked vehicles are dynamic and may relocate over time, spatial information obtained from satellite maps or other predefined sources may not accurately reflect real-time road conditions, leading to erroneous sensor interpretations. To address this limitation, we propose an NLoS pedestrian localization framework that integrates monocular camera image with 2D radar point cloud (PCD) data. The proposed method initially detects parked vehicles through image segmentation, estimates depth to infer approximate spatial characteristics, and subsequently refines this information using 2D radar PCD to achieve precise spatial inference. Experimental evaluations conducted in real-world urban road environments demonstrate that the proposed approach enhances early pedestrian detection and contributes to improved road safety. Supplementary materials are available at https://hiyeun.github.io/NLoS/.
|
|
16:50-16:55, Paper ThDT22.3 | |
Capsizing-Guided Trajectory Optimization for Autonomous Navigation with Rough Terrain |
|
Zhang, Wei | Shandong University |
Wang, Yinchuan | Shandong University |
Lu, Wangtao | Zhejiang University |
Zhang, Pengyu | Shandong University |
Zhang, Xiang | School of Control Science and Engineering, Shandong University |
Wang, Yue | Zhejiang University |
Wang, Chaoqun | Shandong University |
Keywords: Collision Avoidance, Wheeled Robots, Autonomous Vehicle Navigation
Abstract: It is a challenging task for ground robots to autonomously navigate in harsh environments due to the presence of non-trivial obstacles and uneven terrain. This requires trajectory planning that balances safety and efficiency. The primary challenge is to generate a feasible trajectory that prevents the robot from tipping over while ensuring effective navigation. In this paper, we propose a capsizing-aware trajectory planner (CAP) to achieve trajectory planning on uneven terrain. The tip-over stability of the robot on rough terrain is analyzed. Based on the tip-over stability, we define the traversable orientation, which indicates the safe range of robot orientations. This orientation is then incorporated into a capsizing-safety constraint for trajectory optimization. We employ a graph-based solver to compute a robust and feasible trajectory while adhering to the capsizing-safety constraint. Extensive simulation and real-world experiments validate the effectiveness and robustness of the proposed method. The results demonstrate that CAP outperforms existing state-of-the-art approaches, providing enhanced navigation performance on uneven terrain.
|
|
16:55-17:00, Paper ThDT22.4 | |
Dynamic Obstacle Avoidance through Uncertainty-Based Adaptive Planning with Diffusion |
|
Punyamoorty, Vineet | Purdue University |
Jutras-Dube, Pascal | Purdue University |
Zhang, Ruqi | Purdue University |
Aggarwal, Vaneet | Purdue University |
Conover, Damon | DEVCOM Army Research Laboratory |
Bera, Aniket | Purdue University |
Keywords: Collision Avoidance, Motion and Path Planning, Deep Learning Methods
Abstract: By framing reinforcement learning as a sequence modeling problem, recent work has enabled the use of generative models, such as diffusion models, for planning. While these models are effective in predicting long-horizon state trajectories in deterministic environments, they face challenges in dynamic settings with moving obstacles. Effective collision avoidance demands continuous monitoring and adaptive decision-making. While re-planning at every time step could ensure safety, it introduces substantial computational overhead due to the repetitive prediction of overlapping state sequences, a process that is particularly costly with diffusion models, which are known for their intensive iterative sampling procedure. We propose an adaptive generative planning approach that dynamically adjusts re-planning frequency based on the uncertainty of action predictions. Our method minimizes the need for frequent, computationally expensive, and redundant re-planning while maintaining robust collision avoidance performance. In experiments, we obtain a 13.5% increase in the mean trajectory length and a 12.7% increase in mean reward over long-horizon planning, indicating a reduction in collision rates and an improved ability to navigate the environment safely.
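The core idea of uncertainty-triggered re-planning can be sketched in a few lines; the planner, uncertainty estimate, dynamics, and threshold below are all placeholder assumptions, not the paper's diffusion model.

```python
import numpy as np

def replan(state, horizon=32):
    """Placeholder for an expensive generative planner (e.g., a diffusion model).
    Returns a (horizon, action_dim) plan; here just small random actions."""
    return np.random.randn(horizon, 2) * 0.1

def action_uncertainty(plan, step, n_samples=8):
    """Placeholder uncertainty estimate: spread across hypothetical re-sampled plans.
    A real system might sample the generative model several times instead."""
    samples = np.stack([plan[step] + 0.05 * np.random.randn(*plan[step].shape)
                        for _ in range(n_samples)])
    return samples.std(axis=0).mean()

state = np.zeros(4)
plan, step, threshold = replan(state), 0, 0.08
for t in range(200):
    # Re-plan only when the current plan is exhausted or its next action looks uncertain.
    if step >= len(plan) or action_uncertainty(plan, step) > threshold:
        plan, step = replan(state), 0
    action = plan[step]
    state[:2] += action          # toy dynamics: integrate the commanded action
    step += 1
```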
|
|
17:00-17:05, Paper ThDT22.5 | |
MIVG: Mode-Isolated Velocity-Guide Algorithm for Quadratic Optimization-Based Obstacle Avoidance |
|
Lin, Hangyu | South China University of Technology |
Chen, Xiaoqi | South China University of Technology |
Cai, SongYin | South China University of Technology |
Lin, Xiangrui | South China University of Technology |
Wu, Kunpeng | South China University of Technology |
Keywords: Collision Avoidance, Redundant Robots, Kinematics
Abstract: Dynamic obstacle avoidance is a challenging problem in robotic control, with many algorithms developed to balance efficiency and real-time performance. Existing resolved-rate motion control (RRMC) methods formulate obstacle avoidance as a quadratic programming (QP) problem. However, the lack of directional guidance for obstacle avoidance and frequent constraint conflicts often lead to execution failures. In this work, we propose the Mode-Isolated Velocity-Guide (MIVG) algorithm, which deploys a dual-mode isolation strategy combined with a Velocity-Guide Potential Field (VGPF). This novel approach separates obstacle avoidance from target-driven tasks while providing velocity-based directional guidance. Simulations on a 7-degree-of-freedom Franka Emika Panda robot demonstrate that our approach significantly enhances task success rates while maintaining real-time feasibility, achieving an increase in execution success rates of 35.0% to 52.0% compared to the baseline RRMC strategy (NEO). Additionally, we analyze the impact of key parameters through simulations, further validating the effectiveness of the proposed algorithm in dynamic environments.
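For context, RRMC-style controllers of the kind referenced here typically solve a velocity-level QP at each control cycle; one common illustrative form, with a per-obstacle velocity-damper constraint, is sketched below. The MIVG-specific mode isolation and velocity-guide terms are not reproduced.

```latex
% Illustrative velocity-level QP solved each cycle (not the MIVG formulation):
\begin{aligned}
\min_{\dot{q}} \quad & \tfrac{1}{2}\,\big\|J(q)\,\dot{q} - v_{\mathrm{des}}\big\|^{2}
                       + \tfrac{\lambda}{2}\,\|\dot{q}\|^{2} \\
\text{s.t.} \quad    & \hat{n}^{\top} J_{p}(q)\,\dot{q}
                       \;\le\; \xi\,\frac{d - d_{s}}{d_{i} - d_{s}}
                       \quad \text{(velocity damper, one per nearby obstacle)} \\
                     & \dot{q}^{-} \;\le\; \dot{q} \;\le\; \dot{q}^{+},
\end{aligned}
% where d is the current obstacle distance, d_s the stop distance, d_i the influence
% distance, and \hat{n} the unit vector from the closest robot point to the obstacle.
```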
|
|
17:05-17:10, Paper ThDT22.6 | |
AccidentX: A Large-Scale Multimodal BEV Dataset for Traffic Accident Analysis and Prevention |
|
Zhang, Muyang | Institute of Automation, Chinese Academy of Sciences |
Feng, Zhe | Institute of Automation Chinese Academy of Sciences |
Yang, JinMing | Institute of Automation, Chinese Academy of Sciences |
Jia, Mingda | Institute of Automation, Chinese Academy of Sciences |
Meng, Weiliang | Institute of Automation, Chinese Academy of Sciences |
Wu, Wenxuan | U Oregon |
Zhang, Jiguang | CASIA |
Zhang, Xiaopeng | National Laboratory of Pattern Recognition, Institute of Automat |
Keywords: Collision Avoidance, Autonomous Vehicle Navigation, Intelligent Transportation Systems
Abstract: With the rapid development and widespread application of autonomous driving technology, the accurate analysis and prevention of traffic accidents have become critical challenges. However, current traffic accident datasets are often constrained by limited scale and diversity, impeding progress in this field. To address these limitations, we introduce AccidentX, a large-scale multimodal dataset specifically curated for comprehensive traffic accident analysis and prevention. Our AccidentX comprises over 10,000 bird's-eye view (BEV) videos generated using the CARLA simulator, with detailed annotations covering a wide range of traffic scenarios. In comparison to existing datasets such as nuScenes, our AccidentX offers seven times more video frames and leverages Vision-Language Models (VLMs) and GPT-4o for enhanced scene understanding and decision-making. We also establish a benchmark for state-of-the-art Multimodal Large Language Models (MLLMs) on AccidentX, fostering further research and innovation within the community. AccidentX will be made available as a fully open-source resource for the advancement of the autonomous driving safety algorithm community.
|
|
17:10-17:15, Paper ThDT22.7 | |
Collision Avoidance with Differentiable Occupancy Functions in Object Rearrangement |
|
Satoh, Roma | Institute of Science Tokyo |
Inoue, Nakamasa | Tokyo Institute of Technology |
Kawakami, Rei | Tokyo Institute of Technology |
Keywords: Collision Avoidance
Abstract: We address the challenge of object relocation by robots in environments where their behavior is expected to resemble that of humans. Existing methods typically learn to regress the position and orientation of objects specified by natural language commands using training data. However, these approaches do not account for physical constraints during training, often resulting in collisions between relocated objects. In this work, we introduce a collision avoidance loss based on functions that incorporate object size into the training process. Specifically, we propose a type of occupancy function in which particles are represented by a 3D Gaussian probability density function. By incorporating these functions into an additional training phase of existing models, we demonstrate a reduction in the number of collisions during rearrangement tasks. Notably, despite the decrease in collisions, the semantic structure of the relocation results is preserved.
|
|
ThDT23 |
102B |
Networked System and Telerobotics |
Regular Session |
Co-Chair: Mayer, Haley | University of Toronto |
|
16:40-16:45, Paper ThDT23.1 | |
Accurate Decentralized Information Communication for Effective Decisions in Robot Networks (I) |
|
Safwat, Mohamed | University of Washington, Seattle |
Devasia, Santosh | University of Washington |
Keywords: Networked Robots, Distributed Robot Systems
Abstract: Effective decision-making by robotic networks to collectively achieve coupled objectives requires accurate information communication. Typically, all environmental information (e.g., task values) is assumed to be centrally known by decision-making algorithms used for solving problems such as task assignment (TA). A challenge is that the task values might only emerge after information about the environment (that might only be available to some agents) is accurately shared by the agents. However, existing decentralized communication methods, such as the standard (consensus) method, can lead to distortion in the information as it diffuses between distant agents, resulting in large settling times for task values, which in turn can lead to ineffective decisions—even if consensus is eventually achieved in the TA. The main contribution of this work is to improve decision-making in robot networks by improving the accuracy of shared environmental information needed to compute task values using a noise-suppressing, delayed-self-reinforcement (DSR) approach that reduces the transient information distortion. DSR approximates the ideal, distortion-free, centralized information sharing using only decentralized information sharing, and does not require changes to the network topology or increased communication bandwidth. Furthermore, this work develops communication-error bounds for DSR in terms of the second time derivative of the communicated information. Experimental results show substantial improvement in the accuracy of the information with a 95% and 88% error reduction in position and speed information with the proposed method, respectively, when compared to the standard method, resulting in an 88% improvement in settling times for task values and a 100% successful task capture rate with the proposed information communication method as opposed to loss of task capture using the standard method.
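A minimal toy illustration of adding a delayed self-reinforcement term to a standard consensus update is sketched below; the graph, gains, and the exact form of the DSR term are assumptions made for illustration, not the authors' formulation or error bounds.

```python
import numpy as np

# Toy 5-agent line graph; L is the graph Laplacian.
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
L = np.diag(A.sum(axis=1)) - A

x = np.array([0.0, 0.2, 1.0, 0.4, 0.9])   # locally measured environmental values
x_prev = x.copy()
alpha, beta = 0.2, 0.5                     # gains; beta = 0 recovers standard consensus

for _ in range(100):
    consensus_step = -alpha * (L @ x)      # standard neighbor-averaging update
    dsr_step = beta * (x - x_prev)         # self-reinforcement from each agent's own delayed state
    x_prev = x.copy()
    x = x + consensus_step + dsr_step

print(x)   # all entries should approach a common shared value
```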
|
|
16:45-16:50, Paper ThDT23.2 | |
A Compact Dual-Mode Twisting Retraction Device for Endoscopic Submucosal Dissection |
|
Mayer, Haley | University of Toronto |
Shlomovitz, Eran | University Health Network |
Drake, James | Hospital for Sick Children, University of Toronto |
Looi, Thomas | Hospital for Sick Children |
Diller, Eric D. | University of Toronto |
Keywords: Medical Robots and Systems, Mechanism Design, Surgical Robotics: Laparoscopy
Abstract: Endoscopic submucosal dissection (ESD) is a technically difficult, minimally invasive, organ-preserving resection technique that yields improved clinical outcomes when compared to current conventional procedures but requires experienced surgeons and specialized skills. Difficulty in applying tension during ESD is recognized as the single greatest barrier to wide adoption of the procedure, and a solution to this problem is likely to see wide-reaching and immediate adoption. This work presents a compact wireless retraction device that is magnetically actuated and has a high force output with adaptable traction control. The retraction device is 25 mm long and 4 mm in diameter. The device has two modes of operation: first spooling to collect string slack, then transitioning via an external permanent magnet to internal string twisting to generate a large retraction force. In slack collection, the device can contract 11 cm in length at a speed of 6.88 millimeters per second, then clutch into force mode to reach a peak retraction force of 1.33 N, leveraging the micro-transmission twisted string actuation. The wireless device is designed for endoscopic deployment to any surgical environment or lesion within the gastrointestinal tract.
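The force amplification mentioned at the end follows the standard twisted-string-actuation kinematics; the sketch below uses the textbook contraction model with illustrative string dimensions that are not taken from the paper.

```python
import numpy as np

# Textbook twisted-string-actuation kinematics (illustrative, not the paper's values):
# twisting a string of untwisted length L and radius r by angle theta shortens it to
# sqrt(L^2 - (theta*r)^2). The contraction grows with twist while the effective
# transmission ratio rises sharply near full twist, which is how a small motor
# can deliver a large pull force.
L, r = 0.10, 0.0005                       # string length [m] and radius [m]
theta = np.linspace(0, 0.9 * L / r, 5)    # twist angle [rad], kept below the singularity
contraction = L - np.sqrt(L**2 - (theta * r) ** 2)
print(contraction)                        # contraction [m] at each twist angle
```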
|
|
16:50-16:55, Paper ThDT23.3 | |
Underwater Remote Intervention Based on Satellite Communication |
|
Yang, Xuejiao | Shenyang Institute of Automation, University of Chinese Academy |
Zhang, Qifeng | Shenyang Institute of Automation, CAS |
Zhang, Yunxiu | Shenyang Institute of Automation, CAS |
Qiao, Yuqi | Shenyang Institute of Automation, University of Chinese Academy |
Meng, Linghan | Shenyang Institute of Automation, University of Chinese Academy |
Keywords: Marine Robotics, Telerobotics and Teleoperation, Engineering for Robotic Systems
Abstract: This study proposes a novel solution for transferring the working environment of remotely operated vehicle (ROV) operators from the support ship at sea to land, including the establishment of a satellite-based communication link between the ocean and the land-based control center (LCC), which is used to transfer information efficiently. To alleviate the cognitive burden of latency on operators on land, a cross-domain underwater intervention hierarchical control architecture is designed to assist operators by introducing a shared control strategy. The effectiveness of the designed system and control strategy in realizing cross-domain underwater interventions is verified through field experiments.
|
|
16:55-17:00, Paper ThDT23.4 | |
Dual-Arm Teleoperated Robotic Microsurgery System with Live Volumetric OCT Image Feedback |
|
Liu, Jiawei | University of Michigan |
Ma, Guangshen | Duke University |
Zhou, Genggeng | Stanford University |
Pan, Haochi | University of Michigan |
Lam, Colin | University of Michigan |
Jin, Catherine | University of Michigan |
Valikodath, Nita | University of Michigan |
Draelos, Mark | University of Michigan |
Keywords: Medical Robots and Systems, Telerobotics and Teleoperation
Abstract: In microsurgery, surgeons frequently encounter challenges due to the need for exceptional precision and dexterity, the lack of depth perception for micro-scale surgical maneuvers, and the inevitable effects of fatigue and hand tremor. In surgical robotics, conventional intraoperative perception systems normally provide real-time image feedback, but depth and volumetric information is typically lacking. To overcome these challenges, we propose a teleoperated robotic system with two arms to provide high-fidelity intraoperative volumetric imaging during micro-scale tissue manipulation. This system incorporates an optical coherence tomography sensor for real-time 3D visualization and a dual-arm teleoperated robot system controlled by haptic input devices for accurate and precise manipulation. We characterize the system’s performance through a precision positioning task and a vessel following task in a retinal model, which shows average positioning errors of approximately 232 μm and 83 μm, respectively. We demonstrate the fully integrated system through the completion of an eggshell membrane peeling task that simulates retinal membrane peeling.
|
|
17:00-17:05, Paper ThDT23.5 | |
Optimal Motion Scaling for Delayed Telesurgery |
|
Lim, Jason | University of Nevada, Reno |
Richter, Florian | University of California, San Diego |
Chiu, Zih-Yun | University of California, San Diego |
Lee, Jaeyeon | US Army's Telemedicine and Advanced Technology Research Center ( |
Quist, Ethan | TATRC |
Fisher, Nathan | US Army Telemedicine and Advanced Technology Research Center |
Chambers, Jonathan | USARMY TATRC |
Hong, Steven | University of Michigan |
Yip, Michael C. | University of California, San Diego |
Keywords: Medical Robots and Systems, Telerobotics and Teleoperation, Surgical Robotics: Laparoscopy
Abstract: Robotic teleoperation over long communication distances poses challenges due to delays in commands and feedback from network latency. One simple yet effective strategy to reduce errors and increase performance under delay is to downscale the relative motion between the operating surgeon and the robot. The question remains what the optimal scaling factor is, and how this value changes with the level of latency as well as operator tendencies. We present user studies investigating the relationship between latency, scaling factor, and performance. The results of our studies demonstrate a statistically significant difference in performance between users and across scaling factors for certain levels of delay. These findings indicate that the optimal scaling factor for a given level of delay is specific to each user, motivating the need for personalized models for optimal performance. We present techniques to model the user-specific mapping of latency level to scaling factor for optimal performance, leading to an efficient and effective solution for optimizing the performance of robotic teleoperation, and specifically telesurgery, under large communication delays.
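A minimal sketch of what a personalized latency-to-scaling model could look like is given below; the calibration data, quadratic fit, and clipping bounds are invented for illustration and are not the authors' model.

```python
import numpy as np

# Hypothetical per-user calibration data: (round-trip latency in ms, scaling factor
# that gave this user's best task performance). Values are made up for illustration.
latency_ms = np.array([0.0, 250.0, 500.0, 750.0, 1000.0])
best_scale = np.array([1.0, 0.80, 0.60, 0.45, 0.35])

# A simple personalized model: quadratic fit of optimal scale versus latency.
coeffs = np.polyfit(latency_ms, best_scale, deg=2)
predict_scale = np.poly1d(coeffs)

new_latency = 600.0
scale = float(np.clip(predict_scale(new_latency), 0.1, 1.0))

leader_delta = np.array([2.0, -1.0, 0.5])      # operator hand motion increment [mm]
follower_delta = scale * leader_delta          # downscaled motion sent to the robot
print(scale, follower_delta)
```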
|
|
17:05-17:10, Paper ThDT23.6 | |
Human-Inspired Active Compliant and Passive Shared Control Framework for Robotic Contact-Rich Tasks in Medical Applications |
|
Fu, Junling | Politecnico Di Milano |
Maimone, Giorgia | Politecnico Di Milano |
Iovene, Elisa | Politecnico Di Milano |
Zhao, Jianzhuang | Istituto Italiano Di Tecnologia |
Redaelli, Alberto | Politecnico Di Milano |
Ferrigno, Giancarlo | Politecnico Di Milano |
De Momi, Elena | Politecnico Di Milano |
Keywords: Medical Robots and Systems, Telerobotics and Teleoperation, Compliance and Impedance Control, Human-Robot Collaboration
Abstract: This work presents a compliant and passive shared control framework for teleoperated robot-assisted tasks. Inspired by the human operator's capability of continuously regulating the arm impedance to perform contact-rich tasks, a novel control scheme, exploiting the variable impedance control framework for force tracking, is proposed. Moreover, bilateral teleoperation and shared control strategies are implemented to alleviate the human operator's workload. Furthermore, a global energy tank-based approach is integrated to enforce the system's passivity. The proposed framework is first evaluated to assess the force-tracking capability when the robot autonomously performs contact-rich tasks, e.g., in an ultrasound scanning scenario. Then, a validation experiment is conducted utilizing the proposed shared control framework. Finally, the system's usability is investigated with 12 users. The experimental results of the system assessment revealed a maximum median error of 0.25 N across all force-tracking experiment setups, i.e., constant and time-varying ones. The validation experiment demonstrated significant improvements in the force-tracking tasks compared to conventional control methods, and the system passivity was preserved during task execution. Finally, the usability experiment shows that the human operator workload is significantly reduced by 54.6% compared to the other two control modalities. The proposed framework holds significant potential for the execution of remote robot-assisted medical procedures, such as palpation and ultrasound scanning, particularly in addressing deformation challenges while ensuring safety, compliance, and system passivity.
|
|
17:10-17:15, Paper ThDT23.7 | |
MeSch: Multi-Agent Energy-Aware Scheduling for Task Persistence |
|
Naveed, Kaleb Ben | University of Michigan, Ann Arbor |
Dang, An | University of Michigan |
Harish Kumar, Rahul | University of Michigan, Ann Arbor |
Panagou, Dimitra | University of Michigan, Ann Arbor |
Keywords: Motion and Path Planning, Path Planning for Multiple Mobile Robots or Agents, Planning, Scheduling and Coordination
Abstract: This paper develops a scheduling protocol for a team of autonomous robots that operate on long-term persistent tasks. The proposed framework, called meSch, accounts for the limited battery capacity of the robots and ensures that the robots return to charge their batteries one at a time at a single charging station. The protocol is applicable to general nonlinear robot models under certain assumptions, does not require robots to be deployed at different times, and can handle robots with different discharge rates. We further consider the case when the charging station is mobile and its state information is subject to uncertainty. Feasibility of the algorithm in terms of ensuring persistent charging is established under certain assumptions, while the efficacy of meSch is validated through simulation and hardware experiments.
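A toy sketch of one-at-a-time charging scheduling is shown below to make the setting concrete; the selection rule and numbers are illustrative assumptions and do not reproduce the meSch protocol or its feasibility guarantees.

```python
import numpy as np

# Toy rule: whenever the single charging station is free, send the robot whose
# remaining battery, minus the energy needed to reach the station, is smallest.
battery    = np.array([0.62, 0.35, 0.80, 0.48])   # state of charge per robot
to_station = np.array([0.05, 0.10, 0.07, 0.04])   # energy needed to reach the charger
station_free = True

if station_free:
    margin = battery - to_station                 # energy margin if charging is delayed
    next_robot = int(np.argmin(margin))
    print(f"robot {next_robot} heads to the charger (margin {margin[next_robot]:.2f})")
```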
|
|
ThDT24 |
102C |
Educational and Emotional Robotics |
Regular Session |
|
16:40-16:45, Paper ThDT24.1 | |
Investigating the Impact of Humor on Learning in Robot-Assisted Education |
|
Hei, Xiaoxuan | ENSTA Paris, Institut Polytechnique De Paris |
Zhang, Heng | ENSTA Paris, Institut Polytechnique De Paris |
Tapus, Adriana | ENSTA Paris, Institut Polytechnique De Paris |
Keywords: Education Robotics, Design and Human Factors
Abstract: Social robots have shown significant potential in enhancing learning experiences, and humor has been proven to be beneficial for learning. This study investigates the impact of both the presence and timing of humor on students’ learning outcomes and overall learning experience. A total of 24 participants were randomly assigned to one of the three conditions: (C1) interact with a robot with no humor, (C2) interact with a robot with humor at pre-defined moments during the lesson, and (C3) interact with a robot that triggers humor based on engagement levels. The results revealed that the humor at pre-defined moments condition (C2) led to significantly better learning outcomes and longer interaction times compared to the other two conditions. While the adaptive humor in Condition C3 did not significantly outperform Condition C1, it showed positive effects on participants' perceived learning effectiveness and engagement. These findings contribute to the understanding of how humor, when strategically timed, can enhance the effectiveness of social robots in educational settings.
|
|
16:45-16:50, Paper ThDT24.2 | |
Robotics Virtual Laboratory Featuring Serial, Parallel, Wheeled Robots, and Autonomous Off-Road Vehicles, and Covering Analysis, Control and Sensors |
|
Nasrallah, Danielle Sami | Concordia University |
Haddad, Georges | OPAL-RT TECHNOLOGIES |
Chrabieh, Angelo | OPAL-RT TECHNOLOGIES |
Keywords: Education Robotics, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: The presence of robots is growing rapidly throughout the world. The robotics education community should follow the trend and modernize the tools used accordingly. The authors introduce here a Robotic Virtual Laboratory that runs in real time and covers four categories of robots, namely serial manipulators, parallel manipulators, wheeled mobile robots, and autonomous off-road vehicles obtained by combining a wheeled platform with a serial manipulator. The topics covered start with motion analysis, continue with control design, and end with sensor perception, thus rendering this lab an essential tool for robotics and control engineers as well as computer scientists. The use of (i) a physics-based engine, (ii) real-time simulation, and (iii) a co-simulation framework forms the backbone of this laboratory. The results of a survey of students who participated in this pilot project are shown.
|
|
16:50-16:55, Paper ThDT24.3 | |
Educational SoftHand-A: Building an Anthropomorphic Hand with Soft Synergies Using LEGO® MINDSTORMS® |
|
Lepora, Jared | Bristol Grammar School |
Li, Haoran | University of Bristol |
Psomopoulou, Efi | University of Bristol |
Lepora, Nathan | University of Bristol |
Keywords: Education Robotics, Multifingered Hands, Underactuated Robots
Abstract: This paper introduces an anthropomorphic robot hand built entirely using LEGO MINDSTORMS: the Educational SoftHand-A, a tendon-driven, highly-underactuated robot hand based on the Pisa/IIT SoftHand and related hands. To be suitable for an educational context, the design is constrained to use only standard LEGO pieces with tests using common equipment available at home. The hand features dual motors driving an agonist/antagonist opposing pair of tendons on each finger, which are shown to result in reactive fine control. The finger motions are synchronized through soft synergies, implemented with a differential mechanism using clutch gears. Altogether, this design results in an anthropomorphic hand that can adaptively grasp a broad range of objects using a simple actuation and control mechanism. Since the hand can be constructed from LEGO pieces and uses state-of-the-art design concepts for robotic hands, it has the potential to educate and inspire children to learn about the frontiers of modern robotics.
|
|
16:55-17:00, Paper ThDT24.4 | |
AirSwarm: Enabling Cost-Effective Multi-UAV Research with COTS Drones |
|
Li, Xiaowei | Nanyang Technological University |
Xu, Kuan | Nanyang Technological University |
Liu, Fen | Nanyang Technological University |
Bai, Ruofei | Nanyang Technological University |
Yuan, Shenghai | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Education Robotics, Art and Entertainment Robotics, Aerial Systems: Applications
Abstract: Traditional unmanned aerial vehicle (UAV) swarm missions rely heavily on expensive custom-made drones with onboard perception or external positioning systems, limiting their widespread adoption in research and education. To address this issue, we propose AirSwarm. AirSwarm democratizes multi-drone coordination using low-cost commercially available drones such as Tello or Anafi, enabling affordable swarm aerial robotics research and education. Key innovations include a hierarchical control architecture for reliable multi-UAV coordination, an infrastructure-free visual SLAM system for precise localization without external motion capture, and a ROS-based software framework for simplified swarm development. Experiments demonstrate cm-level tracking accuracy, low-latency control, communication failure resistance, formation flight, and trajectory tracking. By reducing financial and technical barriers, AirSwarm makes multi-robot education and research more accessible. The complete instructions and open source code will be available at https://github.com/vvEverett/tello_ros.
|
|
17:00-17:05, Paper ThDT24.5 | |
A Human Reasons-Based Supervision Framework for Ethical Decision-Making in Automated Vehicles |
|
Suryana, Lucas Elbert | Delft University of Technology |
Rahmani, Saeed | Delft University of Technology |
Calvert, Simeon Craig | Delft University of Technology |
Zgonnikov, Arkady | Delft University of Technology |
van Arem, Bart | Delft University of Technology |
Keywords: Ethics and Philosophy, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: Ethical dilemmas are a common challenge in everyday driving, requiring human drivers to balance competing priorities such as safety, efficiency, and rule compliance. However, much of the existing research in automated vehicles (AVs) has focused on high-stakes "trolley problems," which involve extreme and rare situations. Such scenarios, though rich in ethical implications, are rarely applicable in real-world AV decision-making. In practice, when AVs confront everyday ethical dilemmas, they often appear to prioritise strict adherence to traffic rules. By contrast, human drivers may bend the rules in context-specific situations, using judgement informed by practical concerns such as safety and efficiency. According to the concept of meaningful human control, AVs should respond to human reasons, including those of drivers, vulnerable road users, and policymakers. This work introduces a novel human reasons-based supervision framework that detects when AV behaviour misaligns with expected human reasons to trigger trajectory reconsideration. The framework integrates with motion planning and control systems to support real-time adaptation, enabling decisions that better reflect safety, efficiency, and regulatory considerations. Simulation results demonstrate that this approach could help AVs respond more effectively to ethical challenges in dynamic driving environments by prompting replanning when the current trajectory fails to align with human reasons. These findings suggest that our approach offers a path toward more adaptable, human-centered decision-making in AVs.
|
|
17:05-17:10, Paper ThDT24.6 | |
Recognizing and Generating Novel Emotional Behaviors on Two Robotic Platforms |
|
Baral, Rista | Boise State University |
Grenz, Bethany | Boise State University |
Kennington, Casey | Boise State University |
Keywords: Emotional Robotics, AI-Based Methods, Social HRI
Abstract: Recent advancements in language modeling have enabled robots to more easily generate complex behaviors. However, ensuring that the generated behaviors align with the intended emotional states of the robot is necessary in many domains where robots are used. In this paper, we present an adversarial-like training regime in which a generative model of emotional behavior is enhanced through feedback from both an emotion discriminator and a novelty loss, to ensure that the generated behaviors are non-redundant. Our generative model, fine-tuned on a dataset of robot behaviors labeled with emotions, generates behavior sequences perceived as reflecting the emotional qualities of the input emotion labels. Through our training regime, the generative model is refined by minimizing the discrepancies in both emotion classification and behavioral novelty. We evaluated our approach through multiple experiments and human evaluations, where participants were asked to appraise the emotions conveyed by robot behaviors and rate the novelty of the behaviors. Experimental results demonstrate that our two models, one for classifying and one for generating emotional behaviors, are effective, with the generative model producing emotionally rich behaviors that differ from previously generated outputs.
|
|
17:10-17:15, Paper ThDT24.7 | |
Awakening Facial Emotional Expressions in Human-Robot |
|
Zhu, Yongtong | University of Shanghai for Science and Technology |
Li, Lei | University of Shanghai for Science and Technology |
Qian, Zeyu | Zhejiang University of Technology |
Zhou, Wenbin | Shanghai Droid Robot Co., Ltd |
Yuan, Ye | USST |
Li, Qingdu | University of Shanghai for Science and Technology |
Liu, Na | University of Shanghai for Science and Technology |
Zhang, Jianwei | University of Hamburg |
Keywords: Emotional Robotics, Learning from Demonstration, Data Sets for Robotic Vision
Abstract: The facial expression generation capability of humanoid social robots is critical for achieving natural and human-like interactions, playing a vital role in enhancing the fluidity of human-robot interactions and the accuracy of emotional expression. Currently, facial expression generation in humanoid social robots still relies on pre-programmed behavioral patterns, which are manually coded at high human and time costs. To enable humanoid robots to autonomously acquire generalized expressive capabilities, they need to develop the ability to learn human-like expressions through self-training. To address this challenge, we have designed a highly biomimetic robotic face with physical-electronic animated facial units and developed an end-to-end learning framework based on KAN (Kolmogorov-Arnold Network) and attention mechanisms. Unlike previous humanoid social robots, we have also meticulously designed an automated data collection system based on expert strategies of facial motion primitives to construct the dataset. Notably, to the best of our knowledge, this is the first open-source facial dataset for humanoid social robots. Comprehensive evaluations indicate that our approach achieves accurate and diverse facial mimicry across different test subjects.
|
|
17:15-17:20, Paper ThDT24.8 | |
OMEGA: Open-Source and Multi-Mode Hopping Platform for Educational and Groundwork Aims |
|
Chu, Xiangyu | The Chinese University of Hong Kong |
Wong, Fei Yan Fiat | The Chinese University of Hong Kong |
Fan, Chun Yin | Chinese University of Hong Kong |
Zhang, Hongbo | The Chinese University of Hong Kong |
Chen, Yanlin | South China University of Technology |
Au, K. W. Samuel | The Chinese University of Hong Kong |
Keywords: Education Robotics, Legged Robots
Abstract: This paper presents OMEGA, a new open-source, multi-mode hopping platform. It consists of a rig and a middle-size robot equipped with an omnidirectional parallel 3-RSR leg, allowing for 1D, 2D, and 3D hopping modes. All modes can be easily interchanged via detachable mechanisms. A control framework is developed to operate all modes based on a 3D SLIP model. To our knowledge, few middle-size monopod robots can locomote in the field, making OMEGA a complementary addition to existing legged platforms. This versatile solution uses accessible manufacturing technologies such as 3D printing and water-jet cutting, and the implementation of detachable mechanisms, enabling operators to explore legged dynamic motion with a single robot across different modes, instead of requiring multiple robots for different purposes. A simulator is developed for initial hopping control learning. Extensive experiments in 1D/2D tethered and 3D untethered modes demonstrate the platform's mobility and versatility. The proposed platform has the potential to serve both educational and groundwork aims.
|
|
ThDT25 |
103A |
Planning and AI-Based Methods |
Regular Session |
Chair: Chen, Fei | T-Stone Robotics Institute, the Chinese University of Hong Kong |
Co-Chair: Yin, Xiang | Shanghai Jiao Tong Univ |
|
16:40-16:45, Paper ThDT25.1 | |
Generating Actionable Robot Knowledge Bases by Combining 3D Scene Graphs with Robot Ontologies |
|
Nguyen, Giang | University of Bremen |
Pomarlan, Mihai | Universitatea Politehnica Timisoara |
Jongebloed, Sascha | University of Bremen |
Leusmann, Nils | University of Bremen |
Beetz, Michael | University of Bremen |
Vu, Minh Nhat | TU Wien, Austria |
Keywords: Semantic Scene Understanding, Simulation and Animation, Task Planning
Abstract: In robotics, the effective integration of environmental data into actionable knowledge remains a significant challenge due to the variety and incompatibility of data formats commonly used in scene descriptions, such as MJCF, URDF, and SDF. This paper presents a novel approach that addresses these challenges by developing a unified scene graph model that standardizes these varied formats into the Universal Scene Description (USD) format. This standardization facilitates the integration of these scene graphs with robot ontologies through semantic reporting, enabling the translation of complex environmental data into actionable knowledge essential for cognitive robotic control. We evaluated our approach by converting procedural 3D environments into USD format, which is then annotated semantically and translated into a knowledge graph to effectively answer competency questions, demonstrating its utility for real-time robotic decision-making. Additionally, we developed a web-based visualization tool to support the semantic mapping process, providing users with an intuitive interface to manage the 3D environment.
|
|
16:45-16:50, Paper ThDT25.2 | |
Hierarchical Reactive Task Planning with Temporal Logic and Visual Servoing for Bolt-Tightening Robots in Transmission Towers |
|
You, Junyi | Hefei University of Technology |
Du, Haibo | School of Electrical Engineering and Automation, Hefei Universit |
Keywords: Task and Motion Planning, Climbing Robots, Reactive and Sensor-Based Planning
Abstract: With the rapid deployment of unmanned bolt-tightening robots on transmission towers, traditional control algorithms face challenges in balancing long-term task logic and real-time adaptability, especially in unstructured conditions such as missing bolts and unexpected obstacles. This paper proposes HTP-TV, a framework for hierarchical task planning with temporal logic and visual servoing, which integrates temporal logic-based planning with a vision-based reactive mechanism. HTP-TV decouples semantic goals such as bolt-tightening sequences from geometric path planning, enabling offline pre-planning via LTL-RRT* to generate constraint-compliant trajectories. In the online phase, real-time camera data dynamically updates the environmental model, which triggers adjustments in the incremental Büchi automaton to address missing bolts or obstacles. A semantic ID system encodes the bolt topology, supporting re-planning with axial constraints, while visual servoing techniques correct execution deviations. Through comparisons with two baseline methods (offline LTL-RRT* and an online RRT*-based planner), simulation results in CoppeliaSim demonstrate the efficiency, high safety compliance, and superiority of HTP-TV.
|
|
16:50-16:55, Paper ThDT25.3 | |
Language As Cost: Proactive Hazard Mapping Using VLM for Robot Navigation |
|
Oh, Mintaek | Seoul National University |
Kim, Chan | Seoul National University |
Seo, Seung-Woo | Seoul National University |
Kim, Seong-Woo | Seoul National University |
Keywords: Vision-Based Navigation, Human-Aware Motion Planning, Sensor Fusion
Abstract: Robots operating in human-centric or hazardous environments must proactively anticipate and mitigate dangers beyond basic obstacle detection. Traditional navigation systems often depend on static maps, which struggle to account for dynamic risks, such as a person emerging from a suddenly opened door. As a result, these systems tend to be reactive rather than anticipatory when handling dynamic hazards. Recent advancements in pre-trained large language models and vision-language models (VLMs) present new opportunities for proactive hazard avoidance. In this work, we propose a zero-shot language-as-cost mapping framework that leverages VLMs to interpret visual scenes, assess potential dynamic risks, and assign risk-aware navigation costs preemptively, enabling robots to anticipate hazards before they materialize. By integrating this language-based cost map with a geometric obstacle map, the robot not only identifies existing obstacles but also anticipates and proactively plans around potential hazards arising from environmental dynamics. Experiments in simulated and diverse dynamic environments demonstrate that the proposed method significantly improves navigation success rates and reduces hazard encounters compared to reactive baseline planners. Code and supplementary materials are available at https://github.com/Taekmino/LaC.
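The fusion of a VLM-derived risk map with a geometric obstacle map can be pictured as a weighted sum of two grids, as in the illustrative sketch below; the grid contents, weights, and the notion of a `semantic_risk` map are assumptions, not the paper's pipeline.

```python
import numpy as np

H, W = 64, 64
geometric_occupancy = np.zeros((H, W))       # 1 = occupied, e.g. from lidar/depth mapping
geometric_occupancy[30:34, 20:40] = 1.0      # a wall segment

# Hypothetical VLM output: per-region risk in [0, 1] for anticipated hazards
# (e.g., "a door that may open", "corner with poor visibility"). Made up here.
semantic_risk = np.zeros((H, W))
semantic_risk[28:38, 40:46] = 0.8            # area in front of a closed door

w_geom, w_lang = 10.0, 4.0
cost_map = w_geom * geometric_occupancy + w_lang * semantic_risk
# A downstream planner (A*, MPC, etc.) would then minimize accumulated cost over
# this map, detouring around anticipated hazards before anything is seen moving.
```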
|
|
16:55-17:00, Paper ThDT25.4 | |
Open-World Task Planning for Humanoid Bimanual Dexterous Manipulation Via Vision-Language Models |
|
Tang, Zixin | The Chinese University of Hong Kong |
Li, Zhihao | The Chinese University of Hong Kong |
Liu, Junjia | The Chinese University of Hong Kong |
Li, Zhuo | The Chinese University of Hong Kong |
Chen, Fei | T-Stone Robotics Institute, the Chinese University of Hong Kong |
Keywords: Task Planning, Bimanual Manipulation, Dexterous Manipulation
Abstract: Open-world task planning, characterized by handling unstructured and dynamic environments, has been increasingly explored to integrate with long-horizon robotic manipulation tasks. However, existing evaluations of the capabilities of these planners primarily focus on single-arm systems in structured scenarios with limited skill primitives, which is insufficient for numerous bimanual dexterous manipulation scenarios prevalent in the real world. To this end, we introduce OBiMan-Bench, a large-scale benchmark designed to rigorously evaluate open-world planning capabilities in bimanual dexterous manipulation, including task-scenario grounding, workspace constraint handling, and long-horizon cooperative reasoning. In addition, we propose OBiMan-Planner, a vision-language model-based zero-shot planning framework tailored for bimanual dexterous manipulation. OBiMan-Planner comprises two key components, the scenario grounding module for grounding open-world task instructions with specific scenarios and the task planning module for generating sequential stages. Extensive experiments on OBiMan-Bench demonstrate the effectiveness of our method in addressing complex bimanual dexterous manipulation tasks in open-world scenarios. The code, benchmark, and supplementary material are released at https://github.com/Zixin-Tang/OBiMan.
|
|
17:00-17:05, Paper ThDT25.5 | |
Jacobian Exploratory Dual-Phase Reinforcement Learning for Dynamic Surgical Navigation of Deformable Continuum Robots |
|
Tian, Yu | The Chinese University of Hong Kong |
Ng, Chi Kit | The Chinese University of Hong Kong |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Surgical Robotics: Planning, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Deformable continuum robots (DCRs) present unique planning challenges due to nonlinear deformation mechanics and partial state observability, violating the Markov assumptions of conventional reinforcement learning (RL) methods. While Jacobian-based approaches offer theoretical foundations for rigid manipulators, their direct application to DCRs remains limited by time-varying kinematics and underactuated deformation dynamics. This paper proposes Jacobian Exploratory Dual-Phase RL (JEDP-RL), a framework that decomposes planning into phased Jacobian estimation and policy execution. During each training step, we first perform small-scale local exploratory actions to estimate the deformation Jacobian matrix, then augment the state representation with Jacobian features to restore approximate Markovianity. Extensive SOFA surgical dynamic simulations demonstrate JEDP-RL's three key advantages over proximal policy optimization (PPO) baselines: 1) convergence speed: 3.2× faster policy convergence; 2) navigation efficiency: 25% fewer steps to reach the target; and 3) generalization ability: a 92% success rate under material property variations and an 83% success rate (33% higher than PPO) in an unseen tissue environment.
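The exploratory Jacobian-estimation phase can be approximated with finite differences over small probing actions, as in the hedged sketch below; the interface functions and step size are hypothetical placeholders, not the authors' procedure.

```python
import numpy as np

def estimate_jacobian(read_tip_position, apply_action, q, eps=1e-2):
    """Finite-difference estimate of the deformation Jacobian from small exploratory
    actuations. `read_tip_position` and `apply_action` are placeholders for the
    simulator / robot interface; this is an illustrative stand-in only."""
    p0 = np.asarray(read_tip_position())
    J = np.zeros((len(p0), len(q)))
    for i in range(len(q)):
        dq = np.zeros(len(q))
        dq[i] = eps
        apply_action(q + dq)                 # small local exploratory action
        J[:, i] = (np.asarray(read_tip_position()) - p0) / eps
        apply_action(q)                      # return to the nominal configuration
    return J

# The flattened Jacobian can then be appended to the RL observation so the policy
# sees an (approximately) Markovian state: obs = np.concatenate([raw_state, J.ravel()]).
```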
|
|
17:05-17:10, Paper ThDT25.6 | |
ET-Plan-Bench: Embodied Task-Level Planning Benchmark towards Spatial-Temporal Cognition with Foundation Models |
|
Zhang, Lingfeng | Huawei Noah's Ark Lab Canada |
Wang, Yuening | Huawei Noah's Ark Lab |
Hongjian, Gu | Huawei |
Hamidizadeh, Atia | Huawei |
Zhang, Zhanguang | Huawei Noah's Ark Lab |
Liu, Yuecheng | Huawei Noah's Ark Lab |
Wang, Yutong | Huawei |
Arcos Bravo, David Gamaliel | Huawei |
Dong, Junyi | Cornell University |
Zhou, Shunbo | Huawei |
Cao, Tongtong | Noah's Ark Lab, Huawei Technologies |
Quan, Xingyue | Huawei |
Zhuang, Yuzheng | Huawei Technologies Company |
Zhang, Yingxue | Huawei Noah's Ark Lab |
Hao, Jianye | Noah's Ark Lab |
Keywords: Task Planning, Data Sets for Robot Learning, Agent-Based Systems
Abstract: Recent advancements in Large Language Models (LLMs) have catalyzed numerous efforts to apply these technologies to embodied tasks, with a particular focus on high-level task planning and task decomposition. LLMs face challenges in understanding the physical world, especially regarding spatial, temporal, and causal relationships among objects and actions. Moreover, the current benchmarks for evaluating these relationships are limited. To further investigate this domain, we introduce a novel embodied task planning benchmark, ET-Plan-Bench. This benchmark features a controllable and diverse array of embodied tasks, varying in levels of difficulty and complexity. It is designed to evaluate two critical dimensions of LLMs' application in embodied task understanding: spatial understanding (including relation constraints and occlusion of target objects) and temporal and causal comprehension of sequences of actions within an environment. Utilizing multi-source simulators as the backend simulator, ET-Plan-Bench provides immediate environmental feedback to LLMs, enabling dynamic interaction with the environment and the capacity for re-planning as necessary. We evaluated state-of-the-art open-source and closed-source foundational models, including GPT-4, Llama, and Mistral, using our proposed benchmark. While these models perform adequately on simple navigation tasks, their performance significantly deteriorates when confronted with tasks that demand a deeper understanding of spatial, temporal, and causal relationships. Consequently, our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework that presents a substantial challenge to the latest foundational models. We hope it will inspire and propel further research in embodied task planning utilizing foundational models.
|
|
ThDT26 |
103B |
Human and Humanoid Motion Analysis and Synthesis |
Regular Session |
|
16:40-16:45, Paper ThDT26.1 | |
MotionScript: Natural Language Descriptions for Expressive 3D Human Motions |
|
Jome Yazdian, Payam | Simon Fraser University |
Lagasse, Rachel | Simon Fraser University |
Hamid, Mohammadi | University of Alberta |
Liu, Eric | Simon Fraser University |
Cheng, Li | University of Alberta |
Lim, Angelica | Simon Fraser University |
Keywords: Datasets for Human Motion, Human and Humanoid Motion Analysis and Synthesis, Gesture, Posture and Facial Expressions
Abstract: We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. Unlike existing motion datasets that rely on broad action labels or generic captions, MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement—including expressive actions (e.g., emotions, stylistic walking) and interactions beyond standard motion capture datasets. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models, enabling the synthesis of highly realistic and diverse human motions from text. By augmenting motion datasets with MotionScript captions, we demonstrate significant improvements in out-of-distribution motion generation, allowing large language models (LLMs) to generate motions that extend beyond existing data. Additionally, MotionScript opens new applications in animation, virtual human simulation, and robotics, providing an interpretable bridge between intuitive descriptions and motion synthesis. To the best of our knowledge, this is the first attempt to systematically translate 3D motion into structured natural language without requiring training data. Code, dataset, and examples are available at https://pjyazdian.github.io/MotionScript
|
|
16:45-16:50, Paper ThDT26.2 | |
Recognizing Skeleton-Based Actions As Points |
|
Yin, Baiqiao | Sun Yat-Sen University |
Lin, Jiaying | Sun Yat-Sen University |
Wen, Jiajun | Sun Yat-Sen University |
Li, Yue | Sun Yat-Sen University |
Liu, Jinfu | Sun Yat-Sen University |
Wang, Yanfei | Sun Yat-Sen University |
Liu, Mengyuan | Peking University |
Keywords: Recognition, Human and Humanoid Motion Analysis and Synthesis
Abstract: Recent advances in skeleton-based action recognition have been primarily driven by Graph Convolutional Networks (GCNs) and skeleton transformers. While conventional approaches focus on modeling joint co-occurrences through skeletal connections, they overlook the inherent positional information in 3D coordinates. Although hyper-graphs partially address the limitation of pairwise aggregation in capturing higher-order kinematic dependencies, challenges remain in their topological definitions. To solve these problems, this paper proposes a skeleton-to-point network (Skeleton2Point) to model joints' position relationships directly in three-dimensional space without fixed-topology limitations, which is the first to regard skeleton-based action recognition as a point cloud problem. However, simply considering the raw 3D coordinates would result in the loss of the anatomical identity of each keypoint and its temporal position in the sequence. To address this limitation, we augment the three-dimensional spatial coordinates with two additional dimensions, the anatomical index of each keypoint and its corresponding frame number, using a proposed Information Transform Module (ITM). This transformation extends the representation from a three-dimensional to a five-dimensional feature space. Furthermore, we propose a Cluster-Dispatch-based Interaction module (CDI) to enhance the discrimination of local-global information. In comparison with existing methods on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, Skeleton2Point has demonstrated state-of-the-art performance on both joint modality and stream fusion. In particular, on the challenging NTU-RGB+D 120 dataset under the X-Sub and X-Set settings, the accuracies reach 90.63% and 91.92%.
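The five-dimensional augmentation described here (3D coordinates plus joint index and frame index) is easy to make concrete; the sketch below is an illustrative reading of that step, and the normalization choices are assumptions.

```python
import numpy as np

def skeleton_to_points(seq):
    """seq: (T, J, 3) array of 3D joint coordinates over T frames.
    Returns a (T*J, 5) point set of (x, y, z, joint_index, frame_index),
    with the two index channels normalized to [0, 1] (normalization is assumed)."""
    T, J, _ = seq.shape
    frame_idx = np.repeat(np.arange(T), J)[:, None] / max(T - 1, 1)
    joint_idx = np.tile(np.arange(J), T)[:, None] / max(J - 1, 1)
    return np.concatenate([seq.reshape(T * J, 3), joint_idx, frame_idx], axis=1)

points = skeleton_to_points(np.random.randn(64, 25, 3))   # e.g., NTU-style 25 joints
print(points.shape)   # (1600, 5)
```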
|
|
16:50-16:55, Paper ThDT26.3 | |
VET: A Visual-Electronic Tactile System for Immersive Human-Machine Interaction |
|
Zhang, Cong | Tsinghua University |
Yang, Yisheng | Tsinghua University |
Mu, Shilong | Tsinghua University |
Lyu, Chuqiao | Tsinghua Shenzhen International Graduate School |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Chai, Xinyue | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Keywords: Haptics and Haptic Interfaces, Force and Tactile Sensing, Virtual Reality and Interfaces
Abstract: In pursuit of deeper immersion in human-machine interaction, researchers have widely explored new approaches to human-machine interface design. Achieving higher-dimensional information input and output on a single interface has become a key research focus. This study introduces the visual-electronic tactile (VET) System, which builds upon vision-based tactile sensors (VBTS) and integrates electrical stimulation feedback to enable duplex communication. While VBTS recognizes multi-dimensional input information through visuotactile signals, the integration of electrical stimulation feedback avoids interference with visuotactile information by directly stimulating neural pathways. The practicality of the VET system has been demonstrated through experiments on finger electrical stimulation sensitivity zones, as well as its applications in flight simulation games and robotic arm teleoperation. By utilizing the VET system, users generally achieve significantly reduced game completion times compared to using a mouse and keyboard while also enhancing grasping efficiency in robotic arm teleoperation.
|
|
16:55-17:00, Paper ThDT26.4 | |
Tracking Highly Dynamic Humanoid Motion with Dynamic IMU Measurement Fusion |
|
Cox, Jeronimo | University of Virginia |
Zhang, Wei | University of Virginia |
Furukawa, Tomonari | University of Virginia |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Kinematics, Sensor Fusion
Abstract: Inertial sensing estimation methods allow human motion tracking in the absence of optical tracking and joint encoders, but such methods are largely developed for quasistatic motion due to the limited motion capability of humanoids. This paper presents a new method that tracks highly dynamic motion using Inertial Measurement Unit (IMU) measurements. Unlike conventional methods dependent on quasistatic motion for inclination correction with the measured gravity vector, the proposed method uses accelerometers to correct the rotational rate. This is achieved by placing sensors on the ends of links and converting the acceleration measured at the ends to angular rate based on centrifugal forces. Human motions of low and high intensity are measured to identify the strengths and weaknesses of the proposed method in different applications. The proposed technique maintains an acceptable error for both quasistatic and highly dynamic motions and can be used to accurately visualize measured motions.
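The acceleration-to-angular-rate conversion rests on the centripetal relation a_c = omega^2 * r, so the rate magnitude follows as omega = sqrt(a_c / r); the numbers below are illustrative only.

```python
import numpy as np

# For a sensor mounted at distance r from a joint axis, the centripetal component of
# the measured acceleration satisfies a_c = omega^2 * r, so the joint rate magnitude
# can be recovered (up to sign) as omega = sqrt(a_c / r).
r = 0.30                        # sensor lever arm from the joint axis [m] (illustrative)
a_c = 4.8                       # centripetal acceleration along the link [m/s^2]
omega = np.sqrt(a_c / r)        # about 4.0 rad/s
print(f"estimated joint rate: {omega:.2f} rad/s")
```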
|
|
17:00-17:05, Paper ThDT26.5 | |
Normalized Triangulation for Calibrated Dual-View 3D Human Pose Estimation |
|
Zhang, Zijian | Beijing University of Posts and Telecommunications |
Muqing, Wu | Beijing University of Posts and Telecommunications |
Ma, Tianyi | Beijing University of Posts and Telecommunications |
Keywords: Deep Learning Methods, Human and Humanoid Motion Analysis and Synthesis, Gesture, Posture and Facial Expressions
Abstract: In this work, we decouple calibrated dual-view 3D human pose estimation (HPE) into the well-studied problems of 2D pose estimation and 2D-to-3D pose lifting, focusing on the latter task. The key challenges are that: 1) the 2D pose is noisy and unreliable due to occlusion and motion blur, and 2) the trained model cannot generalize well to unseen camera configurations. To overcome these limitations, we propose three interconnected innovations: First, a Normalized Triangulation that transforms the 2D pose from pixel space to 3D normalized rays, which makes our approach robust to changes in camera parameters. Second, a hybrid neural-geometry framework (i.e., refinement and triangulation) that explicitly incorporates multi-view geometry into our models. Third, an analytical inverse kinematics (AnalyIK) solver that decomposes articulated motion with human topology, which simultaneously considers symmetry constraints and joint angle limits. Experiments show that the proposed framework achieves state-of-the-art performance on two widely used benchmarks (i.e., Human3.6M and HumanEva-I). Code is available at: https://github.com/Z-Z-J/Normalized-Triangulation.
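Transforming 2D keypoints from pixel space to normalized 3D rays is the standard back-projection through the camera intrinsics; the helper below is an illustrative sketch of that step, not the authors' code.

```python
import numpy as np

def pixel_to_normalized_ray(uv, K):
    """Back-project a 2D keypoint (u, v) through intrinsics K into a unit-norm 3D ray
    in the camera frame, removing the dependence on pixel scale and principal point
    (illustrative helper only)."""
    uv1 = np.array([uv[0], uv[1], 1.0])
    ray = np.linalg.inv(K) @ uv1
    return ray / np.linalg.norm(ray)

K = np.array([[1000.0,    0.0, 640.0],     # made-up intrinsics for illustration
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
print(pixel_to_normalized_ray((700.0, 300.0), K))
```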
|
|
17:05-17:10, Paper ThDT26.6 | |
LS-HAR: Language Supervised Human Action Recognition with Salient Fusion, Construction Sites As a Use-Case |
|
Mahdavian, Mohammad | Simon Fraser University |
Loni, Mohammad | MDU/VCE |
Samuelsson, Ted | Volvo Construction Equipment |
Chen, Mo | Simon Fraser University |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Multi-Modal Perception for HRI, Deep Learning for Visual Perception
Abstract: Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) using language supervision named LS-HAR based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset serves to facilitate the training and evaluation of machine learning models to instruct autonomous construction machines for performing necessary tasks in real-world construction sites. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The code, dataset, and demonstration of real-machine experiments are available at: https://mmahdavian.github.io/ls_har/
|
|
17:10-17:15, Paper ThDT26.7 | |
Robust and Expressive Humanoid Motion Retargeting Via Optimization-Based Rig Unification |
|
Jeong, Taemoon | Korea University |
Byun, Taehyun | Korea University |
Kim, Jihoon | CINAMON |
Choi, Keunjun | Rainbow Robotics |
Oh, Jaesung | NAVER LABS |
Lee, SungPyo | NAVER LABS |
Darwish, Omar | University of Illinois Urbana-Champaign |
Kim, Joohyung | University of Illinois Urbana-Champaign |
Choi, Sungjoon | Korea University |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Optimization and Optimal Control
Abstract: Humanoid robots are increasingly being developed for seamless interaction with humans in diverse domains, yet generating expressive and physically-feasible motions remains a core challenge. We propose a robust and automated pipeline for motion retargeting that enables the generation of natural, stable, and highly expressive motions for a wide variety of humanoid robots using different motion data sources, including noisy pose estimations. To ensure robustness, our approach unifies motions from different kinematic structures into a common canonical rig, systematically refines the motion trajectory to address infeasible poses, enforces foot-contact constraints, and enhances stability. The retargeted motion is then refined to closely follow the source motion while respecting each robot's physical limits. Through extensive experiments on 12 simulated robots and validation on three real robots, we show that our methodology reliably produces expressive upper-body movements with consistent foot contact. This work represents an important step towards automating robust and expressive motion generation for humanoid robots, enabling deployment in various real-world scenarios.
|
|
17:15-17:20, Paper ThDT26.8 | |
Motion Capture-Based Robotic Imitation: A Keyframeless Implementation Method Using Multivariate Empirical Mode Decomposition (I) |
|
Dong, Ran | Chukyo University |
Qiong, Chang | Institute of Science Tokyo |
Er, Meng Joo | Dalian Maritime University |
Zhong, Junpei | The Hong Kong Polytechnic University |
Ikuno, Soichiro | Tokyo University of Technology |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Motion Control, Deep Learning Methods
Abstract: Robotic imitation faces challenges due to the lack of nuanced movements when employing keyframe methods, which can potentially lead to the uncanny valley effect due to constraints in fitting data within motor speed capacities. This research proposes a keyframeless motion-transferring method for robotic imitation using motion capture data. Initially, we implement motion capture data into the NAO 6 robot, retargeting the joint angles from a hierarchical human body structure to the motor rotations. Second, to biomechanically optimize robotic imitation, we adopt multivariate empirical mode decomposition (MEMD) to decompose and analyze the motion capture data in the frequency domain. Third, we demonstrate that MEMD outperforms the Fourier transform (FT) in motion-capture-based robotic imitation and introduce an optimization algorithm. Finally, we evaluate four types of robotic motion imitations (picking-up, walking, punching, and Bunraku puppet motion) across five implementation methods (original data implementation, Laban keyframe method, FT, convolutional neural network autoencoder, and our method) using both NAO 6 robot sensors and a motion capture system. The results indicated that our proposed keyframeless motion-transferring method outperforms others in applying and controlling complex, nuanced nonlinear motion capture data for robotic imitation, offering a potential approach to studying the uncanny valley issue.
|
|
ThDT27 |
103C |
Flexible Robotics |
Regular Session |
Chair: Ikemoto, Shuhei | Kyushu Institute of Technology |
|
16:40-16:45, Paper ThDT27.1 | |
Hybrid Motion Control of a Fiber-Based Soft Robotic Instrument for Minimally Invasive Surgery |
|
Yang, Ziqi | Imperial College London |
Tian, Libaihe | Imperial College London |
Xiang, Yuchen | Imperial College London |
Posma, Joram M. | Imperial College London |
Temelkuran, Burak | Imperial College London |
Keywords: Flexible Robotics, Motion Control, Machine Learning for Robot Control
Abstract: Minimally Invasive Surgery (MIS) reduces surgical risks and recovery times by enabling precise interventions around lesion sites. However, accessing these sites requires flexible instruments, which often compromise the precision afforded by rigid devices. This study proposes a thermally drawn fiber-based, tendon-driven soft robotic instrument with a two-stage hybrid motion control framework to enhance accuracy. Several learning-based inverse kinematic (IK) models were developed; the LSTM-based model showed the best performance and is used to guide open-loop large-scale motion, while a closed-loop controller refines accuracy via real-time feedback. The system is validated through path-following tasks and a simulated endometrial ablation in a vaginal phantom. Results show that the IK model realizes stable open-loop control with Euclidean error below 2 mm, while hybrid control further reduces errors to below 1 mm. This combination offers a promising MIS solution with high precision in difficult-to-reach surgical sites.
|
|
16:45-16:50, Paper ThDT27.2 | |
Estimating Continuum Robot Shape under External Loading Using Spatiotemporal Neural Networks |
|
Enyi, Wang | Imperial College London |
Deng, Zhen | Fuzhou University |
Pan, Chuanchuan | Fuzhou University |
He, Bingwei | Fuzhou University |
Zhang, Jianwei | University of Hamburg |
Keywords: Flexible Robotics
Abstract: This paper presents a learning-based approach for accurate 3D shape estimation of flexible continuum robots subjected to external loads. The proposed method introduces a spatiotemporal neural network architecture that fuses multi-modal inputs, including current and historical tendon displacement data and RGB images, to generate point clouds representing the robot's deformed configuration. The network integrates a recurrent neural module for temporal feature extraction, an encoding module for spatial feature extraction, and a multi-modal fusion module to combine spatial features extracted from visual data with temporal dependencies from historical actuator inputs. Continuous 3D shape reconstruction is achieved by fitting Bézier curves to the predicted point clouds. Experimental validation demonstrates that our approach achieves high precision, with mean shape estimation errors of 0.08 mm (unloaded) and 0.22 mm (loaded), outperforming state-of-the-art methods in shape sensing for tendon-driven continuum robots (TDCRs). The results validate the efficacy of deep learning-based spatiotemporal data fusion for precise shape estimation under loading conditions.
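As a purely illustrative aside (not taken from the paper), the final curve-fitting step can be pictured as a small least-squares problem: given ordered 3D points predicted along the backbone, a cubic Bézier curve can be fit by solving for its control points under a chord-length parameterization. The following Python sketch makes that step concrete; the degree, parameterization, and synthetic data are assumptions.

# Hedged sketch: fitting a cubic Bezier curve to an ordered 3D point cloud,
# as a stand-in for the paper's curve-fitting step (illustrative only).
import numpy as np

def fit_cubic_bezier(points):
    """Least-squares fit of a cubic Bezier curve to ordered 3D points."""
    pts = np.asarray(points, dtype=float)
    # Chord-length parameterization in [0, 1] (a common, assumed choice).
    d = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(pts, axis=0), axis=1))]
    t = d / d[-1]
    # Bernstein basis matrix for degree 3.
    B = np.stack([(1 - t) ** 3,
                  3 * t * (1 - t) ** 2,
                  3 * t ** 2 * (1 - t),
                  t ** 3], axis=1)
    # Solve B @ C = pts for the four control points C (shape 4 x 3).
    C, *_ = np.linalg.lstsq(B, pts, rcond=None)
    return C

if __name__ == "__main__":
    s = np.linspace(0, 1, 50)
    arc = np.stack([np.sin(s), 1 - np.cos(s), 0.1 * s], axis=1)  # synthetic bent shape
    print("control points:\n", fit_cubic_bezier(arc))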
|
|
16:50-16:55, Paper ThDT27.3 | |
Compliant Tensegrity Robotic Arm with Continuously Adjustable Stiffness for Versatile Operation |
|
Herrmann, David | OTH Regensburg |
Lehmann, Lukas | OTH Regensburg |
Schaeffer, Leon | OTH Regensburg |
Schmitt, Lukas | OTH Regensburg |
Jochum, Manuel | OTH Regensburg |
Weigert, Paul | OTH Regensburg |
Boehm, Valter | OTH Regensburg |
Keywords: Flexible Robotics, Soft Robot Materials and Design, Soft Robot Applications
Abstract: This paper presents a compliant tensegrity robotic arm design that overcomes limitations related to stiffness variation and cascaded actuation. Due to the special design and actuation strategy, the system offers a large workspace using a small number of actuators and system parts. Key features include intrinsic compliance, enhanced stability in various configurations, and a modular tendon-driven actuation system that facilitates continuous stiffness adjustment for adaptive manipulation tasks. The system’s kinematics and actuation strategy are validated experimentally. Results demonstrate an increased workspace and precise control, offering potential applications in dynamic and human-interactive environments.
|
|
16:55-17:00, Paper ThDT27.4 | |
Uncertainty-Aware Motion Planning Based on Stochastic Forward/Inverse Kinematics Models for Tensegrity Manipulators |
|
Yoshimitsu, Yuhei | Kyushu Institute of Technology |
Osa, Takayuki | RIKEN |
Ben Amor, Heni | Arizona State University |
Ikemoto, Shuhei | Kyushu Institute of Technology |
Keywords: Flexible Robotics, Modeling, Control, and Learning for Soft Robots, Tendon/Wire Mechanism
Abstract: Robots whose shape and stiffness are determined by internal forces generally have complex shape-stiffness relationships that depend on their structure. As a result, there are difficulties such as a decrease in shape reproducibility when the robot is not stiff, and a decrease in the range of motion when the robot is stiff. In this study, we propose a motion planning method that balances shape and stiffness by learning forward and inverse kinematics using a stochastic neural network (NN) and using the uncertainty that can be evaluated by the NN. Through experiments using a tensegrity manipulator with 40 actuators and 20 degrees of freedom in bending posture, we verify the validity of the proposed method.
|
|
17:00-17:05, Paper ThDT27.5 | |
DTactive: A Vision-Based Tactile Sensor with Active Surface |
|
Xu, Jikai | Shanghai Qi Zhi Institute |
Wu, Lei | Huazhong University of Science and Technology |
Lin, Changyi | Carnegie Mellon University |
Zhao, Ding | Carnegie Mellon University |
Xu, Huazhe | Tsinghua University |
Keywords: Force and Tactile Sensing, In-Hand Manipulation, Dexterous Manipulation
Abstract: The development of vision-based tactile sensors has significantly enhanced robots' perception and manipulation capabilities, especially for tasks requiring contact-rich interactions with objects. In this work, we present DTactive, a novel vision-based tactile sensor with active surfaces. DTactive inherits and modifies the tactile 3D shape reconstruction method of DTact while integrating a mechanical transmission mechanism that facilitates the mobility of its surface. Thanks to this design, the sensor is capable of simultaneously performing tactile perception and in-hand manipulation with surface movement. Leveraging the high-resolution tactile images from the sensor and the magnetic encoder data from the transmission mechanism, we propose a learning-based method to enable precise angular trajectory control during in-hand manipulation. In our experiments, we successfully achieved accurate rolling manipulation within the range of [-180°,180°] on various objects, with the root mean square error between the desired and actual angular trajectories being less than 12° on nine trained objects and less than 19° on three novel objects. The results demonstrate the potential of DTactive for in-hand object manipulation in terms of effectiveness, robustness and precision.
|
|
17:05-17:10, Paper ThDT27.6 | |
A Spatial Position-Based Visual Servoing Obstacle-Avoidable Shape Control Framework for an 11-DOF Hybrid Continuum Robot |
|
Zhu, Puchen | The Chinese University of Hong Kong |
Lai, Wenkai | The Chinese University of Hong Kong |
Ma, Xin | The Chinese University of Hong Kong |
Wang, Xuchen | The Chinese University of Hong Kong |
Zhou, Jianshu | University of California, Berkeley |
Cheng, Shing Shin | The Chinese University of Hong Kong |
Au, K. W. Samuel | The Chinese University of Hong Kong |
Keywords: Flexible Robotics, Surgical Robotics: Steerable Catheters/Needles, Visual Servoing
Abstract: As an effective closed-loop control method, visual servoing is widely applied to continuum robots. However, existing visual servoing control methods mostly focus on accurate control of the robot's end-effector, with less consideration given to the robot's shape. In this work, a spatial position-based visual servoing obstacle-avoidable shape control framework for an 11-degree-of-freedom (DOF) hybrid continuum robot is proposed. In the control framework, a set of markers representing the shape of the continuum robot are measured and two spatial arcs are used to fit the shape. When controlling the redundant DOFs of the robot, position-based visual servoing shape control combined with obstacle avoidance is formulated as a quadratic programming problem, yielding the optimal solution at each sample time for the joint velocity vector of the 11-DOF hybrid continuum robot. Several experiments are conducted to validate the proposed control framework; the results indicate that the shape control achieves an accuracy of 0.88 mm.
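For readers unfamiliar with the formulation, a single resolved-rate step of position-based servoing with joint-velocity limits can be posed as a small box-constrained least-squares problem, i.e., a simple quadratic program. The Python sketch below illustrates only this generic building block; the Jacobian, error vector, and limits are placeholders, and the paper's actual QP additionally encodes shape fitting and obstacle avoidance.

# Hedged sketch: one position-based servoing step posed as a box-constrained
# least-squares problem (a simple QP). Not the authors' exact formulation.
import numpy as np
from scipy.optimize import lsq_linear

def shape_servo_step(J, err, qdot_max, gain=1.0):
    """Solve min ||J qdot - gain*err||^2 subject to |qdot| <= qdot_max."""
    res = lsq_linear(J, gain * err, bounds=(-qdot_max, qdot_max))
    return res.x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.standard_normal((6, 11))      # task Jacobian for an assumed 11-DOF robot
    err = rng.standard_normal(6) * 0.01   # stacked marker/shape error (m)
    qdot = shape_servo_step(J, err, qdot_max=0.2)
    print(qdot)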
|
|
17:10-17:15, Paper ThDT27.7 | |
TransSoft: The Low-Cost, Adaptable, and Radial Reconfigurable Soft Hand for Diverse Object Grasping |
|
Gu, Yongchong | Fudan University |
Lin, Haitao | Tencent |
Fu, Yanwei | Fudan University |
Keywords: Flexible Robotics, Soft Robot Applications, Perception for Grasping and Manipulation
Abstract: This paper presents TransSoft, a novel soft robotic hand with a reconfigurable design for grasping objects of varying properties. While recent soft robotic hands have improved grasping capabilities, they often struggle with a limited range of manipulable object categories and tasks due to hardware constraints. TransSoft addresses these limitations with a scalable, low-cost, and highly adaptable structure that significantly expands the diversity of graspable objects and executable tasks. Unlike previous designs, TransSoft features a unique kinematic structure that enhances radial reconfigurability, allowing it to adjust grasping strategies dynamically based on object size, shape, and material properties. The hand is cost-effective, built from off-the-shelf components in three hours for just $200. We evaluate TransSoft through extensive real-world grasping experiments and benchmark it against existing soft grippers, demonstrating its superior adaptability and performance. Additionally, we provide a detailed comparison with related soft grippers to highlight TransSoft’s advantages. Supplementary materials, including design details and experiment results, are available on our project website.
|
|
17:15-17:20, Paper ThDT27.8 | |
Prototypes, Mathematical Modeling and Motion Analysis of Heptagonal Passive Rotating Locomotion Robots with Elastic Elements Arranged on Diagonal Lines |
|
Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Komori, Mikito | Japan Advanced Institute of Science and Technology |
Sedoguchi, Taiki | Japan Advanced Institute of Science and Technology |
Tokuda, Isao T. | Ritsumeikan University |
Keywords: Flexible Robotics, Dynamics, Passive Walking
Abstract: The authors have proposed a passive rotating locomotion robot that forms a convex heptagonal body by connecting seven identical linear rigid frames via viscoelastic rotational joints. In our previous study, it was confirmed through both numerical simulations and actual experiments that stable and passive rotating motion on a downhill could be generated. This paper proposes two new models in which the seven rigid frames are used as robust exoskeletons as they are, but the elastic elements attached to the rotating joints are removed and repositioned on the diagonals of the convex heptagon to reproduce the flexibility of the internal tissue. The elastic elements form a star-shaped polygon called a heptagram, which is formed by connecting seven vertices with a single stroke. The seven vertices can be connected in two different ways to form two different heptagram shapes. We report the basic numerical results of the change in the motion characteristics of the two models with respect to the slope angle and elastic modulus. An overview of the prototypes developed and the results of basic experiments are also reported.
|
|
ThDT28 |
104 |
Medical Vision |
Regular Session |
|
16:40-16:45, Paper ThDT28.1 | |
Self-Supervised 3D Reconstruction of Tibia and Fibula from Biplanar X-Rays |
|
Pan, Kai | University of Technology Sydney |
Zhang, Yanhao | Beijing Academy of Artificial Intelligence |
Zhao, Liang | The University of Edinburgh |
Huang, Shoudong | University of Technology, Sydney |
Keywords: Computer Vision for Medical Robotics, Visual Learning
Abstract: With the growing number of patients experiencing knee-related conditions, total knee arthroplasty (TKA) has become a common procedure, where a 3D visualisation of the patient's tissue is essential for preoperative planning. Traditional imaging techniques, such as computed tomography (CT), often expose patients to high levels of radiation or impose significant financial costs. As an alternative, this paper proposes a novel approach that reconstructs a 3D model of the tibia and fibula using only two X-ray images (taken from the coronal and sagittal planes) and a general template, significantly reducing radiation exposure and financial burden. Our algorithm of 3D reconstruction for patient-specific anatomies combines point-based deformation with deep learning techniques. Initially, the general model undergoes a preliminary deformation to match the patient tibia and fibula dimensions. This pre-deformed model then serves as a template, followed by a fine deformation process via a self-supervised graph convolutional network (GCN), whose parameters are trained iteratively by comparing the template projection and the X-ray measurements. Following tests in simulations, cadaver experiments, and in-vivo experiments, our proposed algorithm demonstrates state-of-the-art accuracy and exceptional robustness across different evaluation metrics. Our code is available at https://github.com/DrKaiPan/tfDeform_GCN.git
|
|
16:45-16:50, Paper ThDT28.2 | |
Towards Autonomous Robotic Electrosurgery Via Thermal Imaging |
|
Riaziat, Naveed Dennis | Johns Hopkins University |
Chen, Joseph | Johns Hopkins University |
Krieger, Axel | Johns Hopkins University |
Brown, Jeremy DeLaine | Johns Hopkins University |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems, Sensor-based Control
Abstract: Electrosurgery is a surgical technique that can improve tissue cutting by reducing cutting force and bleeding. However, electrosurgery adds a risk of thermal injury to surrounding tissue. Expert surgeons estimate desirable cutting velocities based on experience but have no quantifiable reference to indicate if a particular velocity is optimal. Furthermore, prior demonstrations of autonomous electrosurgery have primarily used constant tool velocity, which is not robust to changes in electrosurgical tissue characteristics, power settings, or tool type. Thermal imaging feedback provides information that can be used to reduce thermal injury while balancing cutting force by controlling tool velocity. We introduce Thermography for Electrosurgical Rate Modulation via Optimization (ThERMO) to autonomously reduce thermal injury while balancing cutting force by intelligently controlling tool velocity. We demonstrate ThERMO in tissue phantoms and compare its performance to the constant velocity approach. Overall, ThERMO improves cut success rate by a factor of three and can reduce peak cutting force by a factor of two. ThERMO responds to varying environmental disturbances, reduces damage to tissue, and completes cutting tasks that would otherwise result in catastrophic failure for the constant velocity approach.
|
|
16:50-16:55, Paper ThDT28.3 | |
DRTT : A Diffusion-Based Framework for 4DCT Generation, Robust Thoracic Registration and Tumor Deformation Tracking |
|
Li, Dongyuan | Shanghai Jiao Tong University |
Shan, Yixin | Shanghai Jiao Tong University |
Mao, Yuxuan | Shanghai Jiao Tong University |
Shi, Haochen | Shanghai Jiao Tong University |
Huang, Shenghao | Tongji University |
Sun, Weiyan | Tongji University |
Chen, Chang | Tongji University |
Chen, Xiaojun | Shanghai Jiao Tong University |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems
Abstract: In minimally invasive robotic thoracic surgery, the unavoidable respiratory motion of the patient causes lung lesions to move and deform, making precise tumor localization a significant challenge for surgeons. To address this, we introduce an RDDM (Recursive Deformable Diffusion Model)-based framework designed for real-time intraoperative tumor tracking, which can be used for registration and navigation in robot-assisted thoracic surgery. The RDDM reduces training complexity and enhances dataset utilization by employing a simplified DDM (Diffusion Deformable Model) iteratively, significantly lowering computational demands while maximizing the extraction of valuable information from limited 4D-CT (four-dimensional computed tomography) datasets. Considering the robustness required for intraoperative registration and navigation, we incorporate an ICP (Iterative Closest Point)-based point cloud registration method into the framework and validate our approach using publicly available datasets and volunteer trials. This innovation has the potential to reduce radiation exposure, trauma, and the risk of complications for patients undergoing minimally invasive thoracic surgery, and enables downstream tasks such as RAPNB (robot-assisted percutaneous needle biopsy) and radiation therapy.
|
|
16:55-17:00, Paper ThDT28.4 | |
From Monocular Vision to Autonomous Action: Guiding Tumor Resection Via 3D Reconstruction |
|
Acar, Ayberk | Vanderbilt University |
Smith, Mariana | Vanderbilt University |
Al-Zogbi, Lidia | Johns Hopkins University |
Watts, Tanner | University of Utah |
Li, Fangjie | Vanderbilt University |
Li, Hao | Vanderbilt University |
Yilmaz, Nural | Marmara University |
Scheikl, Paul Maria | None |
d'Almeida, Jesse | Vanderbilt University |
Sharma, Susheela | Vanderbilt University |
Branscombe, Lauren | Virtuoso Surgical |
Ertop, Tayfun Efe | Vanderbilt University |
Webster III, Robert James | Vanderbilt University |
Oguz, Ipek | Vanderbilt University |
Kuntz, Alan | University of Utah |
Krieger, Axel | Johns Hopkins University |
Wu, Jie Ying | Vanderbilt University |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems
Abstract: Surgical automation requires precise guidance and understanding of the scene. Current methods in the literature rely on bulky depth cameras to create maps of the anatomy; however, this does not translate well to space-limited clinical applications. Monocular cameras are small and allow minimally invasive surgeries in tight spaces, but additional processing is required to generate 3D scene understanding. We propose a 3D mapping pipeline that uses only RGB images to create segmented point clouds of the target anatomy. To ensure the most accurate reconstruction, we compare different structure from motion algorithms' performance on mapping the central airway obstructions, and test the pipeline on a downstream task of tumor resection. In several metrics, including post-procedure percentage tissue charring, our pipeline performs comparably to RGB-D cameras and, in some cases, even surpasses their downstream task performance. These promising results demonstrate that automation guidance can be achieved in minimally invasive procedures with monocular cameras. This study is a step toward the complete autonomy of surgical robots.
|
|
17:00-17:05, Paper ThDT28.5 | |
Video-Rate 4D OCT Segmentation Based on Motion-Aware Probabilistic A-Scan Sampling |
|
Dehghani, Shervin | TUM |
Sommersperger, Michael | Technical University of Munich |
Navab, Nassir | TU Munich |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems, Surgical Robotics: Planning
Abstract: Recent advancements in robotic eye surgery and intraoperative 4D optical coherence tomography (iOCT) imaging could enable fully or partially autonomous robotic procedures and enhanced surgical visualization. A fundamental requirement for such applications is rapid semantic segmentation of intraoperative 4D OCT data, which is capable of acquiring volumes at video rate, to provide real-time three-dimensional scene perception. Significant advancements have been made in learning-based 2D and 3D OCT segmentation techniques, pushing the boundaries of accuracy and performance. However, despite these achievements, the computational demands of 2D and 3D convolutions make real-time intraoperative processing of 4D OCT infeasible, even with substantial computational resources. This work introduces a novel real-time iOCT volume segmentation methodology. The novelty consists of a dynamic motion-aware A-scan sampling strategy, followed by an efficient segmentation approach, guaranteeing both speed and accuracy of segmentation. Our A-scan-based processing network leverages a 1D convolution approach to resolve the complexities of multi-dimensional kernels and allow for maximum parallelization, resulting in significantly faster performance. We further show that OCT volume segmentation can be reconstructed from a sparse A-scan sampling strategy that prioritizes areas in which inter-volume motion was detected, and that even missing anatomical surface information below the surgical tools can be reconstructed. Our results show high segmentation performance in dynamic surgical environments and video-rate segmentation performance meeting the demanding processing requirements of 4D OCT and leading to substantial speed improvements over previous methods.
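To illustrate why per-A-scan processing is cheap, the sketch below builds a tiny segmentation network that uses only 1D convolutions along the depth axis, so each sampled A-scan is processed independently and in parallel. The layer sizes, class count, and input length are assumptions, not the authors' architecture.

# Hedged sketch: per-A-scan segmentation with 1D convolutions only; layer sizes
# and class count are assumptions, not the authors' network.
import torch
import torch.nn as nn

class AScanSegNet(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, 7, padding=3), nn.ReLU(),
            nn.Conv1d(32, n_classes, 1),          # per-depth-sample class logits
        )

    def forward(self, x):                         # x: (batch, 1, depth_samples)
        return self.body(x)

net = AScanSegNet()
ascans = torch.randn(64, 1, 512)                  # a sparse batch of sampled A-scans
logits = net(ascans)
print(logits.shape)                               # (64, 3, 512)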
|
|
17:05-17:10, Paper ThDT28.6 | |
Tracking-Aware Deformation Field Estimation for Non-Rigid 3D Reconstruction in Robotic Surgeries |
|
Wang, Zeqing | Shanghai Jiao Tong University |
Fang, Han | Shanghai Jiao Tong University |
Xu, Yihong | Valeo.ai |
Ban, Yutong | Shanghai Jiao Tong University |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy
Abstract: Minimally invasive procedures have advanced rapidly with robotic laparoscopic surgery, which greatly assists surgeons in performing sophisticated and precise operations with reduced invasiveness. Nevertheless, it remains safety-critical to be aware of even the slightest tissue deformation during instrument-tissue interactions, especially in 3D space. To address this, recent works rely on NeRF to render 2D videos from different perspectives and eliminate occlusions. However, most of these methods fail to robustly predict accurate 3D shapes and the associated deformation estimates. In contrast, we propose the Tracking-Aware Deformation Field (TADF), a novel framework which reconstructs the 3D mesh and the 3D tissue deformation simultaneously. It first tracks the key points of soft tissue with a foundation vision model, providing an accurate 2D deformation field. Then, the 2D deformation field is smoothly incorporated into a neural implicit reconstruction network to obtain tissue deformation in 3D space. Finally, we experimentally demonstrate that the proposed method provides more accurate deformation estimation than other 3D neural reconstruction methods on two public datasets. The code will be publicly available after paper acceptance.
|
|
17:10-17:15, Paper ThDT28.7 | |
High-Precision Pose Estimation of Medical Targets Using a Distortion Compensation Model for Robotic Surgical Navigation |
|
Kong, Weifeng | Hohai University |
Tan, Zhiying | Hohai University |
Xue, You | Hohai University |
Wang, Yimin | The First People’s Hospital of Changzhou |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems
Abstract: Medical tracking is a significant issue in vision-based robotic-assisted surgical navigation, especially for distal locking of intramedullary nails. Existing solutions face limitations such as high manufacturing costs for targets, complex tracking schemes, and low positioning precision. This paper proposes a novel method to estimate the pose of medical targets through pre-calibration and the Perspective-n-Point (PnP) algorithm, which determines the position of the distal intramedullary nail hole and projects this position onto the monitor. The precision of medical target positioning is highly affected by the distortion coefficients of the camera's internal parameters. To address this, we design and construct a distortion compensation model to reduce their impact on positioning precision. Additionally, to mitigate the effect of illumination variations, automatic exposure of polarized vision is utilized. Through 50 reprojection experiments, the proposed distortion model achieves a positioning precision of 0.284 mm at a working distance of one meter, significantly outperforming the 0.4 mm precision of the division model and the 0.426 mm precision of the polynomial model, with relative improvements of 30% and 34.2%, respectively. This method enhances the accuracy and reliability of robot-assisted surgical navigation, facilitating more precise and efficient surgical procedures.
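As a generic illustration of the pose-estimation step (not the paper's calibrated setup), the OpenCV snippet below estimates a target pose with solvePnP while compensating lens distortion through the distortion coefficients, then reports the mean reprojection error. The marker layout, intrinsics, and distortion values are invented for the example.

# Hedged sketch: PnP pose estimation with lens-distortion compensation via OpenCV.
# Marker geometry, intrinsics, and distortion coefficients are illustrative only.
import numpy as np
import cv2

obj_pts = np.array([[-0.05, -0.05, 0], [0.05, -0.05, 0],
                    [0.05, 0.05, 0], [-0.05, 0.05, 0]], dtype=np.float64)  # 10 cm square target (m)
K = np.array([[900.0, 0, 640], [0, 900.0, 360], [0, 0, 1]])                # assumed pinhole intrinsics
dist = np.array([-0.12, 0.03, 0.0005, -0.0003, 0.0])                       # k1 k2 p1 p2 k3 (example values)

img_pts = np.array([[600, 330], [690, 332], [688, 420], [598, 418]], dtype=np.float64)  # detected corners (px)

ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, dist)
proj, _ = cv2.projectPoints(obj_pts, rvec, tvec, K, dist)
reproj_err = np.linalg.norm(proj.reshape(-1, 2) - img_pts, axis=1).mean()
print("t (m):", tvec.ravel(), " mean reprojection error (px):", reproj_err)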
|
|
ThDT29 |
105 |
Manipulation Planning |
Regular Session |
Chair: Calinon, Sylvain | Idiap Research Institute |
|
16:40-16:45, Paper ThDT29.1 | |
Understanding the Impact of Modeling Abstractions on Motion Planning for Deformable Linear Objects |
|
Envall, Jimmy | ETH Zurich |
Thomaszewski, Bernhard | Université De Montréal |
Coros, Stelian | ETH Zurich |
Keywords: Manipulation Planning, Motion and Path Planning, Simulation and Animation
Abstract: Robotic manipulation of deformable objects remains challenging due to the high-dimensional configuration space and complex dynamics. In this work, we demonstrate how the abstraction level used for modeling deformable objects can significantly impact the difficulty of the motion planning problem. We specifically focus on buckling — a nonlinear instability phenomenon that arises in response to compression of slender deformable objects. Using deformable linear objects (DLOs) as a case study, we show that eliminating resistance to compression in the simulation model while penalizing compressed states in the planning objective increases both robustness and performance. We demonstrate our approach on a set of simulation examples and validate our results through physical robot experiments.
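A minimal way to picture "penalizing compressed states in the planning objective" is a per-segment penalty on a polyline model of the DLO that activates only when a segment is shorter than its rest length. The sketch below shows one such assumed penalty form; it is not the authors' objective.

# Hedged sketch: a per-segment compression penalty for a polyline model of a
# deformable linear object; compressed states are penalized in the objective
# instead of being resisted in simulation (illustrative form only).
import numpy as np

def compression_penalty(nodes, rest_len, weight=1.0):
    seg = np.diff(nodes, axis=0)
    lengths = np.linalg.norm(seg, axis=1)
    shortening = np.maximum(rest_len - lengths, 0.0)   # only compression, not stretch
    return weight * np.sum(shortening ** 2)

nodes = np.array([[0, 0, 0], [0.09, 0, 0], [0.2, 0, 0], [0.3, 0, 0]], dtype=float)
print(compression_penalty(nodes, rest_len=0.1))        # small penalty from the first segment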
|
|
16:45-16:50, Paper ThDT29.2 | |
Efficient and Real-Time Motion Planning for Robotics Using Projection-Based Optimization |
|
Chi, Xuemin | Zhejiang University |
Girgin, Hakan | Swiss Cobotics Competence Center |
Löw, Tobias | Idiap Research Institute, EPFL |
Xie, Yangyang | Zhejiang University |
Xue, Teng | Idiap Research Institute and EPFL |
Huang, Jihao | Zhejiang University |
Hu, Cheng | Zhejiang University |
Liu, Zhitao | Zhejiang University |
Calinon, Sylvain | Idiap Research Institute |
Keywords: Manipulation Planning, Task and Motion Planning, Motion and Path Planning
Abstract: Generating motions for robots interacting with objects of various shapes is a complex challenge, further complicated by the robot’s geometry and multiple desired behaviors. While current robot programming tools (such as inverse kinematics, collision avoidance, and manipulation planning) often treat these problems as constrained optimization, many existing solvers focus on specific problem domains or do not exploit geometric constraints effectively. We propose an efficient first-order method, Augmented Lagrangian Spectral Projected Gradient Descent (ALSPG), which leverages geometric projections via Euclidean projections, Minkowski sums, and basis functions. We show that by using geometric constraints rather than full constraints and gradients, ALSPG significantly improves real-time performance. Compared to second-order methods like iLQR, ALSPG remains competitive in the unconstrained case. We validate our method through toy examples and extensive simulations, and demonstrate its effectiveness on a 7-axis Franka robot, a 6-axis P-Rob robot and a 1:10 scale car in real-world experiments. Source codes, experimental data and videos are available on the project webpage: https://sites.google.com/view/alspg-oc
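The core primitive behind such projection-based solvers is the projected gradient step: take a gradient step on the objective, then apply a cheap Euclidean projection back onto the constraint set. The toy Python sketch below shows this primitive on a ball constraint; it is not the ALSPG implementation, which adds spectral step sizes and augmented Lagrangian handling of general constraints.

# Hedged sketch: projected gradient descent with a Euclidean projection, the basic
# building block behind projection-based solvers (not the authors' ALSPG code).
import numpy as np

def project_ball(x, center, radius):
    """Euclidean projection of x onto a ball, an easy-to-project constraint set."""
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

def projected_gradient(grad, project, x0, step=0.1, iters=200):
    x = x0.copy()
    for _ in range(iters):
        x = project(x - step * grad(x))
    return x

if __name__ == "__main__":
    target = np.array([2.0, 1.0])
    grad = lambda x: x - target                        # gradient of 0.5*||x - target||^2
    proj = lambda x: project_ball(x, np.zeros(2), 1.0)
    x_star = projected_gradient(grad, proj, np.zeros(2))
    print(x_star)                                      # lands on the unit circle toward the target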
|
|
16:50-16:55, Paper ThDT29.3 | |
Semantic-Geometric-Physical-Driven Robot Manipulation Skill Transfer Via Skill Library and Tactile Representation |
|
Qi, Mingchao | Northwestern Polytechnical University |
Li, Yuanjin | Northwestern Polytechnical University |
Liu, Xing | Northwestern Polytechnical University |
Liu, Zhengxiong | Northwestern Polytechnical University |
Huang, Panfeng | Northwestern Polytechnical University |
Keywords: Manipulation Planning, Transfer Learning, Force and Tactile Sensing
Abstract: Developing general robotic systems capable of manipulation in unstructured environments is a significant challenge, particularly as the tasks involved are typically long-horizon and contact-rich, requiring efficient skill transfer across different task scenarios. To address these challenges, we propose a knowledge graph-based skill library construction method. This method hierarchically organizes manipulation knowledge using a “task graph” and a “scene graph” to represent task-specific and scene-specific information, respectively. Additionally, we introduce a “state graph” to facilitate the interaction between high-level task planning and low-level scene information. Building upon this foundation, we further propose a novel hierarchical skill transfer framework based on the skill library and tactile representation, which integrates high-level reasoning for skill transfer and low-level precision for execution. At the task level, we utilize large language models (LLMs) and combine contextual learning with a four-stage chain-of-thought prompting paradigm to achieve subtask sequence transfer. At the motion level, we develop an adaptive trajectory transfer method based on the skill library and a heuristic path planning algorithm. At the physical level, we propose an adaptive contour extraction and posture perception method based on tactile representation. This method dynamically acquires high-precision contour and posture information from visual-tactile images, adjusting parameters such as contact position and posture to ensure the effectiveness of transferred skills in new environments. Experiments demonstrate the skill transfer and adaptability capabilities of the proposed methods across different task scenarios.
|
|
16:55-17:00, Paper ThDT29.4 | |
Throwing Planning Diffusion: A Solution to Learning and Planning of Robotic Throwing |
|
Xu, Ziqi | Xidian University |
Li, Haodu | Xidian University |
Liu, Lihao | Xidian University |
Liu, Jun | Xidian University |
Duan, Xuechao | Xidian University |
Keywords: Manipulation Planning, Learning from Experience, Task and Motion Planning
Abstract: Dynamic manipulation enables efficient interaction tasks, such as throwing, which rely on finding one or more high-quality trajectories from the initial state to the goal state. While model-free learning methods have been used to acquire efficient robot manipulation configurations, traditional planning algorithms often struggle with multi-task specifications and high-dimensional, multi-modal trajectory data. Prior generative model-based approaches have made significant progress in the field of motion planning. Diffusion models, as an emerging class of generative models, have been widely applied to planning tasks in various environments and have gained attention for their ability to encode multidimensional and multimodal trajectories. Here we propose a method that combines diffusion models with model-free throwing methods. Specifically, we use a backward reachable tube to search for throwing configurations, and sample from the posterior trajectory distribution conditioned on these configurations. Several trajectory optimization methods are used to ensure the generation of effective throwing trajectories. Experimental results show that our method is effective in generating feasible, smooth, and collision-free throwing trajectories in both simulated and real-world tasks. Additionally, different trajectories are provided to enhance the multimodality of the throwing task.
|
|
17:00-17:05, Paper ThDT29.5 | |
Trajectory Optimization for In-Hand Manipulation with Tactile Force Control |
|
Lee, Haegu | University of Southern Denmark |
Kim, Yitaek | University of Southern Denmark |
Staven, Victor Melbye | University of Southern Denmark |
Sloth, Christoffer | University of Southern Denmark |
Keywords: Manipulation Planning, In-Hand Manipulation, Multifingered Hands
Abstract: The strength of the human hand lies in its ability to manipulate objects precisely and robustly. In contrast, simple robotic grippers have low dexterity and fail to handle small objects effectively. This is why many automation tasks remain unsolved by robots. This paper presents an optimization-based framework for in-hand manipulation with a robotic hand equipped with compact Magnetic Tactile Sensors (MTSs). We formulate a trajectory optimization problem using Nonlinear Programming (NLP) for finger movements while ensuring that the contact points change along the geometry of the fingers. Using the optimized trajectory from the solver, we implement and test an open-loop controller for rolling motion. To further enhance robustness and accuracy, we introduce a force controller for the fingers and a state estimator for the object utilizing MTSs. The proposed framework is validated through comparative experiments, showing that incorporating force control with compliance consideration improves the accuracy and robustness of the rolling motion. Rolling an object with the force controller is 30% more likely to succeed than running an open-loop controller. The demonstration video is available at https://youtu.be/6J_muL_AyE8.
|
|
17:05-17:10, Paper ThDT29.6 | |
Complex Robotic Manipulation Via Hindsight Goal Diffusion and Graph-Based Experience Replay |
|
Sun, Zihao | Shandong University |
Li, Zihan | Shandong University |
He, Jinrui | Shandong University |
Song, Yong | Shandong University |
Liu, Pingping | Shandong University |
Xu, Qingyang | Shandong University |
Yuan, Xianfeng | Shandong University |
Song, Rui | Shandong University |
Keywords: Manipulation Planning, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Goal-conditioned reinforcement learning (GCRL) is an effective method for multi-goal robotic manipulation tasks. Many studies based on hindsight experience replay (HER) and hindsight goal generation (HGG) have achieved the autonomous acquisition of robotic manipulation in reward-sparse environments and have greatly improved the learning efficiency of GCRL. However, these methods perform poorly in environments with obstacles and distant goals. In this paper, we propose hindsight goal diffusion and graph-based experience replay (HGD-GER) for complex robotic manipulation. First, obstacle-avoiding graphs are constructed in environments with obstacles, and a graph-based distance metric between different goals is established. Second, the proposed HGD approach utilizes the inherent denoising mechanism of diffusion models and the obstacle-avoiding graph-based distance to generate exploration goals, thereby promoting the exploration of obstacle-bypassing areas. Then, the GER module modifies the reward value during experience replay using the graph-based distance, thereby avoiding the bias introduced by HER and improving the learning performance of the RL algorithm under sparse reward conditions. Finally, we conducted experiments on three robotic manipulation tasks with obstacles and distant goals, and the results show that the proposed HGD-GER achieves excellent learning performance. Additionally, the proposed method is deployed on a physical robot.
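To make the graph-based distance concrete, the sketch below computes shortest-path distances on a small obstacle grid with Dijkstra's algorithm and uses them to relabel a sparse reward, in the spirit of replacing the Euclidean metric in hindsight relabelling. The grid, threshold, and reward convention are assumptions for illustration only.

# Hedged sketch: obstacle-aware graph distance between goals on a small grid,
# used to relabel a sparse reward (illustrative; not the HGD-GER implementation).
import heapq

def dijkstra(grid, start):
    """Shortest path lengths on a 4-connected grid; grid[y][x] == 1 marks an obstacle."""
    h, w = len(grid), len(grid[0])
    dist = {start: 0}
    pq = [(0, start)]
    while pq:
        d, (x, y) = heapq.heappop(pq)
        if d > dist[(x, y)]:
            continue
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and grid[ny][nx] == 0:
                nd = d + 1
                if nd < dist.get((nx, ny), float("inf")):
                    dist[(nx, ny)] = nd
                    heapq.heappush(pq, (nd, (nx, ny)))
    return dist

grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
d_from_goal = dijkstra(grid, (3, 0))        # distances to the desired goal at (x=3, y=0)
achieved = (0, 2)                           # achieved goal after relabelling
graph_dist = d_from_goal.get(achieved, float("inf"))
reward = 0.0 if graph_dist <= 1 else -1.0   # assumed sparse-reward convention
print(graph_dist, reward)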
|
|
17:10-17:15, Paper ThDT29.7 | |
Transferring Kinesthetic Demonstrations across Diverse Objects for Manipulation Planning |
|
Das, Dibyendu | Stony Brook University |
Patankar, Aditya | Stony Brook University |
Chakraborty, Nilanjan | Stony Brook University |
Ramakrishnan, C. R. | Stony Brook University |
Ramakrishnan, Iv | Stony Brook University |
Keywords: Manipulation Planning, Learning from Demonstration, Motion and Path Planning
Abstract: Given a demonstration of a complex manipulation task, such as pouring liquid from one container to another, we seek to generate a motion plan for a new task instance involving objects with different geometries. This is nontrivial since we need to simultaneously ensure that the implicit motion constraints are satisfied (glass held upright while moving), that the motion is collision-free, and that the task is successful (e.g., liquid is poured into the target container). We solve this problem by identifying the positions of critical locations and associating a reference frame (called motion transfer frames) on the manipulated object and the target, selected based on their geometries and the task at hand. By tracking and transferring the path of the motion transfer frames, we generate motion plans for arbitrary task instances with objects of different geometries and poses. We show results from simulation as well as robot experiments on physical objects to evaluate the effectiveness of our solution.
|
|
17:15-17:20, Paper ThDT29.8 | |
CageCoOpt: Enhancing Manipulation Robustness through Caging-Guided Morphology and Policy Co-Optimization |
|
Dong, Yifei | KTH |
Han, Shaohang | KTH, Royal Institute of Technology |
Cheng, Xianyi | Duke University |
Friedl, Werner | German AerospaceCenter (DLR) |
Cabral Muchacho, Rafael Ignacio | KTH Royal Institute of Technology |
Roa, Maximo A. | German Aerospace Center (DLR) |
Tumova, Jana | KTH Royal Institute of Technology |
Pokorny, Florian T. | KTH Royal Institute of Technology |
Keywords: Manipulation Planning, Grasping, Dexterous Manipulation
Abstract: Uncertainties in contact dynamics and object geometry remain significant barriers to robust robotic manipulation. Caging helps mitigate these uncertainties by constraining an object's mobility without requiring precise contact modeling. Existing caging research often treats morphology and policy optimization as separate problems, overlooking their synergy. In this paper, we introduce CageCoOpt, a hierarchical framework that jointly optimizes manipulator morphology and control policy for robust caging-based manipulation. The framework employs reinforcement learning for policy optimization at the lower level and multi-task Bayesian optimization for morphology optimization at the upper level. We incorporate a caging metric into both optimization levels to encourage caging configurations and thereby improve manipulation robustness. The evaluation consists of four manipulation tasks and demonstrates that co-optimizing morphology and policy improves task performance under uncertainties, establishing caging-guided co-optimization as a viable approach for robust manipulation.
|
|
ThDT30 |
106 |
Embedded Systems for Robotics and Automation |
Regular Session |
|
16:40-16:45, Paper ThDT30.1 | |
FPGA Hardware Neural Control of CartPole and F1TENTH Race Car |
|
Paluch, Marcin | University of Zurich |
Bolli, Florian | University of Zurich |
Deng, Xiang | ETH Zurich |
Rios-Navarro, Antonio | University of Seville |
Gao, Chang | Delft University of Technology |
Delbruck, Tobi | Univ. of Zurich & ETH Zurich |
Keywords: Embedded Systems for Robotic and Automation, Machine Learning for Robot Control, Imitation Learning
Abstract: Latency and computational cost often limit the use of Nonlinear Model Predictive Control (NMPC) in real-time robotics. To address this limitation, our work investigates FPGA-implemented Neural Controllers (NC) trained through supervised learning, mimicking NMPC. We show that inexpensive embedded FPGA hardware is sufficient to implement these neural controllers for high-frequency control of robotic systems. We demonstrate kilohertz control rates for a cartpole and offload control to the FPGA hardware on the F1TENTH race car. The FPGA NC outperforms NMPC on the cartpole, due to the faster control rate afforded by faster NC inference. The code and hardware implementation for this paper are available at https://github.com/SensorsINI/Neural-Control-Tools.
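The supervised "mimic NMPC" idea can be sketched in a few lines: collect (state, NMPC action) pairs and regress a small MLP onto them, which can later be quantized for FPGA deployment. The PyTorch example below uses synthetic labels in place of a real NMPC solver and is not the authors' training pipeline.

# Hedged sketch: supervised NMPC mimicking. Fit a small MLP to (state, action)
# pairs; here the labels are synthetic stand-ins for NMPC outputs.
import torch
import torch.nn as nn

states = torch.randn(4096, 4)                  # e.g. cartpole state [x, xdot, theta, thetadot]
actions = torch.tanh(states @ torch.tensor([[0.5], [1.0], [8.0], [1.5]]))  # stand-in "NMPC" labels

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                    nn.Linear(32, 32), nn.ReLU(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(200):
    loss = nn.functional.mse_loss(net(states), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final imitation MSE:", loss.item())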
|
|
16:45-16:50, Paper ThDT30.2 | |
QLIO: Quantized LiDAR-Inertial Odometry |
|
Lou, Boyang | Beijing University of Posts and Telecommunications |
Yuan, Shenghai | Nanyang Technological University |
Yang, Jianfei | Nanyang Technological University |
Su, Wenju | Beijing University of Posts and Telecommunications |
Zhang, Yingjian | Beijing University of Posts and Telecommunications |
Hu, Enwen | Beijing University of Posts and Telecommunications |
Keywords: Embedded Systems for Robotic and Automation, Sensor-based Control, Distributed Robot Systems
Abstract: LiDAR-Inertial Odometry (LIO) is widely used for autonomous navigation, but its deployment on Size, Weight, and Power (SWaP)-constrained platforms remains challenging due to the computational cost of processing dense point clouds. Conventional LIO frameworks rely on a single onboard processor, leading to computational bottlenecks and high memory demands, making real-time execution difficult on embedded systems. To address this, we propose QLIO, a multi-processor distributed quantized LIO framework that reduces computational load and bandwidth consumption while maintaining localization accuracy. QLIO introduces a quantized state estimation pipeline, where a co-processor pre-processes LiDAR measurements, compressing point-to-plane residuals before transmitting only essential features to the host processor. Additionally, an rQ-vector-based adaptive resampling strategy intelligently selects and compresses key observations, further reducing computational redundancy. Real-world evaluations demonstrate that QLIO achieves a 14.1× reduction in per-observation residual data while preserving localization accuracy. Furthermore, we release an open-source implementation to facilitate further research and real-world deployment. These results establish QLIO as an efficient and scalable solution for real-time autonomous systems operating under computational and bandwidth constraints.
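As a rough illustration of the quantization idea (not QLIO's actual encoding), the sketch below uniformly quantizes point-to-plane residuals to 8-bit codes before transmission and measures the worst-case reconstruction error after dequantization; the clipping range and bit width are assumptions.

# Hedged sketch: uniform quantization of residuals before transmission to the host,
# a simple stand-in for the compression idea described above.
import numpy as np

def quantize(residuals, bits=8, r_max=0.5):
    levels = 2 ** bits - 1
    clipped = np.clip(residuals, -r_max, r_max)
    return np.round((clipped + r_max) / (2 * r_max) * levels).astype(np.uint16)

def dequantize(codes, bits=8, r_max=0.5):
    levels = 2 ** bits - 1
    return codes.astype(float) / levels * (2 * r_max) - r_max

r = np.random.default_rng(1).normal(0, 0.05, 1000)          # synthetic residuals (m)
err = np.abs(dequantize(quantize(r)) - np.clip(r, -0.5, 0.5)).max()
print("max quantization error (m):", err)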
|
|
16:50-16:55, Paper ThDT30.3 | |
Automatic Real-To-Sim-To-Real System through Iterative Interactions for Robust Robot Manipulation Policy Learning with Unseen Objects |
|
Kang, Minjae | Seoul National University (SNU) |
Kee, Hogun | Seoul National University |
Lee, Ho Sung | Hanyang Univ |
Oh, Songhwai | Seoul National University |
Keywords: Embedded Systems for Robotic and Automation, Machine Learning for Robot Control, Deep Learning in Grasping and Manipulation
Abstract: Real-to-sim-to-real systems have been studied to overcome the challenges of robot policy learning in the real world by creating a virtual environment that mimics the actual workspace. However, previous studies have limitations, requiring human assistance such as observing the workspace with a hand-held camera or manipulating objects by hand. To address these limitations, we propose ARIC, a novel real-to-sim-to-real framework that operates without human help. First, ARIC observes real objects by repeatedly changing the object poses using a robot policy pre-trained via reinforcement learning. Through iterative interactions between the robot and the environment, ARIC gradually improves the accuracy of 3D object reconstruction. Next, ARIC learns task-specific robot policies in simulation using the replicated objects and applies the policies to real-world scenarios without fine-tuning. We confirm that ARIC efficiently learns robotic tasks, achieving an average success rate of 83.3% on three real-world tasks.
|
|
16:55-17:00, Paper ThDT30.4 | |
Embodied Instruction Following in Unknown Environments |
|
Wu, Zhenyu | Beijing University of Posts and Telecommunications |
Wang, Ziwei | Nanyang Technological University |
Xu, Xiuwei | Tsinghua University |
Yin, Hang | Tsinghua University |
Liang, Yinan | Tsinghua University |
Ma, Angyuan | Tsinghua University |
Lu, Jiwen | Tsinghua University |
Yan, Haibin | Beijing University of Posts and Telecommunications |
Keywords: Embodied Cognitive Science
Abstract: Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in the known environment where all interactive objects are provided to the embodied agent, and directly deploying existing approaches in unknown environments usually generates infeasible plans that manipulate non-existing objects. In contrast, we propose an embodied instruction following (EIF) method for complex tasks in unknown environments, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including a high-level task planner and a low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to demonstrate the known visual clues, where the goals of task planning and scene exploration are aligned for the human instruction. For the task planner, we generate feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method achieves a 45.09% success rate on 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes. Code and supplementary material are available at https://gary3410.github.io/eif_unknown/.
|
|
17:05-17:10, Paper ThDT30.6 | |
L-SNI: A Language-Driven Semantic Navigation System for Inspection Tasks |
|
Ma, Jiawang | Shenzhen Technology University |
Guo, Weichen | Shenzhen Technology University |
Wu, Xuan | Shenzhen Technology University |
Zhuang, Zinan | Shenzhen Technology University |
Zeng, Rongxiang | Shenzhen Technology University |
Shi, Yongliang | Qiyuan Lab |
Ma, Gan | Shenzhen Technology University |
Keywords: Embodied Cognitive Science, Mapping, Task and Motion Planning
Abstract: For inspection robots to achieve generalizability, stability, and ease of use, it is crucial that they understand natural language commands and navigate accurately to specified target objects. We propose L-SNI, a semantic navigation system adapted for inspection tasks, offering generalizability, robust stability, and practical ease of use. In the perception phase, L-SNI constructs a precise geometric depth map of the environment using LiDAR, while RGB images are employed to extract object categories, which are then combined with depth data to generate a semantic map. To enable the large language model (LLM) to interpret the environment, L-SNI encodes the 3D semantic map into a plain text representation. During single-task execution, L-SNI decodes human commands into inspection primitives using an LLM constrained by system initial prompts. These inspection primitives guide the robot's low-level planner for task execution. To address the challenge of traditional 3D LiDAR localization and navigation systems in accurately positioning the robot around target objects during inspection tasks, we propose a target cost gradient to assist in optimizing the robot's target point selection and attitude control in maps with semantic information. Upon reaching the target, L-SNI uses a visual language model (VLM) to describe the scene, which is simplified by the LLM into a user-friendly response. Through testing on 18 indoor scenes from the Matterport 3D dataset, L-SNI achieves a 46.9% improvement in Success Rate (SR) and a 58.3% increase in Success weighted by Path Length (SPL) over existing state-of-the-art (SOTA) solutions, while also demonstrating superior target image understanding. Moreover, it can be easily deployed on real-world robots without complex initialization.
|
| |